Introduction
Score-based generative models provide an alternative perspective on diffusion that focuses on the gradient of the log-probability density rather than the density itself. This viewpoint, developed primarily by Yang Song and collaborators, offers elegant theoretical foundations and powerful sampling techniques that unify various generative modeling approaches.
Understanding score-based models reveals why denoising diffusion works and opens pathways to improved training objectives, more efficient samplers, and connections to other areas of machine learning. This section develops the score-based perspective from first principles, showing how it naturally leads to diffusion models.
The Score Function
The score function is the central object in score-based generative modeling. Rather than modeling the probability density directly (which requires knowing an intractable normalization constant), we model its gradient.
Gradient of Log-Density
The score function is defined as the gradient of the log-density with respect to the data:
Score Function Definition: $s(x) = \nabla_x \log p(x)$. The score tells us the direction and magnitude of the steepest increase in log-probability at any point $x$.
This definition has a crucial advantage: the normalization constant vanishes when we take the gradient. If $p(x) = \tilde{p}(x) / Z$, where $Z$ is the normalizing constant, then:
```python
# Mathematical derivation showing normalization cancels
"""
p(x) = (1/Z) * p_tilde(x)

log p(x) = log p_tilde(x) - log Z

nabla_x log p(x) = nabla_x log p_tilde(x) - nabla_x log Z
                 = nabla_x log p_tilde(x) - 0
                 = nabla_x log p_tilde(x)

The gradient of log Z is zero because Z doesn't depend on x!
"""

# For a Gaussian: p(x) = (1/sqrt(2*pi*sigma^2)) * exp(-x^2 / (2*sigma^2))
# Score: s(x) = -x / sigma^2

import torch

def gaussian_score(x: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Analytical score for a zero-mean Gaussian distribution."""
    return -x / (sigma ** 2)

# For a mixture of Gaussians (more complex)
def mixture_score(x: torch.Tensor, means: torch.Tensor, weights: torch.Tensor, sigma: float) -> torch.Tensor:
    """Score for a Gaussian mixture: responsibility-weighted component scores."""
    # Component log-densities (shared constants cancel inside the softmax)
    diffs = x.unsqueeze(-1) - means  # [..., K]
    log_probs = -0.5 * (diffs ** 2) / (sigma ** 2)
    probs = torch.softmax(log_probs + torch.log(weights), dim=-1)

    # Weighted sum of component scores
    component_scores = -diffs / (sigma ** 2)
    return (probs * component_scores).sum(dim=-1)
```

Geometric Intuition
The score function has a beautiful geometric interpretation: at every point in space, it points toward regions of higher probability. Following the score leads to the modes of the distribution:
| Property | Score Function Behavior | Interpretation |
|---|---|---|
| At mode | s(x) = 0 | Gradient vanishes at maximum |
| Near mode | Points toward mode | Gradient descent reaches mode |
| Low density region | Large magnitude | Strong push toward high density |
| Between modes | Points to nearest mode | Basins of attraction |
For a multivariate distribution, the score is a vector field. Visualizing this field reveals the structure of the distribution: modes appear as sinks (where vectors converge), and saddle points create separations between basins.
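The mode-seeking behavior in the table is easy to verify numerically: ascending the score of a standard Gaussian by plain gradient steps drives the iterate to the mode at the origin. A minimal sketch using the analytical Gaussian score:

```python
import torch

def gaussian_score(x: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Analytical score of a zero-mean Gaussian: s(x) = -x / sigma^2."""
    return -x / (sigma ** 2)

# Start far from the mode and follow the score (gradient ascent on log p)
x = torch.tensor([2.5, -1.5])
for _ in range(200):
    x = x + 0.1 * gaussian_score(x)

print(x.abs().max().item())  # -> essentially 0: the iterate has reached the mode
```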
```python
import torch
import matplotlib.pyplot as plt
import numpy as np

def visualize_score_field(
    score_fn,
    xlim: tuple = (-3, 3),
    ylim: tuple = (-3, 3),
    resolution: int = 20
):
    """Visualize a 2D score field as a vector plot."""
    # Create grid
    x = torch.linspace(xlim[0], xlim[1], resolution)
    y = torch.linspace(ylim[0], ylim[1], resolution)
    X, Y = torch.meshgrid(x, y, indexing='xy')
    points = torch.stack([X.flatten(), Y.flatten()], dim=-1)

    # Compute scores
    scores = score_fn(points)
    U = scores[:, 0].reshape(resolution, resolution)
    V = scores[:, 1].reshape(resolution, resolution)

    # Plot
    plt.figure(figsize=(8, 8))
    plt.quiver(
        X.numpy(), Y.numpy(),
        U.numpy(), V.numpy(),
        np.sqrt(U.numpy()**2 + V.numpy()**2),  # Color by magnitude
        cmap='viridis'
    )
    plt.colorbar(label='Score magnitude')
    plt.title('Score Field Visualization')
    plt.xlabel('x')
    plt.ylabel('y')
    return plt.gcf()


# Example: Score field for mixture of two Gaussians
def two_mode_score(x: torch.Tensor) -> torch.Tensor:
    """Score for mixture of two 2D Gaussians."""
    mean1 = torch.tensor([-1.0, 0.0])
    mean2 = torch.tensor([1.0, 0.0])
    sigma = 0.5

    # Compute responsibilities (which Gaussian each point belongs to)
    d1 = ((x - mean1) ** 2).sum(dim=-1)
    d2 = ((x - mean2) ** 2).sum(dim=-1)
    log_p1 = -d1 / (2 * sigma ** 2)
    log_p2 = -d2 / (2 * sigma ** 2)

    w1 = torch.softmax(torch.stack([log_p1, log_p2], dim=-1), dim=-1)[..., 0]
    w2 = 1 - w1

    # Weighted combination of scores
    score1 = -(x - mean1) / (sigma ** 2)
    score2 = -(x - mean2) / (sigma ** 2)

    return w1.unsqueeze(-1) * score1 + w2.unsqueeze(-1) * score2
```

Score Matching
Given that the score function is valuable, how do we learn it from data? The naive approach of computing $\nabla_x \log p(x)$ directly fails because we don't know $p(x)$. Score matching provides an elegant solution.
Fisher Divergence
We want our model score to match the true score. A natural objective is the Fisher divergence:
Fisher Divergence: $D_F(\theta) = \frac{1}{2}\,\mathbb{E}_{p(x)}\!\left[\left\| s_\theta(x) - \nabla_x \log p(x) \right\|^2\right]$
This measures the expected squared difference between our model score and the true score. However, it still contains the unknown $\nabla_x \log p(x)$. The key insight is that we can expand and simplify:
```python
"""
Fisher Divergence Expansion:

D_F = (1/2) * E_p[ ||s_theta(x) - nabla log p(x)||^2 ]

    = (1/2) * E_p[ ||s_theta(x)||^2 - 2 * s_theta(x)^T * nabla log p(x) + ||nabla log p(x)||^2 ]

The last term doesn't depend on theta, so for optimization:

J_SM(theta) = E_p[ (1/2) * ||s_theta(x)||^2 + trace(nabla_x s_theta(x)) ]

This is the SCORE MATCHING objective - no ground truth score needed!
The trace term comes from integration by parts (Stein's identity).
"""

import torch
import torch.nn as nn

def score_matching_loss(
    score_model: nn.Module,
    x: torch.Tensor
) -> torch.Tensor:
    """
    Score matching loss with an exact Jacobian trace (Hyvarinen's objective).
    Requires one backward pass per data dimension, which is expensive.
    """
    x = x.requires_grad_(True)

    # Compute score
    score = score_model(x)

    # Compute trace of Jacobian (sum of diagonal elements)
    trace = 0.0
    for i in range(x.shape[-1]):
        grad_i = torch.autograd.grad(
            score[:, i].sum(),
            x,
            create_graph=True
        )[0][:, i]
        trace = trace + grad_i

    # Score matching objective
    loss = 0.5 * (score ** 2).sum(dim=-1) + trace

    return loss.mean()
```

Sliced Score Matching
The score matching objective above requires computing the trace of the Jacobian, which costs one backward pass per data dimension. Sliced score matching avoids this by projecting the score onto random directions:
```python
import torch
import torch.nn as nn

def sliced_score_matching_loss(
    score_model: nn.Module,
    x: torch.Tensor,
    num_projections: int = 1
) -> torch.Tensor:
    """
    Sliced score matching: estimate the trace using random projections.
    Much more efficient than full Jacobian computation.
    """
    x = x.requires_grad_(True)
    score = score_model(x)

    total_loss = 0.0

    for _ in range(num_projections):
        # Random projection vector; v ~ N(0, I) keeps the
        # trace estimate unbiased since E[v v^T] = I
        v = torch.randn_like(x)

        # Compute v^T * score
        score_v = (score * v).sum(dim=-1)

        # Gradient of v^T * score w.r.t. x gives Jacobian^T * v
        grad_score_v = torch.autograd.grad(
            score_v.sum(),
            x,
            create_graph=True
        )[0]

        # v^T * (Jacobian * v) is a Hutchinson-style trace estimate
        jacobian_trace_estimate = (grad_score_v * v).sum(dim=-1)

        # Loss for this projection
        loss = 0.5 * score_v ** 2 + jacobian_trace_estimate
        total_loss = total_loss + loss

    return total_loss.mean() / num_projections
```

Sliced score matching estimates the trace with random projections, reducing the cost from $O(D)$ backward passes to $O(1)$ per projection, where $D$ is the data dimension.
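The unbiasedness of this trace estimator is easy to check numerically: for a linear score $s(x) = Ax$ the Jacobian is $A$, so averaging $v^\top A v$ over Gaussian $v$ should approach $\mathrm{trace}(A)$. A hedged sanity check using an explicit matrix rather than a network:

```python
import torch

torch.manual_seed(0)
D = 8
A = torch.randn(D, D)
exact_trace = torch.trace(A)

# Hutchinson estimator: E_v[v^T A v] = trace(A) for v ~ N(0, I)
num_projections = 200000
v = torch.randn(num_projections, D)
estimates = ((v @ A) * v).sum(dim=-1)  # v^T A v for each sample
approx_trace = estimates.mean()

print(exact_trace.item(), approx_trace.item())  # the two values agree closely
```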
Denoising Score Matching
The most practical score matching variant is denoising score matching (DSM), which has a direct connection to diffusion models. Instead of matching the score of the data distribution, we match the score of a noise-perturbed distribution.
The DSM Objective
Given clean data $x$, we add Gaussian noise to create $\tilde{x} = x + \sigma \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$. The key insight is that we know the score of the noisy conditional distribution analytically:
DSM Key Identity: $\nabla_{\tilde{x}} \log p(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2} = -\frac{\epsilon}{\sigma}$. The score of the conditional points from the noisy sample back toward the clean data!
```python
import torch
import torch.nn as nn

def denoising_score_matching_loss(
    score_model: nn.Module,
    x: torch.Tensor,
    sigma: float
) -> torch.Tensor:
    """
    Denoising Score Matching loss.
    This is exactly the diffusion training objective!

    Args:
        score_model: s_theta(x, sigma) predicts score at noise level sigma
        x: Clean data samples
        sigma: Noise standard deviation
    """
    # Add noise
    epsilon = torch.randn_like(x)
    x_noisy = x + sigma * epsilon

    # Model predicts score at noise level sigma
    score_pred = score_model(x_noisy, sigma)

    # True score of p(x_noisy | x) = -epsilon / sigma
    score_true = -epsilon / sigma

    # L2 loss
    loss = 0.5 * ((score_pred - score_true) ** 2).sum(dim=-1)

    return loss.mean()


# Alternatively, predict noise directly (epsilon-prediction)
def denoising_score_matching_epsilon(
    epsilon_model: nn.Module,
    x: torch.Tensor,
    sigma: float
) -> torch.Tensor:
    """
    DSM with epsilon-prediction (equivalent formulation).
    This is the standard diffusion training objective.
    """
    epsilon = torch.randn_like(x)
    x_noisy = x + sigma * epsilon

    # Model predicts epsilon
    epsilon_pred = epsilon_model(x_noisy, sigma)

    # Simple MSE on noise
    loss = ((epsilon_pred - epsilon) ** 2).sum(dim=-1)

    return loss.mean()
```

DSM Equivalences:
- Score prediction: s_theta = -epsilon / sigma, optimize ||s_theta - true_score||^2
- Epsilon prediction: epsilon_theta = -sigma * s_theta, optimize ||epsilon_theta - epsilon||^2
- x0 prediction: x0_theta = x_noisy + sigma^2 * s_theta, optimize ||x0_theta - x||^2
- All three are mathematically equivalent, differing only by scaling
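The three parameterizations differ only by deterministic rescalings, which can be confirmed numerically on random tensors (a small sketch, not tied to any trained model):

```python
import torch

torch.manual_seed(0)
sigma = 0.7
x = torch.randn(4, 3)             # clean data
epsilon = torch.randn_like(x)     # noise
x_noisy = x + sigma * epsilon

# True conditional score and its re-expressions
score = -epsilon / sigma                      # score target
epsilon_from_score = -sigma * score           # epsilon-prediction target
x0_from_score = x_noisy + sigma ** 2 * score  # x0-prediction target

print(torch.allclose(epsilon_from_score, epsilon, atol=1e-6))  # True
print(torch.allclose(x0_from_score, x, atol=1e-5))             # True
```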
Connection to Diffusion Models
Denoising score matching reveals why diffusion models work: training a diffusion model is exactly training a score estimator. The noise prediction objective is equivalent to learning the score of the noisy data distribution at each noise level.
This connection provides theoretical grounding for diffusion training:
- The score of $p(\tilde{x} \mid x)$ can be computed analytically given the clean data $x$.
- Training minimizes Fisher divergence between model and true conditional score.
- At convergence, $s_\theta(\tilde{x}, \sigma) \approx \nabla_{\tilde{x}} \log p_\sigma(\tilde{x})$, the score of the marginal noisy distribution.
- Sampling uses the learned score to reverse the noising process.
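For Gaussian data these claims can be checked in closed form: with $x \sim \mathcal{N}(0, 1)$ the noisy marginal is $\mathcal{N}(0, 1 + \sigma^2)$, whose score is $-\tilde{x}/(1+\sigma^2)$. Fitting a linear model $a\,\tilde{x}$ to the DSM target $-\epsilon/\sigma$ by least squares recovers exactly that coefficient. A hedged Monte Carlo sketch:

```python
import torch

torch.manual_seed(0)
sigma = 0.5
n = 500000

x = torch.randn(n)             # clean data ~ N(0, 1)
epsilon = torch.randn(n)
x_noisy = x + sigma * epsilon  # marginal: N(0, 1 + sigma^2)

target = -epsilon / sigma      # DSM regression target

# Least-squares fit of s(x_noisy) = a * x_noisy to the DSM target:
# a* = E[x_noisy * target] / E[x_noisy^2]
a_hat = (x_noisy * target).mean() / (x_noisy ** 2).mean()
a_true = -1.0 / (1.0 + sigma ** 2)  # coefficient of the marginal score

print(a_hat.item(), a_true)  # the fitted coefficient matches the marginal score
```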
Noise-Conditional Score Networks
Real data distributions have complex structure that is hard to learn at a single noise level. Score-based models address this through noise-conditional networks that learn scores across multiple noise scales.
Multi-Scale Score Estimation
The idea is to define a decreasing sequence of noise levels $\sigma_1 > \sigma_2 > \cdots > \sigma_L$ and train a single network to estimate scores at all levels:
```python
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

class NoiseConditionalScoreNetwork(nn.Module):
    """
    Score network conditioned on noise level.
    This is the architecture behind NCSN and diffusion models.
    """
    def __init__(
        self,
        data_dim: int,
        hidden_dim: int = 256,
        num_layers: int = 4,
        num_noise_levels: int = 10,
        sigma_min: float = 0.01,
        sigma_max: float = 10.0
    ):
        super().__init__()

        # Noise level schedule (geometric, from sigma_max down to sigma_min)
        self.register_buffer(
            'sigmas',
            torch.exp(torch.linspace(
                np.log(sigma_max),
                np.log(sigma_min),
                num_noise_levels
            ))
        )

        # Noise level embedding
        self.noise_embedding = nn.Embedding(num_noise_levels, hidden_dim)

        # Score network
        layers = []
        layers.append(nn.Linear(data_dim + hidden_dim, hidden_dim))
        layers.append(nn.SiLU())

        for _ in range(num_layers - 2):
            layers.append(nn.Linear(hidden_dim, hidden_dim))
            layers.append(nn.SiLU())

        layers.append(nn.Linear(hidden_dim, data_dim))

        self.network = nn.Sequential(*layers)

    def forward(
        self,
        x: torch.Tensor,
        noise_level_idx: torch.Tensor
    ) -> torch.Tensor:
        """
        Predict score at given noise level.

        Args:
            x: Input samples [B, D]
            noise_level_idx: Noise level indices [B]

        Returns:
            Predicted score [B, D]
        """
        # Get noise embedding
        noise_emb = self.noise_embedding(noise_level_idx)  # [B, hidden_dim]

        # Concatenate and predict
        combined = torch.cat([x, noise_emb], dim=-1)
        score = self.network(combined)

        # Scale output by 1/sigma for numerical stability
        sigma = self.sigmas[noise_level_idx].unsqueeze(-1)
        return score / sigma


def train_ncsn(
    model: NoiseConditionalScoreNetwork,
    dataloader: DataLoader,
    optimizer: torch.optim.Optimizer,
    num_epochs: int = 100
):
    """Train noise-conditional score network."""
    model.train()

    for epoch in range(num_epochs):
        total_loss = 0.0

        for batch in dataloader:
            x = batch['data']
            B = x.shape[0]

            # Sample random noise levels
            noise_idx = torch.randint(0, len(model.sigmas), (B,))
            sigma = model.sigmas[noise_idx].unsqueeze(-1)

            # Add noise
            epsilon = torch.randn_like(x)
            x_noisy = x + sigma * epsilon

            # Predict score
            score_pred = model(x_noisy, noise_idx)

            # DSM loss (with sigma^2 weighting for stability)
            score_true = -epsilon / sigma
            loss = (0.5 * ((score_pred - score_true) ** 2) * sigma ** 2).sum(dim=-1).mean()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f"Epoch {epoch}: Loss = {total_loss / len(dataloader):.4f}")
```

The geometric noise schedule ensures coverage of both fine details (small $\sigma$) and global structure (large $\sigma$). Weighting by $\sigma^2$ in the loss balances contributions across scales.
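Geometric spacing means consecutive noise levels have a constant ratio, so the schedule allocates the same relative refinement at every scale. A quick standalone check of the schedule construction, assuming the same endpoints as above:

```python
import numpy as np
import torch

sigma_max, sigma_min, num_levels = 10.0, 0.01, 10
sigmas = torch.exp(torch.linspace(np.log(sigma_max), np.log(sigma_min), num_levels))

ratios = sigmas[1:] / sigmas[:-1]
print(sigmas[0].item(), sigmas[-1].item())  # endpoints: 10.0 and 0.01
print(ratios)                               # all ratios equal (constant factor)
```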
Langevin Dynamics Sampling
Once we have a trained score network, we can generate samples using Langevin dynamics: an iterative process that follows the score while adding noise to ensure proper exploration of the distribution.
The Langevin Algorithm
Langevin dynamics discretizes a first-order stochastic differential equation whose stationary distribution is $p(x)$, requiring only the score:
Langevin Dynamics: $x_{t+1} = x_t + \frac{\alpha}{2}\, s(x_t) + \sqrt{\alpha}\, z_t$, where $z_t \sim \mathcal{N}(0, I)$ and $\alpha$ is the step size. As $\alpha \to 0$ and the number of steps $T \to \infty$, samples converge to the target distribution.
```python
import numpy as np
import torch
from typing import Callable, List

def langevin_dynamics(
    score_fn: Callable[[torch.Tensor], torch.Tensor],
    x_init: torch.Tensor,
    step_size: float = 0.01,
    num_steps: int = 1000,
    noise_scale: float = 1.0
) -> torch.Tensor:
    """
    Sample using Langevin dynamics.

    Args:
        score_fn: Function that returns score at x
        x_init: Initial samples
        step_size: Step size alpha
        num_steps: Number of Langevin steps
        noise_scale: Scale for noise (1.0 for exact Langevin)

    Returns:
        Final samples after Langevin iterations
    """
    x = x_init.clone()

    for t in range(num_steps):
        # Score
        score = score_fn(x)

        # Langevin update
        noise = torch.randn_like(x)
        x = x + (step_size / 2) * score + np.sqrt(step_size) * noise_scale * noise

    return x


def langevin_dynamics_with_trajectory(
    score_fn: Callable[[torch.Tensor], torch.Tensor],
    x_init: torch.Tensor,
    step_size: float = 0.01,
    num_steps: int = 1000,
    save_every: int = 10
) -> List[torch.Tensor]:
    """Langevin dynamics saving intermediate states for visualization."""
    x = x_init.clone()
    trajectory = [x.clone()]

    for t in range(num_steps):
        score = score_fn(x)
        noise = torch.randn_like(x)
        x = x + (step_size / 2) * score + np.sqrt(step_size) * noise

        if (t + 1) % save_every == 0:
            trajectory.append(x.clone())

    return trajectory
```

Pure Langevin dynamics works well for simple distributions but struggles with multi-modal distributions, where the score is unreliable in the low-density regions between modes.
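To see plain Langevin dynamics work on a simple target, we can run the update with the analytical score of $\mathcal{N}(0, 1)$ and check that the empirical moments of the chains approach the target. This sketch re-implements the loop inline so it runs standalone:

```python
import numpy as np
import torch

torch.manual_seed(0)
step_size = 0.01
num_steps = 2000

# Many parallel chains, initialized far from the target mode
x = torch.full((20000, 1), 3.0)

for _ in range(num_steps):
    score = -x                   # analytical score of N(0, 1)
    noise = torch.randn_like(x)
    x = x + (step_size / 2) * score + np.sqrt(step_size) * noise

print(x.mean().item(), x.var().item())  # approximately 0 and 1
```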
Annealed Langevin Dynamics
Annealed Langevin dynamics (ALD) addresses the multi-modal problem by starting at high noise levels (where modes are connected) and gradually decreasing noise:
```python
import numpy as np
import torch

def annealed_langevin_dynamics(
    score_model: NoiseConditionalScoreNetwork,
    num_samples: int,
    data_dim: int,
    steps_per_level: int = 100,
    step_size_factor: float = 2.0,
    device: str = 'cuda'
) -> torch.Tensor:
    """
    Annealed Langevin dynamics for sampling.
    This is the original NCSN sampling algorithm.

    Args:
        score_model: Trained noise-conditional score network
        num_samples: Number of samples to generate
        data_dim: Dimension of data
        steps_per_level: Langevin steps at each noise level
        step_size_factor: Multiplier for step size (alpha = factor * sigma^2)

    Returns:
        Generated samples
    """
    sigmas = score_model.sigmas
    num_levels = len(sigmas)

    # Initialize from prior (high noise level)
    x = torch.randn(num_samples, data_dim, device=device) * sigmas[0]

    # Anneal through noise levels
    for level_idx in range(num_levels):
        sigma = sigmas[level_idx]
        step_size = step_size_factor * (sigma ** 2)

        # Create noise level tensor
        noise_idx = torch.full((num_samples,), level_idx, dtype=torch.long, device=device)

        # Langevin steps at this noise level
        for step in range(steps_per_level):
            # Get score at current noise level
            score = score_model(x, noise_idx)

            # Langevin update
            noise = torch.randn_like(x)
            x = x + (step_size / 2) * score + np.sqrt(step_size) * noise

    return x


def annealed_langevin_with_temperature(
    score_model: NoiseConditionalScoreNetwork,
    num_samples: int,
    data_dim: int,
    temperature: float = 1.0,
    steps_per_level: int = 100,
    device: str = 'cuda'
) -> torch.Tensor:
    """
    ALD with temperature control for diversity vs. quality tradeoff.

    Args:
        temperature: < 1.0 for sharper samples, > 1.0 for more diverse
    """
    sigmas = score_model.sigmas
    x = torch.randn(num_samples, data_dim, device=device) * sigmas[0]

    for level_idx in range(len(sigmas)):
        sigma = sigmas[level_idx]
        # Temperature scales the effective step size and noise
        step_size = 2.0 * (sigma ** 2) * temperature

        noise_idx = torch.full((num_samples,), level_idx, dtype=torch.long, device=device)

        for step in range(steps_per_level):
            score = score_model(x, noise_idx)
            noise = torch.randn_like(x) * np.sqrt(temperature)
            x = x + (step_size / 2) * score + np.sqrt(step_size) * noise

    return x
```

Why Annealing Works:
- High noise: Distribution is approximately Gaussian, easy to sample
- Smooth transition: Gradually refine structure as noise decreases
- Mode connectivity: High noise levels connect modes, preventing mode collapse
- Fine details: Low noise levels capture precise data distribution
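The mode-connectivity point can be seen on a toy 1D example: smoothing a two-mode distribution (equal point masses at $\pm 1$) with Gaussian noise of scale $\sigma$ yields a density that is unimodal for large $\sigma$ and bimodal for small $\sigma$. A small standalone check comparing the density at the center against the density at a mode:

```python
import numpy as np

def smoothed_density(x: float, sigma: float) -> float:
    """Density of 0.5*delta(-1) + 0.5*delta(+1) convolved with N(0, sigma^2)."""
    norm = 1.0 / (sigma * np.sqrt(2 * np.pi))
    return 0.5 * norm * (np.exp(-(x + 1) ** 2 / (2 * sigma ** 2))
                         + np.exp(-(x - 1) ** 2 / (2 * sigma ** 2)))

# High noise: single basin (density peaks at the center)
high = smoothed_density(0.0, 2.0) > smoothed_density(1.0, 2.0)
# Low noise: two separated basins (density peaks at the modes)
low = smoothed_density(1.0, 0.5) > smoothed_density(0.0, 0.5)

print(high, low)  # True True
```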
Implementation
Here's a complete implementation of a score-based generative model:
```python
import numpy as np
import torch
from torch.utils.data import DataLoader

class ScoreBasedGenerativeModel:
    """
    Complete score-based generative model.
    Combines noise-conditional score network with annealed Langevin sampling.
    """
    def __init__(
        self,
        data_dim: int,
        hidden_dim: int = 256,
        num_layers: int = 4,
        num_noise_levels: int = 10,
        sigma_min: float = 0.01,
        sigma_max: float = 10.0,
        device: str = 'cuda'
    ):
        self.data_dim = data_dim
        self.device = device

        # Score network
        self.score_net = NoiseConditionalScoreNetwork(
            data_dim=data_dim,
            hidden_dim=hidden_dim,
            num_layers=num_layers,
            num_noise_levels=num_noise_levels,
            sigma_min=sigma_min,
            sigma_max=sigma_max
        ).to(device)

        self.optimizer = torch.optim.Adam(self.score_net.parameters(), lr=1e-3)

    def train_step(self, x: torch.Tensor) -> float:
        """Single training step with DSM objective."""
        self.score_net.train()
        B = x.shape[0]

        # Sample noise levels
        noise_idx = torch.randint(
            0, len(self.score_net.sigmas), (B,),
            device=self.device
        )
        sigma = self.score_net.sigmas[noise_idx].unsqueeze(-1)

        # Add noise
        epsilon = torch.randn_like(x)
        x_noisy = x + sigma * epsilon

        # Predict score
        score_pred = self.score_net(x_noisy, noise_idx)

        # DSM loss with lambda(sigma) = sigma^2 weighting
        score_true = -epsilon / sigma
        loss = (0.5 * ((score_pred - score_true) ** 2) * sigma ** 2).sum(dim=-1)
        loss = loss.mean()

        # Update
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()

    def train(
        self,
        dataloader: DataLoader,
        num_epochs: int,
        log_every: int = 100
    ):
        """Full training loop."""
        step = 0
        for epoch in range(num_epochs):
            epoch_loss = 0.0

            for batch in dataloader:
                x = batch['data'].to(self.device)
                loss = self.train_step(x)
                epoch_loss += loss
                step += 1

                if step % log_every == 0:
                    print(f"Step {step}: Loss = {loss:.4f}")

            print(f"Epoch {epoch}: Avg Loss = {epoch_loss / len(dataloader):.4f}")

    @torch.no_grad()
    def sample(
        self,
        num_samples: int,
        steps_per_level: int = 100,
        return_trajectory: bool = False
    ):
        """Generate samples using annealed Langevin dynamics.

        Returns samples, or (samples, trajectory) if return_trajectory is True.
        """
        self.score_net.eval()

        sigmas = self.score_net.sigmas
        x = torch.randn(num_samples, self.data_dim, device=self.device) * sigmas[0]

        trajectory = [x.clone()] if return_trajectory else None

        for level_idx in range(len(sigmas)):
            sigma = sigmas[level_idx]
            step_size = 2.0 * (sigma ** 2)
            noise_idx = torch.full(
                (num_samples,), level_idx,
                dtype=torch.long, device=self.device
            )

            for step in range(steps_per_level):
                score = self.score_net(x, noise_idx)
                noise = torch.randn_like(x)
                x = x + (step_size / 2) * score + np.sqrt(step_size) * noise

            if return_trajectory:
                trajectory.append(x.clone())

        if return_trajectory:
            return x, trajectory
        return x

    def log_likelihood_estimate(
        self,
        x: torch.Tensor,
        num_integration_steps: int = 100
    ) -> torch.Tensor:
        """
        Rough log-likelihood estimate by integrating the score over noise levels.

        This is only a crude approximation; a full implementation would
        integrate the probability flow ODE.
        """
        self.score_net.eval()

        sigmas = self.score_net.sigmas
        ll = torch.zeros(x.shape[0], device=self.device)

        for level_idx in range(len(sigmas) - 1):
            noise_idx = torch.full(
                (x.shape[0],), level_idx,
                dtype=torch.long, device=self.device
            )

            score = self.score_net(x, noise_idx)

            # Approximate log-likelihood contribution
            dsigma = sigmas[level_idx] - sigmas[level_idx + 1]
            ll = ll - 0.5 * (score ** 2).sum(dim=-1) * dsigma

        return ll
```

Summary
Score-based generative models provide a principled foundation for understanding and improving diffusion models. Key insights:
- The score function captures the structure of a distribution without requiring normalization.
- Denoising score matching provides a tractable training objective that is exactly equivalent to diffusion model training.
- Noise conditioning enables learning scores at multiple scales, capturing both global structure and fine details.
- Annealed Langevin dynamics generates samples by following the learned score field from high to low noise levels.
- The equivalence $\epsilon_\theta(\tilde{x}, \sigma) = -\sigma\, s_\theta(\tilde{x}, \sigma)$ connects noise prediction to score estimation.
In the next section, we'll extend these ideas to continuous time using stochastic differential equations, which provide an even more elegant framework for understanding diffusion as a continuous noising-denoising process.