Chapter 15

Score-Based Generative Models

Score-Based and SDE Perspective

Introduction

Score-based generative models provide an alternative perspective on diffusion that focuses on the gradient of the log-probability density rather than the density itself. This viewpoint, developed primarily by Yang Song and collaborators, offers elegant theoretical foundations and powerful sampling techniques that unify various generative modeling approaches.

Understanding score-based models reveals why denoising diffusion works and opens pathways to improved training objectives, more efficient samplers, and connections to other areas of machine learning. This section develops the score-based perspective from first principles, showing how it naturally leads to diffusion models.


The Score Function

The score function is the central object in score-based generative modeling. Rather than modeling the probability density $p(x)$ directly (which requires knowing an intractable normalization constant), we model its gradient.

Gradient of Log-Density

The score function is defined as the gradient of the log-density with respect to the data:

Score Function Definition: $s(x) = \nabla_x \log p(x)$

The score tells us the direction and magnitude of the steepest increase in log-probability at any point $x$.

This definition has a crucial advantage: the normalization constant vanishes when we take the gradient. If $p(x) = \frac{1}{Z} \tilde{p}(x)$ where $Z$ is the normalizing constant, then:

```python
import torch

# Mathematical derivation showing the normalization constant cancels
"""
p(x) = (1/Z) * p_tilde(x)

log p(x) = log p_tilde(x) - log Z

nabla_x log p(x) = nabla_x log p_tilde(x) - nabla_x log Z
                 = nabla_x log p_tilde(x) - 0
                 = nabla_x log p_tilde(x)

The gradient of log Z is zero because Z doesn't depend on x!
"""

# For a Gaussian: p(x) = (1/sqrt(2*pi*sigma^2)) * exp(-x^2 / (2*sigma^2))
# Score: s(x) = -x / sigma^2

def gaussian_score(x: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Analytical score for a Gaussian distribution."""
    return -x / (sigma ** 2)


# For a mixture of Gaussians (more complex)
def mixture_score(
    x: torch.Tensor, means: torch.Tensor, weights: torch.Tensor, sigma: float
) -> torch.Tensor:
    """Score for a 1D Gaussian mixture: responsibility-weighted sum of component scores."""
    # Component log-densities (the shared normalization constant cancels in the softmax)
    diffs = x.unsqueeze(-1) - means  # [..., K]
    log_probs = -0.5 * (diffs ** 2) / (sigma ** 2)
    probs = torch.softmax(log_probs + torch.log(weights), dim=-1)

    # Weighted sum of component scores
    component_scores = -diffs / (sigma ** 2)
    return (probs * component_scores).sum(dim=-1)
```

Geometric Intuition

The score function has a beautiful geometric interpretation: at every point in space, it points toward regions of higher probability. Following the score leads to the modes of the distribution:

| Property | Score Function Behavior | Interpretation |
| --- | --- | --- |
| At modes | s(x) = 0 | Gradient vanishes at a maximum |
| Near a mode | Points toward the mode | Gradient ascent reaches the mode |
| Low-density regions | Large magnitude | Strong push toward high density |
| Between modes | Points to the nearest mode | Basins of attraction |

For a multivariate distribution, the score is a vector field. Visualizing this field reveals the structure of the distribution: modes appear as sinks (where vectors converge), and saddle points create separations between basins.

```python
import torch
import matplotlib.pyplot as plt
import numpy as np

def visualize_score_field(
    score_fn,
    xlim: tuple = (-3, 3),
    ylim: tuple = (-3, 3),
    resolution: int = 20
):
    """Visualize a 2D score field as a vector plot."""
    # Create grid
    x = torch.linspace(xlim[0], xlim[1], resolution)
    y = torch.linspace(ylim[0], ylim[1], resolution)
    X, Y = torch.meshgrid(x, y, indexing='xy')
    points = torch.stack([X.flatten(), Y.flatten()], dim=-1)

    # Compute scores
    scores = score_fn(points)
    U = scores[:, 0].reshape(resolution, resolution)
    V = scores[:, 1].reshape(resolution, resolution)

    # Plot
    plt.figure(figsize=(8, 8))
    plt.quiver(
        X.numpy(), Y.numpy(),
        U.numpy(), V.numpy(),
        np.sqrt(U.numpy()**2 + V.numpy()**2),  # Color by magnitude
        cmap='viridis'
    )
    plt.colorbar(label='Score magnitude')
    plt.title('Score Field Visualization')
    plt.xlabel('x')
    plt.ylabel('y')
    return plt.gcf()


# Example: Score field for a mixture of two Gaussians
def two_mode_score(x: torch.Tensor) -> torch.Tensor:
    """Score for a mixture of two 2D Gaussians."""
    mean1 = torch.tensor([-1.0, 0.0])
    mean2 = torch.tensor([1.0, 0.0])
    sigma = 0.5

    # Compute responsibilities (which Gaussian each point belongs to)
    d1 = ((x - mean1) ** 2).sum(dim=-1)
    d2 = ((x - mean2) ** 2).sum(dim=-1)
    log_p1 = -d1 / (2 * sigma ** 2)
    log_p2 = -d2 / (2 * sigma ** 2)

    w1 = torch.softmax(torch.stack([log_p1, log_p2], dim=-1), dim=-1)[..., 0]
    w2 = 1 - w1

    # Weighted combination of component scores
    score1 = -(x - mean1) / (sigma ** 2)
    score2 = -(x - mean2) / (sigma ** 2)

    return w1.unsqueeze(-1) * score1 + w2.unsqueeze(-1) * score2
```

Score Matching

Given that the score function is valuable, how do we learn it from data? The naive approach of computing $\nabla_x \log p(x)$ fails because we don't know $p(x)$. Score matching provides an elegant solution.

Fisher Divergence

We want our model score $s_\theta(x)$ to match the true score $\nabla_x \log p(x)$. A natural objective is the Fisher divergence:

Fisher Divergence: $D_F(p \| p_\theta) = \frac{1}{2} \mathbb{E}_{p(x)} \left[ \| s_\theta(x) - \nabla_x \log p(x) \|^2 \right]$

This measures the expected squared difference between our model score and the true score. However, it still contains the unknown $\nabla_x \log p(x)$. The key insight is that we can expand and simplify:

```python
import torch
import torch.nn as nn

"""
Fisher Divergence Expansion:

D_F = (1/2) * E_p[ ||s_theta(x) - nabla log p(x)||^2 ]

    = (1/2) * E_p[ ||s_theta(x)||^2 - 2 * s_theta(x)^T * nabla log p(x) + ||nabla log p(x)||^2 ]

The last term doesn't depend on theta, so for optimization:

J_SM(theta) = E_p[ (1/2) * ||s_theta(x)||^2 + trace(nabla_x s_theta(x)) ]

This is the SCORE MATCHING objective - no ground truth score needed!
The trace term comes from integration by parts (Hyvarinen, 2005).
"""

def score_matching_loss(
    score_model: nn.Module,
    x: torch.Tensor
) -> torch.Tensor:
    """
    Hyvarinen (implicit) score matching loss.
    Requires computing the Jacobian trace, which is expensive.
    """
    x = x.requires_grad_(True)

    # Compute score
    score = score_model(x)

    # Compute trace of Jacobian (one backward pass per dimension)
    trace = 0.0
    for i in range(x.shape[-1]):
        grad_i = torch.autograd.grad(
            score[:, i].sum(),
            x,
            create_graph=True
        )[0][:, i]
        trace = trace + grad_i

    # Score matching objective
    loss = 0.5 * (score ** 2).sum(dim=-1) + trace

    return loss.mean()
```

Sliced Score Matching

The score matching objective above requires computing the trace of the Jacobian, which costs one backward pass per data dimension. Sliced score matching avoids this by estimating the trace with random projections:

```python
import torch
import torch.nn as nn

def sliced_score_matching_loss(
    score_model: nn.Module,
    x: torch.Tensor,
    num_projections: int = 1
) -> torch.Tensor:
    """
    Sliced score matching: estimate the Jacobian trace using random projections.
    Much more efficient than full Jacobian computation.
    """
    x = x.requires_grad_(True)
    score = score_model(x)

    total_loss = 0.0

    for _ in range(num_projections):
        # Random projection vector with E[v v^T] = I (Gaussian; Rademacher also works).
        # Note: do not normalize v, or the trace estimate becomes biased.
        v = torch.randn_like(x)

        # Compute v^T * score
        score_v = (score * v).sum(dim=-1)

        # Backprop through score_v gives J^T v, where J is the score Jacobian
        grad_score_v = torch.autograd.grad(
            score_v.sum(),
            x,
            create_graph=True
        )[0]

        # v^T J v is an unbiased estimate of trace(J)
        jacobian_trace_estimate = (grad_score_v * v).sum(dim=-1)

        # Loss for this projection
        loss = 0.5 * score_v ** 2 + jacobian_trace_estimate
        total_loss = total_loss + loss

    return total_loss.mean() / num_projections
```

Sliced score matching estimates the trace using random projections, reducing the cost from $O(d^2)$ to $O(d)$ where $d$ is the data dimension.
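To see why the projection trick is valid, here is a small self-contained check (an illustration, not from the text): for a toy linear score $s(x) = Ax$ the Jacobian is exactly $A$, so averaging $v^\top J v$ over Gaussian projections should recover $\mathrm{trace}(A)$.

```python
import torch

torch.manual_seed(0)

d = 5
A = torch.randn(d, d)

def linear_score(x: torch.Tensor) -> torch.Tensor:
    # Toy score whose Jacobian is exactly A
    return x @ A.T

x = torch.randn(1, d, requires_grad=True)
score = linear_score(x)

# Average v^T J v over random Gaussian projections (E[v v^T] = I)
num_projections = 2000
total = 0.0
for _ in range(num_projections):
    v = torch.randn(1, d)
    score_v = (score * v).sum()
    grad = torch.autograd.grad(score_v, x, retain_graph=True)[0]  # J^T v
    total += (grad * v).sum().item()                              # v^T J v

estimate = total / num_projections
exact = torch.trace(A).item()
print(f"estimate={estimate:.3f}, exact trace={exact:.3f}")
```

With more projections the Monte Carlo estimate concentrates around the exact trace, which is what makes a single-projection estimate usable as an unbiased training signal.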


Denoising Score Matching

The most practical score matching variant is denoising score matching (DSM), which has a direct connection to diffusion models. Instead of matching the score of the data distribution, we match the score of a noise-perturbed distribution.

The DSM Objective

Given clean data $x$, we add Gaussian noise to create $\tilde{x} = x + \sigma \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$. The key insight is that we know the score of the noisy distribution analytically:

DSM Key Identity: $\nabla_{\tilde{x}} \log p(\tilde{x} \mid x) = \frac{x - \tilde{x}}{\sigma^2} = -\frac{\epsilon}{\sigma}$

The score of the conditional $p(\tilde{x} \mid x)$ points from the noisy sample back toward the clean data!

```python
import torch
import torch.nn as nn

def denoising_score_matching_loss(
    score_model: nn.Module,
    x: torch.Tensor,
    sigma: float
) -> torch.Tensor:
    """
    Denoising score matching loss.
    This is exactly the diffusion training objective!

    Args:
        score_model: s_theta(x, sigma) predicts the score at noise level sigma
        x: Clean data samples
        sigma: Noise standard deviation
    """
    # Add noise
    epsilon = torch.randn_like(x)
    x_noisy = x + sigma * epsilon

    # Model predicts score at noise level sigma
    score_pred = score_model(x_noisy, sigma)

    # True score of p(x_noisy | x) = -epsilon / sigma
    score_true = -epsilon / sigma

    # L2 loss
    loss = 0.5 * ((score_pred - score_true) ** 2).sum(dim=-1)

    return loss.mean()


# Alternatively, predict the noise directly (epsilon-prediction)
def denoising_score_matching_epsilon(
    epsilon_model: nn.Module,
    x: torch.Tensor,
    sigma: float
) -> torch.Tensor:
    """
    DSM with epsilon-prediction (equivalent formulation).
    This is the standard diffusion training objective.
    """
    epsilon = torch.randn_like(x)
    x_noisy = x + sigma * epsilon

    # Model predicts epsilon
    epsilon_pred = epsilon_model(x_noisy, sigma)

    # Simple MSE on the noise
    loss = ((epsilon_pred - epsilon) ** 2).sum(dim=-1)

    return loss.mean()
```
DSM Equivalences:
  • Score prediction: s_theta = -epsilon / sigma, optimize ||s_theta - true_score||^2
  • Epsilon prediction: epsilon_theta = -sigma * s_theta, optimize ||epsilon_theta - epsilon||^2
  • x0 prediction: x0_theta = x_noisy + sigma^2 * s_theta, optimize ||x0_theta - x||^2
  • All three are mathematically equivalent, differing only by scaling
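As a quick sanity check of the equivalences above (a minimal sketch, not from the text), the three parameterizations can be converted into one another exactly, given the true conditional score:

```python
import torch

torch.manual_seed(0)
sigma = 0.7
x = torch.randn(4, 3)            # clean data
epsilon = torch.randn_like(x)    # noise
x_noisy = x + sigma * epsilon

# True conditional score of p(x_noisy | x)
score = (x - x_noisy) / sigma**2            # equals -epsilon / sigma

# Recover the other two parameterizations from the score
eps_from_score = -sigma * score             # epsilon-prediction
x0_from_score = x_noisy + sigma**2 * score  # x0-prediction

assert torch.allclose(eps_from_score, epsilon, atol=1e-6)
assert torch.allclose(x0_from_score, x, atol=1e-6)
print("all three parameterizations agree")
```

Since the conversions are exact linear rescalings, the choice between them only affects which quantity the network outputs and how the loss is implicitly weighted.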

Connection to Diffusion Models

Denoising score matching reveals why diffusion models work: training a diffusion model is exactly training a score estimator. The noise prediction objective is equivalent to learning the score of the noisy data distribution at each noise level.

This connection provides theoretical grounding for diffusion training:

  1. The score of $q(x_t \mid x_0)$ can be computed analytically given the clean data $x_0$.
  2. Training minimizes the Fisher divergence between the model score and the true conditional score.
  3. At convergence, $\epsilon_\theta \approx -\sigma \nabla_x \log q(x_t)$.
  4. Sampling uses the learned score to reverse the noising process.
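Point 1 can be verified directly with autograd: differentiating the Gaussian log-density of $q(x_t \mid x_0)$ reproduces the analytic score $-(x_t - x_0)/\sigma^2 = -\epsilon/\sigma$. A minimal check (illustrative, not from the text):

```python
import torch

torch.manual_seed(0)
sigma = 0.5
x0 = torch.randn(3)
epsilon = torch.randn(3)
xt = (x0 + sigma * epsilon).requires_grad_(True)

# log q(x_t | x_0) for an isotropic Gaussian, dropping the constant
# (the constant has zero gradient, so it doesn't affect the score)
log_q = -((xt - x0) ** 2).sum() / (2 * sigma ** 2)
(score,) = torch.autograd.grad(log_q, xt)

assert torch.allclose(score, -epsilon / sigma, atol=1e-5)
print("autograd score matches -epsilon / sigma")
```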

Noise-Conditional Score Networks

Real data distributions have complex structure that is hard to learn at a single noise level. Score-based models address this through noise-conditional networks that learn scores across multiple noise scales.

Multi-Scale Score Estimation

The idea is to define a sequence of noise levels $\sigma_1 > \sigma_2 > \dots > \sigma_L$ and train a single network to estimate scores at all levels:

```python
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

class NoiseConditionalScoreNetwork(nn.Module):
    """
    Score network conditioned on noise level.
    This is the architecture behind NCSN and diffusion models.
    """
    def __init__(
        self,
        data_dim: int,
        hidden_dim: int = 256,
        num_layers: int = 4,
        num_noise_levels: int = 10,
        sigma_min: float = 0.01,
        sigma_max: float = 10.0
    ):
        super().__init__()

        # Noise level schedule (geometric)
        self.register_buffer(
            'sigmas',
            torch.exp(torch.linspace(
                np.log(sigma_max),
                np.log(sigma_min),
                num_noise_levels
            ))
        )

        # Noise level embedding
        self.noise_embedding = nn.Embedding(num_noise_levels, hidden_dim)

        # Score network
        layers = []
        layers.append(nn.Linear(data_dim + hidden_dim, hidden_dim))
        layers.append(nn.SiLU())

        for _ in range(num_layers - 2):
            layers.append(nn.Linear(hidden_dim, hidden_dim))
            layers.append(nn.SiLU())

        layers.append(nn.Linear(hidden_dim, data_dim))

        self.network = nn.Sequential(*layers)

    def forward(
        self,
        x: torch.Tensor,
        noise_level_idx: torch.Tensor
    ) -> torch.Tensor:
        """
        Predict the score at a given noise level.

        Args:
            x: Input samples [B, D]
            noise_level_idx: Noise level indices [B]

        Returns:
            Predicted score [B, D]
        """
        # Get noise embedding
        noise_emb = self.noise_embedding(noise_level_idx)  # [B, hidden_dim]

        # Concatenate and predict
        combined = torch.cat([x, noise_emb], dim=-1)
        score = self.network(combined)

        # Scale output by 1/sigma for numerical stability
        sigma = self.sigmas[noise_level_idx].unsqueeze(-1)
        return score / sigma


def train_ncsn(
    model: NoiseConditionalScoreNetwork,
    dataloader: DataLoader,
    optimizer: torch.optim.Optimizer,
    num_epochs: int = 100
):
    """Train a noise-conditional score network with the DSM objective."""
    model.train()

    for epoch in range(num_epochs):
        total_loss = 0.0

        for batch in dataloader:
            x = batch['data']
            B = x.shape[0]

            # Sample random noise levels
            noise_idx = torch.randint(0, len(model.sigmas), (B,))
            sigma = model.sigmas[noise_idx].unsqueeze(-1)

            # Add noise
            epsilon = torch.randn_like(x)
            x_noisy = x + sigma * epsilon

            # Predict score
            score_pred = model(x_noisy, noise_idx)

            # DSM loss (with sigma^2 weighting for stability)
            score_true = -epsilon / sigma
            loss = (0.5 * ((score_pred - score_true) ** 2) * sigma ** 2).sum(dim=-1).mean()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f"Epoch {epoch}: Loss = {total_loss / len(dataloader):.4f}")
```

The geometric noise schedule ensures coverage of both fine details (small $\sigma$) and global structure (large $\sigma$). Weighting by $\sigma^2$ in the loss balances contributions across scales.
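Both claims can be checked numerically; the following sketch (illustrative, using the assumed defaults sigma_min=0.01, sigma_max=10, L=10) shows that log-linear spacing gives constant ratios between levels, and that the $\sigma^2$ weight makes the expected magnitude of the DSM target level-independent:

```python
import torch

sigma_min, sigma_max, L = 0.01, 10.0, 10

# Geometric schedule: log-linear spacing means a constant ratio between levels
sigmas = torch.exp(torch.linspace(
    torch.log(torch.tensor(sigma_max)),
    torch.log(torch.tensor(sigma_min)),
    L
))
ratios = sigmas[:-1] / sigmas[1:]
assert torch.allclose(ratios, torch.full_like(ratios, ratios[0].item()), atol=1e-3)

# The DSM target is -epsilon / sigma, so its squared norm scales like 1/sigma^2.
# The lambda(sigma) = sigma^2 weight cancels this, giving an O(1) loss per level.
torch.manual_seed(0)
epsilon = torch.randn(10000, 2)
for sigma in sigmas:
    weighted = ((sigma ** 2) * ((-epsilon / sigma) ** 2).sum(dim=-1)).mean()
    assert abs(weighted.item() - 2.0) < 0.2  # E[||epsilon||^2] = data_dim = 2
print("constant geometric ratios; sigma^2 weighting equalizes target magnitude")
```

Without the weight, the smallest $\sigma$ would dominate the loss by a factor of $(\sigma_\text{max}/\sigma_\text{min})^2$.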


Langevin Dynamics Sampling

Once we have a trained score network, we can generate samples using Langevin dynamics: an iterative process that follows the score while adding noise to ensure proper exploration of the distribution.

The Langevin Algorithm

Langevin dynamics is an iterative stochastic process (the discretization of a first-order stochastic differential equation) that converges to samples from $p(x)$ given only the score:

Langevin Dynamics: $x_{t+1} = x_t + \frac{\alpha}{2} \nabla_x \log p(x_t) + \sqrt{\alpha}\, z_t$

where $z_t \sim \mathcal{N}(0, I)$ and $\alpha$ is the step size. As $t \to \infty$ and $\alpha \to 0$, samples converge to the target distribution.

```python
from typing import Callable, List

import numpy as np
import torch

def langevin_dynamics(
    score_fn: Callable[[torch.Tensor], torch.Tensor],
    x_init: torch.Tensor,
    step_size: float = 0.01,
    num_steps: int = 1000,
    noise_scale: float = 1.0
) -> torch.Tensor:
    """
    Sample using Langevin dynamics.

    Args:
        score_fn: Function that returns the score at x
        x_init: Initial samples
        step_size: Step size alpha
        num_steps: Number of Langevin steps
        noise_scale: Scale for the noise (1.0 for exact Langevin)

    Returns:
        Final samples after Langevin iterations
    """
    x = x_init.clone()

    for t in range(num_steps):
        # Score
        score = score_fn(x)

        # Langevin update
        noise = torch.randn_like(x)
        x = x + (step_size / 2) * score + np.sqrt(step_size) * noise_scale * noise

    return x


def langevin_dynamics_with_trajectory(
    score_fn: Callable[[torch.Tensor], torch.Tensor],
    x_init: torch.Tensor,
    step_size: float = 0.01,
    num_steps: int = 1000,
    save_every: int = 10
) -> List[torch.Tensor]:
    """Langevin dynamics saving intermediate states for visualization."""
    x = x_init.clone()
    trajectory = [x.clone()]

    for t in range(num_steps):
        score = score_fn(x)
        noise = torch.randn_like(x)
        x = x + (step_size / 2) * score + np.sqrt(step_size) * noise

        if (t + 1) % save_every == 0:
            trajectory.append(x.clone())

    return trajectory
```

Pure Langevin dynamics works well for simple distributions but struggles with multi-modal distributions where the score is unreliable in low-density regions between modes.
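As a sanity check of the Langevin update on a case where the score is known exactly, consider a standard Gaussian, where $s(x) = -x$: starting all chains at the origin, the samples should settle to roughly unit standard deviation. A self-contained sketch (illustrative, not from the text):

```python
import math
import torch

torch.manual_seed(0)

def standard_normal_score(x: torch.Tensor) -> torch.Tensor:
    # Analytic score of N(0, I)
    return -x

step_size = 0.1
x = torch.zeros(2000, 2)  # start all chains at the origin

for _ in range(500):
    noise = torch.randn_like(x)
    x = x + (step_size / 2) * standard_normal_score(x) + math.sqrt(step_size) * noise

print(f"sample std = {x.std().item():.3f}")  # close to 1 (with O(step_size) bias)
```

Note the small but systematic discretization bias at finite step size; this is one reason samplers anneal the step size rather than fix it.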

Annealed Langevin Dynamics

Annealed Langevin dynamics (ALD) addresses the multi-modal problem by starting at high noise levels (where modes are connected) and gradually decreasing noise:

```python
import numpy as np
import torch

def annealed_langevin_dynamics(
    score_model: NoiseConditionalScoreNetwork,
    num_samples: int,
    data_dim: int,
    steps_per_level: int = 100,
    step_size_factor: float = 2.0,
    device: str = 'cuda'
) -> torch.Tensor:
    """
    Annealed Langevin dynamics for sampling.
    This is the original NCSN sampling algorithm.

    Args:
        score_model: Trained noise-conditional score network
        num_samples: Number of samples to generate
        data_dim: Dimension of the data
        steps_per_level: Langevin steps at each noise level
        step_size_factor: Multiplier for step size (alpha = factor * sigma^2)

    Returns:
        Generated samples
    """
    sigmas = score_model.sigmas
    num_levels = len(sigmas)

    # Initialize from the prior (high noise level)
    x = torch.randn(num_samples, data_dim, device=device) * sigmas[0]

    # Anneal through noise levels
    for level_idx in range(num_levels):
        sigma = sigmas[level_idx].item()  # python float, safe with np.sqrt on any device
        step_size = step_size_factor * (sigma ** 2)

        # Create noise level tensor
        noise_idx = torch.full((num_samples,), level_idx, dtype=torch.long, device=device)

        # Langevin steps at this noise level
        for step in range(steps_per_level):
            # Get score at the current noise level
            score = score_model(x, noise_idx)

            # Langevin update
            noise = torch.randn_like(x)
            x = x + (step_size / 2) * score + np.sqrt(step_size) * noise

    return x


def annealed_langevin_with_temperature(
    score_model: NoiseConditionalScoreNetwork,
    num_samples: int,
    data_dim: int,
    temperature: float = 1.0,
    steps_per_level: int = 100,
    device: str = 'cuda'
) -> torch.Tensor:
    """
    ALD with temperature control for the diversity vs. quality tradeoff.

    Args:
        temperature: < 1.0 for sharper samples, > 1.0 for more diverse
    """
    sigmas = score_model.sigmas
    x = torch.randn(num_samples, data_dim, device=device) * sigmas[0]

    for level_idx in range(len(sigmas)):
        sigma = sigmas[level_idx].item()
        # Temperature affects the step size and noise scale
        step_size = 2.0 * (sigma ** 2) * temperature

        noise_idx = torch.full((num_samples,), level_idx, dtype=torch.long, device=device)

        for step in range(steps_per_level):
            score = score_model(x, noise_idx)
            noise = torch.randn_like(x) * np.sqrt(temperature)
            x = x + (step_size / 2) * score + np.sqrt(step_size) * noise

    return x
```
Why Annealing Works:
  • High noise: Distribution is approximately Gaussian, easy to sample
  • Smooth transition: Gradually refine structure as noise decreases
  • Mode connectivity: High noise levels connect modes, preventing mode collapse
  • Fine details: Low noise levels capture precise data distribution

Implementation

Here's a complete implementation of a score-based generative model:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

class ScoreBasedGenerativeModel:
    """
    Complete score-based generative model.
    Combines a noise-conditional score network with annealed Langevin sampling.
    """
    def __init__(
        self,
        data_dim: int,
        hidden_dim: int = 256,
        num_layers: int = 4,
        num_noise_levels: int = 10,
        sigma_min: float = 0.01,
        sigma_max: float = 10.0,
        device: str = 'cuda'
    ):
        self.data_dim = data_dim
        self.device = device

        # Score network
        self.score_net = NoiseConditionalScoreNetwork(
            data_dim=data_dim,
            hidden_dim=hidden_dim,
            num_layers=num_layers,
            num_noise_levels=num_noise_levels,
            sigma_min=sigma_min,
            sigma_max=sigma_max
        ).to(device)

        self.optimizer = torch.optim.Adam(self.score_net.parameters(), lr=1e-3)

    def train_step(self, x: torch.Tensor) -> float:
        """Single training step with the DSM objective."""
        self.score_net.train()
        B = x.shape[0]

        # Sample noise levels
        noise_idx = torch.randint(
            0, len(self.score_net.sigmas), (B,),
            device=self.device
        )
        sigma = self.score_net.sigmas[noise_idx].unsqueeze(-1)

        # Add noise
        epsilon = torch.randn_like(x)
        x_noisy = x + sigma * epsilon

        # Predict score
        score_pred = self.score_net(x_noisy, noise_idx)

        # DSM loss with lambda(sigma) = sigma^2 weighting
        score_true = -epsilon / sigma
        loss = 0.5 * ((score_pred - score_true) ** 2 * sigma ** 2).sum(dim=-1)
        loss = loss.mean()

        # Update
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()

    def train(
        self,
        dataloader: DataLoader,
        num_epochs: int,
        log_every: int = 100
    ):
        """Full training loop."""
        step = 0
        for epoch in range(num_epochs):
            epoch_loss = 0.0

            for batch in dataloader:
                x = batch['data'].to(self.device)
                loss = self.train_step(x)
                epoch_loss += loss
                step += 1

                if step % log_every == 0:
                    print(f"Step {step}: Loss = {loss:.4f}")

            print(f"Epoch {epoch}: Avg Loss = {epoch_loss / len(dataloader):.4f}")

    @torch.no_grad()
    def sample(
        self,
        num_samples: int,
        steps_per_level: int = 100,
        return_trajectory: bool = False
    ):
        """Generate samples using annealed Langevin dynamics."""
        self.score_net.eval()

        sigmas = self.score_net.sigmas
        x = torch.randn(num_samples, self.data_dim, device=self.device) * sigmas[0]

        trajectory = [x.clone()] if return_trajectory else None

        for level_idx in range(len(sigmas)):
            sigma = sigmas[level_idx].item()
            step_size = 2.0 * (sigma ** 2)
            noise_idx = torch.full(
                (num_samples,), level_idx,
                dtype=torch.long, device=self.device
            )

            for step in range(steps_per_level):
                score = self.score_net(x, noise_idx)
                noise = torch.randn_like(x)
                x = x + (step_size / 2) * score + np.sqrt(step_size) * noise

            if return_trajectory:
                trajectory.append(x.clone())

        if return_trajectory:
            return x, trajectory
        return x

    def log_likelihood_estimate(
        self,
        x: torch.Tensor,
        num_integration_steps: int = 100
    ) -> torch.Tensor:
        """
        Crude log-likelihood estimate using score integration.
        This uses the fact that nabla log p = score.
        """
        # This is an approximation; a full implementation would use
        # the probability flow ODE (covered in the next section)
        self.score_net.eval()

        sigmas = self.score_net.sigmas
        ll = torch.zeros(x.shape[0], device=self.device)

        for level_idx in range(len(sigmas) - 1):
            noise_idx = torch.full(
                (x.shape[0],), level_idx,
                dtype=torch.long, device=self.device
            )

            score = self.score_net(x, noise_idx)

            # Approximate log-likelihood contribution
            dsigma = sigmas[level_idx] - sigmas[level_idx + 1]
            ll = ll - 0.5 * (score ** 2).sum(dim=-1) * dsigma

        return ll
```

Summary

Score-based generative models provide a principled foundation for understanding and improving diffusion models. Key insights:

  • The score function $s(x) = \nabla_x \log p(x)$ captures the structure of a distribution without requiring normalization.
  • Denoising score matching provides a tractable training objective that is exactly equivalent to diffusion model training.
  • Noise conditioning enables learning scores at multiple scales, capturing both global structure and fine details.
  • Annealed Langevin dynamics generates samples by following the learned score field from high to low noise levels.
  • The equivalence $\epsilon_\theta \approx -\sigma \cdot s_\theta$ connects noise prediction to score estimation.

In the next section, we'll extend these ideas to continuous time using stochastic differential equations, which provide an even more elegant framework for understanding diffusion as a continuous noising-denoising process.