Chapter 17

Video Generation

The Future of Diffusion Models

Learning Objectives

By the end of this section, you will be able to:

  1. Understand the unique challenges of extending diffusion models from images to video
  2. Explain temporal attention mechanisms and how they model motion and consistency
  3. Compare different video architectures including 3D U-Nets and factorized approaches
  4. Understand autoregressive-diffusion hybrids used in systems like Sora
  5. Appreciate the current limitations and active research directions in video generation

From Images to Video

Video generation represents one of the most exciting frontiers in diffusion modeling. While image generation has achieved remarkable quality, extending to video introduces fundamental new challenges:

| Challenge | Image Generation | Video Generation |
| --- | --- | --- |
| Dimensionality | H x W x 3 | T x H x W x 3 (100x+ more data) |
| Consistency | Single-frame coherence | Cross-frame coherence |
| Motion | Not applicable | Physics, smooth trajectories |
| Memory | ~8 GB for SDXL | ~80+ GB for high-res video |
| Compute | Seconds per image | Minutes to hours per video |
The Core Challenge: A 5-second 1080p video at 24 fps contains 120 frames, each with roughly 2 million pixels. That's nearly 250 million pixels that must be temporally coherent: objects must move smoothly, lighting must stay consistent, and physics must remain plausible.
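
This arithmetic is worth verifying; a few lines of Python reproduce the numbers quoted above (a back-of-the-envelope check, not a statement about any particular model):

```python
def video_pixel_count(seconds, fps, height, width):
    """Total pixel positions that must stay temporally coherent."""
    num_frames = int(seconds * fps)
    return num_frames * height * width

image_pixels = 1080 * 1920                          # one 1080p frame (~2.07M pixels)
clip_pixels = video_pixel_count(5, 24, 1080, 1920)  # 5 s at 24 fps = 120 frames
print(clip_pixels, clip_pixels // image_pixels)     # 248832000, 120x one image
```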

Why Temporal Consistency Is Hard

If we naively generate each frame independently using an image model, we get "flickering" - objects randomly change appearance, lighting fluctuates, and motion is jerky. The model has no understanding that frames are connected.

Humans are extremely sensitive to temporal inconsistency. Even subtle flickering in textures or colors is immediately noticeable and distracting. Video models must learn:

  • Object persistence: The same cat should look the same in frame 1 and frame 100
  • Motion smoothness: Objects should follow smooth trajectories, not teleport
  • Physical plausibility: Objects should obey gravity, collisions, etc.
  • Temporal causality: Effects should follow causes (glass breaks after being hit)

Temporal Modeling

The key innovation in video diffusion models is adding temporal attention alongside spatial attention. This allows the model to reason about relationships between frames.

Temporal Attention

In addition to attending to all spatial positions within a frame, temporal attention allows each position to attend to the same spatial position across time:

```python
import torch
import torch.nn as nn
from einops import rearrange

class TemporalAttention(nn.Module):
    """
    Temporal attention for video models.
    Each spatial position attends to the same position across all frames.
    """

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.to_qkv = nn.Linear(dim, dim * 3)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, time, height, width, dim)
        b, t, h, w, d = x.shape

        # Rearrange so each spatial position can attend across time:
        # (batch * height * width, time, dim)
        x = rearrange(x, 'b t h w d -> (b h w) t d')

        # Standard attention computation
        qkv = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = map(
            lambda m: rearrange(m, 'n t (heads d) -> n heads t d', heads=self.num_heads),
            qkv
        )

        # Attention scores across the time dimension
        attn = torch.matmul(q, k.transpose(-1, -2)) * self.scale
        attn = attn.softmax(dim=-1)

        # Apply attention and merge heads
        out = torch.matmul(attn, v)
        out = rearrange(out, 'n heads t d -> n t (heads d)')
        out = self.to_out(out)

        # Restore spatial dimensions
        out = rearrange(out, '(b h w) t d -> b t h w d', b=b, h=h, w=w)

        return out
```

Combined Spatiotemporal Attention

Modern video models typically alternate between spatial and temporal attention, allowing information to flow both within and across frames:

```python
class SpatiotemporalBlock(nn.Module):
    """
    Block that applies both spatial and temporal attention.
    Pattern: Spatial -> Temporal -> FFN
    """

    def __init__(self, dim, spatial_heads=8, temporal_heads=8):
        super().__init__()

        # Spatial attention (within each frame); SpatialAttention follows the
        # same pattern as TemporalAttention, but attends over (h, w) positions
        self.spatial_norm = nn.LayerNorm(dim)
        self.spatial_attn = SpatialAttention(dim, spatial_heads)

        # Temporal attention (across frames)
        self.temporal_norm = nn.LayerNorm(dim)
        self.temporal_attn = TemporalAttention(dim, temporal_heads)

        # Feed-forward network
        self.ffn_norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Linear(dim * 4, dim),
        )

    def forward(self, x, t_emb=None):
        # x: (batch, time, height, width, dim)

        # Spatial attention (within each frame)
        h = self.spatial_norm(x)
        h = rearrange(h, 'b t h w d -> (b t) h w d')  # treat each frame independently
        h = self.spatial_attn(h)
        h = rearrange(h, '(b t) h w d -> b t h w d', t=x.shape[1])
        x = x + h

        # Temporal attention (across frames)
        h = self.temporal_norm(x)
        h = self.temporal_attn(h)
        x = x + h

        # FFN
        h = self.ffn_norm(x)
        h = self.ffn(h)
        x = x + h

        return x
```

Video Diffusion Architectures

3D U-Net

The most straightforward extension is to use 3D convolutions in the U-Net, allowing the network to jointly process spatial and temporal dimensions:

```python
class Video3DConvBlock(nn.Module):
    """3D convolution block for video U-Net."""

    def __init__(self, in_channels, out_channels, time_kernel=3):
        super().__init__()

        # 3D convolution: processes space and time together
        self.conv = nn.Conv3d(
            in_channels, out_channels,
            kernel_size=(time_kernel, 3, 3),
            padding=(time_kernel // 2, 1, 1),
        )
        # num_groups must divide out_channels (matters for the final 4-channel block)
        self.norm = nn.GroupNorm(min(8, out_channels), out_channels)
        self.act = nn.SiLU()

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        return self.act(self.norm(self.conv(x)))


class VideoUNet3D(nn.Module):
    """
    3D U-Net for video diffusion.
    Processes video as a 5D tensor: (B, C, T, H, W)
    """

    def __init__(self, in_channels=4, model_channels=320, num_frames=16):
        super().__init__()
        self.num_frames = num_frames

        # Encoder with 3D convolutions
        self.encoder = nn.ModuleList([
            Video3DConvBlock(in_channels, model_channels),
            Video3DConvBlock(model_channels, model_channels * 2),
            Video3DConvBlock(model_channels * 2, model_channels * 4),
        ])

        # Bottleneck with spatiotemporal attention
        self.bottleneck = SpatiotemporalBlock(model_channels * 4)

        # Decoder with 3D convolutions
        self.decoder = nn.ModuleList([
            Video3DConvBlock(model_channels * 4, model_channels * 2),
            Video3DConvBlock(model_channels * 2, model_channels),
            Video3DConvBlock(model_channels, in_channels),
        ])

    def forward(self, x, t, context=None):
        # x: (B, C, T, H, W) - video latents
        # t: timestep embedding
        # context: text conditioning

        # Process through U-Net
        # (Simplified - real implementation has skip connections, time conditioning, etc.)
        h = x
        for block in self.encoder:
            h = block(h)

        h = rearrange(h, 'b c t h w -> b t h w c')
        h = self.bottleneck(h)
        h = rearrange(h, 'b t h w c -> b c t h w')

        for block in self.decoder:
            h = block(h)

        return h
```

Factorized Approaches

Full 3D attention is computationally expensive. Many systems use factorized attention, separating spatial and temporal processing:

| Approach | Description | Trade-off |
| --- | --- | --- |
| Full 3D | Every position attends to all spatiotemporal positions | Best quality, highest cost |
| Spatial + Temporal | Separate 2D spatial and 1D temporal attention | Good balance, most common |
| Causal Temporal | Frame t attends only to frames ≤ t | Enables autoregressive generation |
| Windowed Temporal | Attend to nearby frames only | Efficient for long videos |
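
The causal and windowed variants differ only in the boolean mask applied along the time axis. The helper below is an illustrative sketch (names are ours, not from any particular codebase):

```python
import torch

def temporal_attention_mask(num_frames, causal=False, window=None):
    """Boolean (T, T) mask: entry [i, j] is True if frame i may attend to frame j."""
    i = torch.arange(num_frames).unsqueeze(1)  # query frame index
    j = torch.arange(num_frames).unsqueeze(0)  # key frame index
    mask = torch.ones(num_frames, num_frames, dtype=torch.bool)
    if causal:
        mask &= j <= i                   # no attention to future frames
    if window is not None:
        mask &= (i - j).abs() <= window  # only nearby frames
    return mask

# Causal mask is lower-triangular: each frame sees itself and the past
print(temporal_attention_mask(4, causal=True).int())
```

In an attention layer, positions where the mask is False would have their scores set to -inf before the softmax.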

State-of-the-Art Systems

Several breakthrough systems have demonstrated remarkable video generation capabilities:

OpenAI Sora (2024)

Sora represents a paradigm shift, generating minute-long videos with unprecedented consistency and quality. Key innovations:

  • Diffusion Transformer (DiT): Uses transformer architecture instead of U-Net, scaling more efficiently
  • Spacetime Patches: Treats video as 3D patches, like ViT but extended to video
  • Variable Resolution/Duration: Trained on diverse aspect ratios and lengths
  • Recaptioning: Uses LLMs to create detailed training captions
```python
class DiTBlock(nn.Module):
    """
    Diffusion Transformer block (conceptual implementation).
    Uses adaptive layer norm (adaLN) for conditioning: a parameter-free
    LayerNorm whose shift/scale are predicted from the timestep embedding.
    """

    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Linear(dim * 4, dim),
        )

        # Adaptive LayerNorm conditioning: shift, scale, and gate
        # for both the attention and FFN branches (6 vectors in total)
        self.adaln = nn.Sequential(
            nn.SiLU(),
            nn.Linear(dim, dim * 6),
        )

    def forward(self, x, t_emb):
        # x: (batch, seq, dim); t_emb: (batch, dim)
        adaln_out = self.adaln(t_emb)
        (shift_attn, scale_attn, gate_attn,
         shift_ffn, scale_ffn, gate_ffn) = adaln_out.chunk(6, dim=-1)

        # Modulated attention
        h = self.attn_norm(x) * (1 + scale_attn.unsqueeze(1)) + shift_attn.unsqueeze(1)
        h = self.attention(h, h, h)[0]
        x = x + gate_attn.unsqueeze(1) * h

        # Modulated FFN
        h = self.ffn_norm(x) * (1 + scale_ffn.unsqueeze(1)) + shift_ffn.unsqueeze(1)
        h = self.ffn(h)
        x = x + gate_ffn.unsqueeze(1) * h

        return x
```

Runway Gen-3 Alpha

  • High-fidelity motion: Excellent at complex camera movements and subject motion
  • Image-to-video: Can animate static images with specified motion
  • Multi-modal conditioning: Text, image, or reference motion

Stable Video Diffusion (Open Source)

```python
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video
import torch

# Load the pipeline
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# Load a conditioning image
image = load_image("input_image.png")
image = image.resize((1024, 576))  # SVD requires specific resolution

# Generate video frames
generator = torch.Generator("cuda").manual_seed(42)

frames = pipe(
    image,
    decode_chunk_size=8,  # Memory optimization
    generator=generator,
    num_frames=25,  # Generate 25 frames
    motion_bucket_id=127,  # Control amount of motion (0-255)
    noise_aug_strength=0.02,  # Noise augmentation for conditioning image
).frames[0]

# Export to video file
export_to_video(frames, "output_video.mp4", fps=7)
```

Pika Labs

  • Consumer-focused: Optimized for accessibility and ease of use
  • Lip sync: Specialized features for talking head generation
  • Style transfer: Video-to-video style transformation

Implementation Concepts

Training Data Considerations

Video training data is more complex than images:

  • Caption quality: Videos need frame-aligned or segment-level captions describing action, not just objects
  • Motion diversity: Training set must include diverse camera movements, subject motions, and scene dynamics
  • Temporal resolution: Different frame rates capture different types of motion
  • Quality filtering: Remove static videos, watermarks, poor quality clips
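
The simplest of these filters, dropping near-static clips, can be implemented as a thresholded mean frame difference. This is an illustrative sketch; the threshold is a made-up example value that would be tuned per dataset in practice:

```python
import torch

def is_static(video, threshold=0.01):
    """video: (T, C, H, W) tensor in [0, 1]. True if frames barely change."""
    if video.shape[0] < 2:
        return True
    mean_diff = (video[1:] - video[:-1]).abs().mean()
    return mean_diff.item() < threshold

static_clip = torch.zeros(8, 3, 16, 16)  # identical frames
moving_clip = torch.rand(8, 3, 16, 16)   # frames change everywhere
print(is_static(static_clip), is_static(moving_clip))  # True False
```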

Latent Video Diffusion

Like image models, video models typically work in latent space for efficiency. The VAE must be extended to handle temporal dimensions:

```python
class VideoVAE(nn.Module):
    """
    Video VAE that compresses along spatial AND temporal dimensions.
    Typical compression: 8x spatial, 4x temporal.
    """

    def __init__(self, latent_channels=4, spatial_compress=8, temporal_compress=4):
        super().__init__()
        self.spatial_compress = spatial_compress
        self.temporal_compress = temporal_compress

        # 3D encoder
        self.encoder = nn.Sequential(
            # Spatial downsampling (three stride-2 convs = 8x)
            nn.Conv3d(3, 64, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.SiLU(),
            nn.Conv3d(128, 256, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.SiLU(),
            # Temporal downsampling (two stride-2 convs = 4x)
            nn.Conv3d(256, 256, kernel_size=(4, 3, 3), stride=(2, 1, 1), padding=(1, 1, 1)),
            nn.SiLU(),
            nn.Conv3d(256, 256, kernel_size=(4, 3, 3), stride=(2, 1, 1), padding=(1, 1, 1)),
            nn.SiLU(),
            # To latent (mean and logvar channels)
            nn.Conv3d(256, latent_channels * 2, kernel_size=1),
        )

        # 3D decoder (mirrors encoder)
        self.decoder = nn.Sequential(
            nn.Conv3d(latent_channels, 256, kernel_size=1),
            # Temporal upsampling
            nn.ConvTranspose3d(256, 256, kernel_size=(4, 3, 3), stride=(2, 1, 1), padding=(1, 1, 1)),
            nn.SiLU(),
            nn.ConvTranspose3d(256, 256, kernel_size=(4, 3, 3), stride=(2, 1, 1), padding=(1, 1, 1)),
            nn.SiLU(),
            # Spatial upsampling
            nn.ConvTranspose3d(256, 128, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.SiLU(),
            nn.ConvTranspose3d(128, 64, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.SiLU(),
            nn.ConvTranspose3d(64, 3, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
        )

    def encode(self, video):
        # video: (B, C, T, H, W)
        h = self.encoder(video)
        mean, logvar = h.chunk(2, dim=1)
        # Reparameterization
        std = torch.exp(0.5 * logvar)
        z = mean + std * torch.randn_like(std)
        return z, mean, logvar

    def decode(self, z):
        return self.decoder(z)
```
Challenges and Limitations

Current Limitations

| Limitation | Description | Research Direction |
| --- | --- | --- |
| Length | Most models limited to ~10 seconds | Hierarchical generation, sliding window |
| Resolution | Often 512p-720p, rarely 4K | Efficient architectures, cascaded models |
| Physics | Objects often violate physics | Physics-informed training, simulation data |
| Consistency | Long videos still show drift | Better temporal models, memory mechanisms |
| Control | Hard to specify exact motion | Motion conditioning, trajectory input |
| Compute | Very expensive to train and run | Efficient attention, quantization |
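
The sliding-window idea in the table can be sketched as a loop that conditions each new chunk on the trailing frames of the video generated so far. `generate_chunk` stands in for a full diffusion sampler and is purely hypothetical:

```python
def generate_long_video(generate_chunk, num_chunks, overlap=4):
    """
    Sliding-window generation: each chunk is conditioned on the trailing
    `overlap` frames of the video so far, to keep chunks consistent.
    generate_chunk(context) must return a list of new frames.
    """
    video = generate_chunk([])          # first chunk: no context
    for _ in range(num_chunks - 1):
        context = video[-overlap:]      # condition on the trailing frames
        video.extend(generate_chunk(context))
    return video

# Stand-in sampler: "frames" are just global indices, 16 per chunk
def fake_sampler(context, chunk_len=16):
    start = context[-1] + 1 if context else 0
    return list(range(start, start + chunk_len))

video = generate_long_video(fake_sampler, num_chunks=3)
print(len(video))  # 48
```

The catch, noted in the table's Consistency row, is that errors accumulate: each chunk only sees a short context, so long videos still drift.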

Evaluation Challenges

Evaluating video quality is even harder than images:

  • FVD (Fréchet Video Distance): Extension of FID, but doesn't fully capture temporal quality
  • Human evaluation: Gold standard but expensive and subjective
  • Text alignment: Does the video match the prompt? Hard to measure automatically
  • Temporal consistency: Need specialized metrics for flicker, drift, etc.
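
A crude illustration of why specialized metrics are needed: frame-to-frame differences capture flicker, while first-to-last differences capture drift, and the two can disagree. Real metrics such as FVD compare distributions of pretrained video features instead; this sketch uses raw pixels only:

```python
import torch

def temporal_metrics(frames):
    """frames: (T, C, H, W) in [0, 1]. Returns (flicker, drift) scores."""
    flicker = (frames[1:] - frames[:-1]).abs().mean().item()  # frame-to-frame jitter
    drift = (frames[-1] - frames[0]).abs().mean().item()      # slow cumulative change
    return flicker, drift

# A slow fade has low flicker but high drift; independent noise frames flicker badly.
T = 25
fade = torch.linspace(0, 1, T).view(T, 1, 1, 1).expand(T, 3, 8, 8)
noise = torch.rand(T, 3, 8, 8)
f_fade, d_fade = temporal_metrics(fade)
f_noise, d_noise = temporal_metrics(noise)
print(f_fade < f_noise, d_fade > d_noise)  # True True
```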

Summary

Video generation with diffusion models represents a frontier of research with tremendous progress and remaining challenges:

  1. Temporal modeling: Adding attention across time enables cross-frame consistency
  2. Architecture choices: 3D U-Nets, factorized attention, and Diffusion Transformers offer different trade-offs
  3. State-of-the-art systems: Sora, Gen-3, and SVD demonstrate impressive capabilities
  4. Latent video: Working in compressed latent space makes video generation tractable
  5. Open challenges: Length, resolution, physics, and control remain active research areas

Looking Ahead: In the next section, we'll explore 3D generation: creating three-dimensional objects and scenes from text or images using diffusion models.

Key Papers and Resources

  • Video Diffusion Models: Ho et al. (2022) - Original video diffusion work
  • Imagen Video: Ho et al. (2022) - Cascaded video generation
  • Make-A-Video: Singer et al. (2022) - Text-to-video from Meta
  • Sora Technical Report: OpenAI (2024) - Spacetime patch approach
  • Stable Video Diffusion: Stability AI (2023) - Open source img2vid