Learning Objectives
By the end of this section, you will be able to:
- Understand the unique challenges of extending diffusion models from images to video
- Explain temporal attention mechanisms and how they model motion and consistency
- Compare different video architectures including 3D U-Nets and factorized approaches
- Understand autoregressive-diffusion hybrids used in systems like Sora
- Appreciate the current limitations and active research directions in video generation
From Images to Video
Video generation represents one of the most exciting frontiers in diffusion modeling. While image generation has achieved remarkable quality, extending to video introduces fundamental new challenges:
| Challenge | Image Generation | Video Generation |
|---|---|---|
| Dimensionality | H x W x 3 | T x H x W x 3 (100x+ more data) |
| Consistency | Single frame coherence | Cross-frame coherence |
| Motion | Not applicable | Physics, smooth trajectories |
| Memory | ~8 GB for SDXL | ~80+ GB for high-res video |
| Compute | Seconds per image | Minutes to hours per video |
The Core Challenge: A 5-second 1080p video at 24 fps contains 120 frames, each with 2 million pixels. That's 240 million pixels that must be temporally coherent - objects must move smoothly, lighting must be consistent, and physics must be plausible.
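The arithmetic above can be checked in a couple of lines (using 1920x1080 for 1080p; the "2 million pixels" per frame in the text is a rounded figure):

```python
# Scale of the problem: a 5-second 1080p clip at 24 fps
frames = 5 * 24                     # 120 frames
pixels_per_frame = 1920 * 1080      # ~2.07 million pixels per frame
total_pixels = frames * pixels_per_frame

print(frames)        # 120
print(total_pixels)  # 248832000 - roughly the 240 million figure above
```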
Why Temporal Consistency Is Hard
If we naively generate each frame independently using an image model, we get "flickering" - objects randomly change appearance, lighting fluctuates, and motion is jerky. The model has no understanding that frames are connected.
Humans are extremely sensitive to temporal inconsistency. Even subtle flickering in textures or colors is immediately noticeable and distracting. Video models must learn:
- Object persistence: The same cat should look the same in frame 1 and frame 100
- Motion smoothness: Objects should follow smooth trajectories, not teleport
- Physical plausibility: Objects should obey gravity, collisions, etc.
- Temporal causality: Effects should follow causes (glass breaks after being hit)
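To make the flicker problem concrete, here is a toy experiment (shapes and noise levels are made up, purely illustrative): frames sampled independently, as a naive per-frame image model would produce, show large frame-to-frame differences, while frames that share content change only slightly.

```python
import torch

torch.manual_seed(0)
T, H, W = 16, 8, 8

# "Independent frames": each frame sampled separately, as naive per-frame generation would
independent = torch.rand(T, H, W)

# "Coherent frames": one base frame plus small perturbations
base = torch.rand(H, W)
coherent = base.unsqueeze(0) + 0.01 * torch.randn(T, H, W)

def mean_frame_diff(video):
    # Average absolute change between consecutive frames - a crude flicker proxy
    return (video[1:] - video[:-1]).abs().mean().item()

print(mean_frame_diff(independent))  # large: frames are unrelated
print(mean_frame_diff(coherent))     # small: frames share content
```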
Temporal Modeling
The key innovation in video diffusion models is adding temporal attention alongside spatial attention. This lets the model reason about relationships between frames.
Temporal Attention
In addition to attending to all spatial positions within a frame, temporal attention allows each position to attend to the same spatial position across time:
```python
import torch
import torch.nn as nn
from einops import rearrange


class TemporalAttention(nn.Module):
    """
    Temporal attention for video models.
    Each spatial position attends to the same position across all frames.
    """

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.to_qkv = nn.Linear(dim, dim * 3)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, time, height, width, dim)
        b, t, h, w, d = x.shape

        # Rearrange so each spatial position can attend across time:
        # (batch * height * width, time, dim)
        x = rearrange(x, 'b t h w d -> (b h w) t d')

        # Standard attention computation
        qkv = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = map(
            lambda y: rearrange(y, 'n t (h d) -> n h t d', h=self.num_heads),
            qkv
        )

        # Attention scores across the time dimension
        attn = torch.matmul(q, k.transpose(-1, -2)) * self.scale
        attn = attn.softmax(dim=-1)

        # Apply attention and merge heads back
        out = torch.matmul(attn, v)
        out = rearrange(out, 'n h t d -> n t (h d)')
        out = self.to_out(out)

        # Restore spatial dimensions
        out = rearrange(out, '(b h w) t d -> b t h w d', b=b, h=h, w=w)

        return out
```

Combined Spatiotemporal Attention
Modern video models typically alternate between spatial and temporal attention, allowing information to flow both within and across frames:
```python
class SpatiotemporalBlock(nn.Module):
    """
    Block that applies both spatial and temporal attention.
    Pattern: Spatial -> Temporal -> FFN
    (Uses the torch / einops imports from the previous snippet.
    SpatialAttention is analogous to TemporalAttention above, but attends
    over the height * width positions within each frame.)
    """

    def __init__(self, dim, spatial_heads=8, temporal_heads=8):
        super().__init__()

        # Spatial attention (within each frame)
        self.spatial_norm = nn.LayerNorm(dim)
        self.spatial_attn = SpatialAttention(dim, spatial_heads)

        # Temporal attention (across frames)
        self.temporal_norm = nn.LayerNorm(dim)
        self.temporal_attn = TemporalAttention(dim, temporal_heads)

        # Feed-forward network
        self.ffn_norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Linear(dim * 4, dim),
        )

    def forward(self, x, t_emb=None):
        # x: (batch, time, height, width, dim)
        # t_emb: timestep embedding (conditioning omitted in this sketch)

        # Spatial attention (within each frame)
        h = self.spatial_norm(x)
        h = rearrange(h, 'b t h w d -> (b t) h w d')  # treat each frame independently
        h = self.spatial_attn(h)
        h = rearrange(h, '(b t) h w d -> b t h w d', t=x.shape[1])
        x = x + h

        # Temporal attention (across frames)
        h = self.temporal_norm(x)
        h = self.temporal_attn(h)
        x = x + h

        # FFN
        h = self.ffn_norm(x)
        h = self.ffn(h)
        x = x + h

        return x
```

Video Diffusion Architectures
3D U-Net
The most straightforward extension is to use 3D convolutions in the U-Net, allowing the network to jointly process spatial and temporal dimensions:
```python
class Video3DConvBlock(nn.Module):
    """3D convolution block for video U-Net."""

    def __init__(self, in_channels, out_channels, time_kernel=3):
        super().__init__()

        # 3D convolution: processes space and time together
        self.conv = nn.Conv3d(
            in_channels, out_channels,
            kernel_size=(time_kernel, 3, 3),
            padding=(time_kernel // 2, 1, 1),
        )
        self.norm = nn.GroupNorm(8, out_channels)
        self.act = nn.SiLU()

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        return self.act(self.norm(self.conv(x)))


class VideoUNet3D(nn.Module):
    """
    3D U-Net for video diffusion.
    Processes video as a 5D tensor: (B, C, T, H, W)
    """

    def __init__(self, in_channels=4, model_channels=320, num_frames=16):
        super().__init__()
        self.num_frames = num_frames

        # Encoder with 3D convolutions
        self.encoder = nn.ModuleList([
            Video3DConvBlock(in_channels, model_channels),
            Video3DConvBlock(model_channels, model_channels * 2),
            Video3DConvBlock(model_channels * 2, model_channels * 4),
        ])

        # Bottleneck with spatiotemporal attention
        self.bottleneck = SpatiotemporalBlock(model_channels * 4)

        # Decoder with 3D convolutions
        self.decoder = nn.ModuleList([
            Video3DConvBlock(model_channels * 4, model_channels * 2),
            Video3DConvBlock(model_channels * 2, model_channels),
            Video3DConvBlock(model_channels, in_channels),
        ])

    def forward(self, x, t, context=None):
        # x: (B, C, T, H, W) - video latents
        # t: timestep embedding
        # context: text conditioning
        # (Simplified - a real implementation has skip connections,
        # up/downsampling, and time conditioning.)
        h = x
        for block in self.encoder:
            h = block(h)

        # The attention bottleneck expects channels-last: (B, T, H, W, C)
        h = rearrange(h, 'b c t h w -> b t h w c')
        h = self.bottleneck(h)
        h = rearrange(h, 'b t h w c -> b c t h w')

        for block in self.decoder:
            h = block(h)

        return h
```

Factorized Approaches
Full 3D attention is computationally expensive. Many systems use factorized attention - separating spatial and temporal processing:
| Approach | Description | Trade-off |
|---|---|---|
| Full 3D | Every position attends to all spatiotemporal positions | Best quality, highest cost |
| Spatial + Temporal | Separate 2D spatial and 1D temporal attention | Good balance, most common |
| Causal Temporal | Frame t only attends to frames < t | Enables autoregressive generation |
| Windowed Temporal | Attend to nearby frames only | Efficient for long videos |
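The last two rows of the table can be expressed as boolean attention masks over the frame axis. A minimal sketch (in practice, such a mask would be passed into the attention computation to zero out disallowed positions):

```python
import torch

T = 6  # number of frames

# Causal mask: frame t attends only to frames <= t (enables autoregressive decoding)
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))

# Windowed mask: each frame attends only to frames within +/- w steps
w = 1
idx = torch.arange(T)
windowed = (idx[None, :] - idx[:, None]).abs() <= w

print(causal.int())
print(windowed.int())
```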
State-of-the-Art Systems
Several breakthrough systems have demonstrated remarkable video generation capabilities:
OpenAI Sora (2024)
Sora represents a paradigm shift, generating minute-long videos with unprecedented consistency and quality. Key innovations:
- Diffusion Transformer (DiT): Uses transformer architecture instead of U-Net, scaling more efficiently
- Spacetime Patches: Treats video as 3D patches, like ViT but extended to video
- Variable Resolution/Duration: Trained on diverse aspect ratios and lengths
- Recaptioning: Uses LLMs to create detailed training captions
```python
class DiTBlock(nn.Module):
    """
    Diffusion Transformer block (conceptual implementation).
    Uses adaptive layer norm (adaLN) for conditioning.
    """

    def __init__(self, dim, num_heads):
        super().__init__()
        # batch_first=True so inputs are (batch, tokens, dim)
        self.attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Linear(dim * 4, dim),
        )

        # Adaptive LayerNorm for conditioning
        self.adaln = nn.Sequential(
            nn.SiLU(),
            nn.Linear(dim, dim * 6),  # shift, scale, gate for attn and ffn
        )

    def forward(self, x, t_emb):
        # x: (batch, tokens, dim); t_emb: (batch, dim)
        # Get conditioning shifts, scales, and gates
        adaln_out = self.adaln(t_emb)
        shift_attn, scale_attn, gate_attn, shift_ffn, scale_ffn, gate_ffn = \
            adaln_out.chunk(6, dim=-1)

        # Modulated attention
        h = x * (1 + scale_attn.unsqueeze(1)) + shift_attn.unsqueeze(1)
        h = self.attention(h, h, h)[0]
        x = x + gate_attn.unsqueeze(1) * h

        # Modulated FFN
        h = x * (1 + scale_ffn.unsqueeze(1)) + shift_ffn.unsqueeze(1)
        h = self.ffn(h)
        x = x + gate_ffn.unsqueeze(1) * h

        return x
```

Runway Gen-3 Alpha
- High-fidelity motion: Excellent at complex camera movements and subject motion
- Image-to-video: Can animate static images with specified motion
- Multi-modal conditioning: Text, image, or reference motion
Stable Video Diffusion (Open Source)
```python
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video
import torch

# Load the pipeline
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# Load a conditioning image
image = load_image("input_image.png")
image = image.resize((1024, 576))  # SVD expects this resolution

# Generate video frames
generator = torch.Generator("cuda").manual_seed(42)

frames = pipe(
    image,
    decode_chunk_size=8,      # memory optimization
    generator=generator,
    num_frames=25,            # generate 25 frames
    motion_bucket_id=127,     # control amount of motion (0-255)
    noise_aug_strength=0.02,  # noise augmentation for the conditioning image
).frames[0]

# Export to a video file
export_to_video(frames, "output_video.mp4", fps=7)
```

Pika Labs
- Consumer-focused: Optimized for accessibility and ease of use
- Lip sync: Specialized features for talking head generation
- Style transfer: Video-to-video style transformation
Implementation Concepts
Training Data Considerations
Video training data is more complex than images:
- Caption quality: Videos need frame-aligned or segment-level captions describing action, not just objects
- Motion diversity: Training set must include diverse camera movements, subject motions, and scene dynamics
- Temporal resolution: Different frame rates capture different types of motion
- Quality filtering: Remove static videos, watermarks, poor quality clips
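As one example of quality filtering, a crude heuristic for dropping near-static clips (the function name and threshold here are hypothetical, purely illustrative):

```python
import torch

def is_static(video, threshold=0.01):
    """Flag clips whose frames barely change - illustrative filtering heuristic.

    video: (T, C, H, W) tensor with values in [0, 1]
    """
    if video.shape[0] < 2:
        return True
    # Mean absolute change between consecutive frames as a motion proxy
    motion = (video[1:] - video[:-1]).abs().mean().item()
    return motion < threshold

static_clip = torch.zeros(8, 3, 16, 16)  # identical frames -> no motion
moving_clip = torch.rand(8, 3, 16, 16)   # frames change substantially
print(is_static(static_clip))   # True
print(is_static(moving_clip))   # False
```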
Latent Video Diffusion
Like image models, video models typically work in latent space for efficiency. The VAE must be extended to handle temporal dimensions:
```python
class VideoVAE(nn.Module):
    """
    Video VAE that compresses along spatial AND temporal dimensions.
    Typical compression: 8x spatial, 4x temporal.
    """

    def __init__(self, latent_channels=4, spatial_compress=8, temporal_compress=4):
        super().__init__()
        self.spatial_compress = spatial_compress
        self.temporal_compress = temporal_compress

        # 3D encoder
        self.encoder = nn.Sequential(
            # Spatial downsampling (2x per layer -> 8x total)
            nn.Conv3d(3, 64, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.SiLU(),
            nn.Conv3d(128, 256, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.SiLU(),
            # Temporal downsampling (2x per layer -> 4x total)
            nn.Conv3d(256, 256, kernel_size=(4, 3, 3), stride=(2, 1, 1), padding=(1, 1, 1)),
            nn.SiLU(),
            nn.Conv3d(256, 256, kernel_size=(4, 3, 3), stride=(2, 1, 1), padding=(1, 1, 1)),
            nn.SiLU(),
            # To latent (mean and logvar channels)
            nn.Conv3d(256, latent_channels * 2, kernel_size=1),
        )

        # 3D decoder (mirrors encoder)
        self.decoder = nn.Sequential(
            nn.Conv3d(latent_channels, 256, kernel_size=1),
            # Temporal upsampling
            nn.ConvTranspose3d(256, 256, kernel_size=(4, 3, 3), stride=(2, 1, 1), padding=(1, 1, 1)),
            nn.SiLU(),
            nn.ConvTranspose3d(256, 256, kernel_size=(4, 3, 3), stride=(2, 1, 1), padding=(1, 1, 1)),
            nn.SiLU(),
            # Spatial upsampling
            nn.ConvTranspose3d(256, 128, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.SiLU(),
            nn.ConvTranspose3d(128, 64, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.SiLU(),
            nn.ConvTranspose3d(64, 3, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
        )

    def encode(self, video):
        # video: (B, C, T, H, W)
        h = self.encoder(video)
        mean, logvar = h.chunk(2, dim=1)
        # Reparameterization trick
        std = torch.exp(0.5 * logvar)
        z = mean + std * torch.randn_like(std)
        return z, mean, logvar

    def decode(self, z):
        return self.decoder(z)
```

Challenges and Limitations
Current Limitations
| Limitation | Description | Research Direction |
|---|---|---|
| Length | Most models limited to ~10 seconds | Hierarchical generation, sliding window |
| Resolution | Often 512p-720p, rarely 4K | Efficient architectures, cascaded models |
| Physics | Objects often violate physics | Physics-informed training, simulation data |
| Consistency | Long videos still show drift | Better temporal models, memory mechanisms |
| Control | Hard to specify exact motion | Motion conditioning, trajectory input |
| Compute | Very expensive to train and run | Efficient attention, quantization |
Evaluation Challenges
Evaluating video quality is even harder than images:
- FVD (Fréchet Video Distance): Extension of FID, but doesn't fully capture temporal quality
- Human evaluation: Gold standard but expensive and subjective
- Text alignment: Does the video match the prompt? Hard to measure automatically
- Temporal consistency: Need specialized metrics for flicker, drift, etc.
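For intuition on the FVD bullet above: FVD computes a Fréchet distance between Gaussians fit to features of real and generated videos (from an I3D network in practice). A sketch of the Fréchet distance itself, simplified to diagonal covariances so the formula stays readable:

```python
import torch

def frechet_gaussian(mu1, var1, mu2, var2):
    """Fréchet distance between two Gaussians with diagonal covariance.

    Real FVD uses full covariance matrices of I3D video features;
    the diagonal case shown here is a simplification for illustration.
    """
    mean_term = ((mu1 - mu2) ** 2).sum()
    cov_term = (var1 + var2 - 2 * (var1 * var2).sqrt()).sum()
    return (mean_term + cov_term).item()

mu = torch.zeros(4)
var = torch.ones(4)
print(frechet_gaussian(mu, var, mu, var))        # identical distributions: 0.0
print(frechet_gaussian(mu, var, mu + 1.0, var))  # means shifted by 1 in 4 dims: 4.0
```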
Summary
Video generation with diffusion models represents a frontier of research with tremendous progress and remaining challenges:
- Temporal modeling: Adding attention across time enables cross-frame consistency
- Architecture choices: 3D U-Nets, factorized attention, and Diffusion Transformers offer different trade-offs
- State-of-the-art systems: Sora, Gen-3, and SVD demonstrate impressive capabilities
- Latent video: Working in compressed latent space makes video generation tractable
- Open challenges: Length, resolution, physics, and control remain active research areas
Looking Ahead: In the next section, we'll explore 3D generation - creating three-dimensional objects and scenes from text or images using diffusion models.
Key Papers and Resources
- Video Diffusion Models: Ho et al. (2022) - Original video diffusion work
- Imagen Video: Ho et al. (2022) - Cascaded video generation
- Make-A-Video: Singer et al. (2022) - Text-to-video from Meta
- Sora Technical Report: OpenAI (2024) - Spacetime patch approach
- Stable Video Diffusion: Stability AI (2023) - Open source img2vid