Chapter 17

Video Generation

The Future of Diffusion Models

Learning Objectives

By the end of this section, you will be able to:

  1. Understand the unique challenges of extending diffusion models from images to video
  2. Explain temporal attention mechanisms and how they model motion and consistency
  3. Compare different video architectures including 3D U-Nets and factorized approaches
  4. Understand autoregressive-diffusion hybrids used in systems like Sora
  5. Appreciate the current limitations and active research directions in video generation

From Images to Video

Video generation represents one of the most exciting frontiers in diffusion modeling. While image generation has achieved remarkable quality, extending to video introduces fundamental new challenges:

| Challenge | Image Generation | Video Generation |
| --- | --- | --- |
| Dimensionality | H x W x 3 | T x H x W x 3 (100x+ more data) |
| Consistency | Single-frame coherence | Cross-frame coherence |
| Motion | Not applicable | Physics, smooth trajectories |
| Memory | ~8 GB for SDXL | ~80+ GB for high-res video |
| Compute | Seconds per image | Minutes to hours per video |
The Core Challenge: A 5-second 1080p video at 24 fps contains 120 frames, each with roughly 2 million pixels. That's nearly 250 million pixels that must be temporally coherent: objects must move smoothly, lighting must stay consistent, and physics must remain plausible.
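
This arithmetic is worth verifying; a few lines of Python reproduce the numbers quoted above (a back-of-the-envelope check, not a statement about any particular model):

```python
def video_pixel_count(seconds, fps, height, width):
    """Total pixel positions that must stay temporally coherent."""
    num_frames = int(seconds * fps)
    return num_frames * height * width

image_pixels = 1080 * 1920                          # one 1080p frame (~2.07M pixels)
clip_pixels = video_pixel_count(5, 24, 1080, 1920)  # 5 s at 24 fps = 120 frames
print(clip_pixels, clip_pixels // image_pixels)     # 248832000, 120x one image
```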

Why Temporal Consistency Is Hard

If we naively generate each frame independently using an image model, we get "flickering" - objects randomly change appearance, lighting fluctuates, and motion is jerky. The model has no understanding that frames are connected.

Humans are extremely sensitive to temporal inconsistency. Even subtle flickering in textures or colors is immediately noticeable and distracting. Video models must learn:

  • Object persistence: The same cat should look the same in frame 1 and frame 100
  • Motion smoothness: Objects should follow smooth trajectories, not teleport
  • Physical plausibility: Objects should obey gravity, collisions, etc.
  • Temporal causality: Effects should follow causes (glass breaks after being hit)

Temporal Modeling

The key innovation in video diffusion models is adding temporal attention alongside spatial attention. This allows the model to reason about relationships between frames.

Temporal Attention

In addition to attending to all spatial positions within a frame, temporal attention allows each position to attend to the same spatial position across time:

```python
import torch
import torch.nn as nn
from einops import rearrange

class TemporalAttention(nn.Module):
    """
    Temporal attention for video models.
    Each spatial position attends to the same position across all frames.
    """

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.to_qkv = nn.Linear(dim, dim * 3)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, time, height, width, dim)
        b, t, h, w, d = x.shape

        # Rearrange so each spatial position can attend across time:
        # (batch * height * width, time, dim)
        x = rearrange(x, 'b t h w d -> (b h w) t d')

        # Standard attention computation
        qkv = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = map(
            lambda m: rearrange(m, 'n t (heads d) -> n heads t d', heads=self.num_heads),
            qkv
        )

        # Attention scores across the time dimension
        attn = torch.matmul(q, k.transpose(-1, -2)) * self.scale
        attn = attn.softmax(dim=-1)

        # Apply attention and merge heads
        out = torch.matmul(attn, v)
        out = rearrange(out, 'n heads t d -> n t (heads d)')
        out = self.to_out(out)

        # Restore spatial dimensions
        out = rearrange(out, '(b h w) t d -> b t h w d', b=b, h=h, w=w)

        return out
```

Combined Spatiotemporal Attention

Modern video models typically alternate between spatial and temporal attention, allowing information to flow both within and across frames:

```python
class SpatiotemporalBlock(nn.Module):
    """
    Block that applies both spatial and temporal attention.
    Pattern: Spatial -> Temporal -> FFN
    """

    def __init__(self, dim, spatial_heads=8, temporal_heads=8):
        super().__init__()

        # Spatial attention (within each frame); SpatialAttention follows the
        # same pattern as TemporalAttention, but attends over (h, w) positions
        self.spatial_norm = nn.LayerNorm(dim)
        self.spatial_attn = SpatialAttention(dim, spatial_heads)

        # Temporal attention (across frames)
        self.temporal_norm = nn.LayerNorm(dim)
        self.temporal_attn = TemporalAttention(dim, temporal_heads)

        # Feed-forward network
        self.ffn_norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Linear(dim * 4, dim),
        )

    def forward(self, x, t_emb=None):
        # x: (batch, time, height, width, dim)

        # Spatial attention (within each frame)
        h = self.spatial_norm(x)
        h = rearrange(h, 'b t h w d -> (b t) h w d')  # treat each frame independently
        h = self.spatial_attn(h)
        h = rearrange(h, '(b t) h w d -> b t h w d', t=x.shape[1])
        x = x + h

        # Temporal attention (across frames)
        h = self.temporal_norm(x)
        h = self.temporal_attn(h)
        x = x + h

        # FFN
        h = self.ffn_norm(x)
        h = self.ffn(h)
        x = x + h

        return x
```

Video Diffusion Architectures

3D U-Net

The most straightforward extension is to use 3D convolutions in the U-Net, allowing the network to jointly process spatial and temporal dimensions:

```python
class Video3DConvBlock(nn.Module):
    """3D convolution block for video U-Net."""

    def __init__(self, in_channels, out_channels, time_kernel=3):
        super().__init__()

        # 3D convolution: processes space and time together
        self.conv = nn.Conv3d(
            in_channels, out_channels,
            kernel_size=(time_kernel, 3, 3),
            padding=(time_kernel // 2, 1, 1),
        )
        # num_groups must divide out_channels (matters for the final 4-channel block)
        self.norm = nn.GroupNorm(min(8, out_channels), out_channels)
        self.act = nn.SiLU()

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        return self.act(self.norm(self.conv(x)))


class VideoUNet3D(nn.Module):
    """
    3D U-Net for video diffusion.
    Processes video as a 5D tensor: (B, C, T, H, W)
    """

    def __init__(self, in_channels=4, model_channels=320, num_frames=16):
        super().__init__()
        self.num_frames = num_frames

        # Encoder with 3D convolutions
        self.encoder = nn.ModuleList([
            Video3DConvBlock(in_channels, model_channels),
            Video3DConvBlock(model_channels, model_channels * 2),
            Video3DConvBlock(model_channels * 2, model_channels * 4),
        ])

        # Bottleneck with spatiotemporal attention
        self.bottleneck = SpatiotemporalBlock(model_channels * 4)

        # Decoder with 3D convolutions
        self.decoder = nn.ModuleList([
            Video3DConvBlock(model_channels * 4, model_channels * 2),
            Video3DConvBlock(model_channels * 2, model_channels),
            Video3DConvBlock(model_channels, in_channels),
        ])

    def forward(self, x, t, context=None):
        # x: (B, C, T, H, W) - video latents
        # t: timestep embedding
        # context: text conditioning

        # Process through U-Net
        # (Simplified - real implementation has skip connections, time conditioning, etc.)
        h = x
        for block in self.encoder:
            h = block(h)

        h = rearrange(h, 'b c t h w -> b t h w c')
        h = self.bottleneck(h)
        h = rearrange(h, 'b t h w c -> b c t h w')

        for block in self.decoder:
            h = block(h)

        return h
```

Factorized Approaches

Full 3D attention is computationally expensive. Many systems use factorized attention, separating spatial and temporal processing:

| Approach | Description | Trade-off |
| --- | --- | --- |
| Full 3D | Every position attends to all spatiotemporal positions | Best quality, highest cost |
| Spatial + Temporal | Separate 2D spatial and 1D temporal attention | Good balance, most common |
| Causal Temporal | Frame t attends only to frames ≤ t | Enables autoregressive generation |
| Windowed Temporal | Attend to nearby frames only | Efficient for long videos |
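
The causal and windowed variants differ only in the boolean mask applied along the time axis. The helper below is an illustrative sketch (names are ours, not from any particular codebase):

```python
import torch

def temporal_attention_mask(num_frames, causal=False, window=None):
    """Boolean (T, T) mask: entry [i, j] is True if frame i may attend to frame j."""
    i = torch.arange(num_frames).unsqueeze(1)  # query frame index
    j = torch.arange(num_frames).unsqueeze(0)  # key frame index
    mask = torch.ones(num_frames, num_frames, dtype=torch.bool)
    if causal:
        mask &= j <= i                   # no attention to future frames
    if window is not None:
        mask &= (i - j).abs() <= window  # only nearby frames
    return mask

# Causal mask is lower-triangular: each frame sees itself and the past
print(temporal_attention_mask(4, causal=True).int())
```

In an attention layer, positions where the mask is False would have their scores set to -inf before the softmax.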

State-of-the-Art Systems

Several breakthrough systems have demonstrated remarkable video generation capabilities:

OpenAI Sora (2024)

Sora represents a paradigm shift, generating minute-long videos with unprecedented consistency and quality. Key innovations:

  • Diffusion Transformer (DiT): Uses transformer architecture instead of U-Net, scaling more efficiently
  • Spacetime Patches: Treats video as 3D patches, like ViT but extended to video
  • Variable Resolution/Duration: Trained on diverse aspect ratios and lengths
  • Recaptioning: Uses LLMs to create detailed training captions
```python
class DiTBlock(nn.Module):
    """
    Diffusion Transformer block (conceptual implementation).
    Uses adaptive layer norm (adaLN) for conditioning: a parameter-free
    LayerNorm whose shift/scale are predicted from the timestep embedding.
    """

    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Linear(dim * 4, dim),
        )

        # Adaptive LayerNorm conditioning: shift, scale, and gate
        # for both the attention and FFN branches (6 vectors in total)
        self.adaln = nn.Sequential(
            nn.SiLU(),
            nn.Linear(dim, dim * 6),
        )

    def forward(self, x, t_emb):
        # x: (batch, seq, dim); t_emb: (batch, dim)
        adaln_out = self.adaln(t_emb)
        (shift_attn, scale_attn, gate_attn,
         shift_ffn, scale_ffn, gate_ffn) = adaln_out.chunk(6, dim=-1)

        # Modulated attention
        h = self.attn_norm(x) * (1 + scale_attn.unsqueeze(1)) + shift_attn.unsqueeze(1)
        h = self.attention(h, h, h)[0]
        x = x + gate_attn.unsqueeze(1) * h

        # Modulated FFN
        h = self.ffn_norm(x) * (1 + scale_ffn.unsqueeze(1)) + shift_ffn.unsqueeze(1)
        h = self.ffn(h)
        x = x + gate_ffn.unsqueeze(1) * h

        return x
```

Runway Gen-3 Alpha

  • High-fidelity motion: Excellent at complex camera movements and subject motion
  • Image-to-video: Can animate static images with specified motion
  • Multi-modal conditioning: Text, image, or reference motion

Stable Video Diffusion (Open Source)

```python
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video
import torch

# Load the pipeline
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# Load a conditioning image
image = load_image("input_image.png")
image = image.resize((1024, 576))  # SVD requires specific resolution

# Generate video frames
generator = torch.Generator("cuda").manual_seed(42)

frames = pipe(
    image,
    decode_chunk_size=8,  # Memory optimization
    generator=generator,
    num_frames=25,  # Generate 25 frames
    motion_bucket_id=127,  # Control amount of motion (0-255)
    noise_aug_strength=0.02,  # Noise augmentation for conditioning image
).frames[0]

# Export to video file
export_to_video(frames, "output_video.mp4", fps=7)
```

Pika Labs

  • Consumer-focused: Optimized for accessibility and ease of use
  • Lip sync: Specialized features for talking head generation
  • Style transfer: Video-to-video style transformation

Implementation Concepts

Training Data Considerations

Video training data is more complex than images:

  • Caption quality: Videos need frame-aligned or segment-level captions describing action, not just objects
  • Motion diversity: Training set must include diverse camera movements, subject motions, and scene dynamics
  • Temporal resolution: Different frame rates capture different types of motion
  • Quality filtering: Remove static videos, watermarks, poor quality clips
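
The simplest of these filters, dropping near-static clips, can be implemented as a thresholded mean frame difference. This is an illustrative sketch; the threshold is a made-up example value that would be tuned per dataset in practice:

```python
import torch

def is_static(video, threshold=0.01):
    """video: (T, C, H, W) tensor in [0, 1]. True if frames barely change."""
    if video.shape[0] < 2:
        return True
    mean_diff = (video[1:] - video[:-1]).abs().mean()
    return mean_diff.item() < threshold

static_clip = torch.zeros(8, 3, 16, 16)  # identical frames
moving_clip = torch.rand(8, 3, 16, 16)   # frames change everywhere
print(is_static(static_clip), is_static(moving_clip))  # True False
```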

Latent Video Diffusion

Like image models, video models typically work in latent space for efficiency. The VAE must be extended to handle temporal dimensions:

```python
class VideoVAE(nn.Module):
    """
    Video VAE that compresses along spatial AND temporal dimensions.
    Typical compression: 8x spatial, 4x temporal.
    """

    def __init__(self, latent_channels=4, spatial_compress=8, temporal_compress=4):
        super().__init__()
        self.spatial_compress = spatial_compress
        self.temporal_compress = temporal_compress

        # 3D encoder
        self.encoder = nn.Sequential(
            # Spatial downsampling (three stride-2 convs = 8x)
            nn.Conv3d(3, 64, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.SiLU(),
            nn.Conv3d(128, 256, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.SiLU(),
            # Temporal downsampling (two stride-2 convs = 4x)
            nn.Conv3d(256, 256, kernel_size=(4, 3, 3), stride=(2, 1, 1), padding=(1, 1, 1)),
            nn.SiLU(),
            nn.Conv3d(256, 256, kernel_size=(4, 3, 3), stride=(2, 1, 1), padding=(1, 1, 1)),
            nn.SiLU(),
            # To latent (mean and logvar channels)
            nn.Conv3d(256, latent_channels * 2, kernel_size=1),
        )

        # 3D decoder (mirrors encoder)
        self.decoder = nn.Sequential(
            nn.Conv3d(latent_channels, 256, kernel_size=1),
            # Temporal upsampling
            nn.ConvTranspose3d(256, 256, kernel_size=(4, 3, 3), stride=(2, 1, 1), padding=(1, 1, 1)),
            nn.SiLU(),
            nn.ConvTranspose3d(256, 256, kernel_size=(4, 3, 3), stride=(2, 1, 1), padding=(1, 1, 1)),
            nn.SiLU(),
            # Spatial upsampling
            nn.ConvTranspose3d(256, 128, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.SiLU(),
            nn.ConvTranspose3d(128, 64, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.SiLU(),
            nn.ConvTranspose3d(64, 3, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
        )

    def encode(self, video):
        # video: (B, C, T, H, W)
        h = self.encoder(video)
        mean, logvar = h.chunk(2, dim=1)
        # Reparameterization
        std = torch.exp(0.5 * logvar)
        z = mean + std * torch.randn_like(std)
        return z, mean, logvar

    def decode(self, z):
        return self.decoder(z)
```
Challenges and Limitations

Current Limitations

| Limitation | Description | Research Direction |
| --- | --- | --- |
| Length | Most models limited to ~10 seconds | Hierarchical generation, sliding window |
| Resolution | Often 512p-720p, rarely 4K | Efficient architectures, cascaded models |
| Physics | Objects often violate physics | Physics-informed training, simulation data |
| Consistency | Long videos still show drift | Better temporal models, memory mechanisms |
| Control | Hard to specify exact motion | Motion conditioning, trajectory input |
| Compute | Very expensive to train and run | Efficient attention, quantization |
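
The sliding-window idea in the table can be sketched as a loop that conditions each new chunk on the trailing frames of the video generated so far. `generate_chunk` stands in for a full diffusion sampler and is purely hypothetical:

```python
def generate_long_video(generate_chunk, num_chunks, overlap=4):
    """
    Sliding-window generation: each chunk is conditioned on the trailing
    `overlap` frames of the video so far, to keep chunks consistent.
    generate_chunk(context) must return a list of new frames.
    """
    video = generate_chunk([])          # first chunk: no context
    for _ in range(num_chunks - 1):
        context = video[-overlap:]      # condition on the trailing frames
        video.extend(generate_chunk(context))
    return video

# Stand-in sampler: "frames" are just global indices, 16 per chunk
def fake_sampler(context, chunk_len=16):
    start = context[-1] + 1 if context else 0
    return list(range(start, start + chunk_len))

video = generate_long_video(fake_sampler, num_chunks=3)
print(len(video))  # 48
```

The catch, noted in the table's Consistency row, is that errors accumulate: each chunk only sees a short context, so long videos still drift.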

Evaluation Challenges

Evaluating video quality is even harder than images:

  • FVD (Fréchet Video Distance): Extension of FID, but doesn't fully capture temporal quality
  • Human evaluation: Gold standard but expensive and subjective
  • Text alignment: Does the video match the prompt? Hard to measure automatically
  • Temporal consistency: Need specialized metrics for flicker, drift, etc.
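
A crude illustration of why specialized metrics are needed: frame-to-frame differences capture flicker, while first-to-last differences capture drift, and the two can disagree. Real metrics such as FVD compare distributions of pretrained video features instead; this sketch uses raw pixels only:

```python
import torch

def temporal_metrics(frames):
    """frames: (T, C, H, W) in [0, 1]. Returns (flicker, drift) scores."""
    flicker = (frames[1:] - frames[:-1]).abs().mean().item()  # frame-to-frame jitter
    drift = (frames[-1] - frames[0]).abs().mean().item()      # slow cumulative change
    return flicker, drift

# A slow fade has low flicker but high drift; independent noise frames flicker badly.
T = 25
fade = torch.linspace(0, 1, T).view(T, 1, 1, 1).expand(T, 3, 8, 8)
noise = torch.rand(T, 3, 8, 8)
f_fade, d_fade = temporal_metrics(fade)
f_noise, d_noise = temporal_metrics(noise)
print(f_fade < f_noise, d_fade > d_noise)  # True True
```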

Summary

Video generation with diffusion models represents a frontier of research with tremendous progress and remaining challenges:

  1. Temporal modeling: Adding attention across time enables cross-frame consistency
  2. Architecture choices: 3D U-Nets, factorized attention, and Diffusion Transformers offer different trade-offs
  3. State-of-the-art systems: Sora, Gen-3, and SVD demonstrate impressive capabilities
  4. Latent video: Working in compressed latent space makes video generation tractable
  5. Open challenges: Length, resolution, physics, and control remain active research areas

Looking Ahead: In the next section, we'll explore 3D generation: creating three-dimensional objects and scenes from text or images using diffusion models.

Key Papers and Resources

  • Video Diffusion Models: Ho et al. (2022) - Original video diffusion work
  • Imagen Video: Ho et al. (2022) - Cascaded video generation
  • Make-A-Video: Singer et al. (2022) - Text-to-video from Meta
  • Sora Technical Report: OpenAI (2024) - Spacetime patch approach
  • Stable Video Diffusion: Stability AI (2023) - Open source img2vid