Learning Objectives
By the end of this section, you will:
- Implement ResBlocks with Group Normalization and residual connections
- Understand why GroupNorm is preferred over BatchNorm for diffusion models
- Learn why the SiLU activation outperforms ReLU for generation tasks
- Build downsampling and upsampling blocks for the encoder and decoder paths
- Combine blocks into complete encoder and decoder modules
Why This Matters
ResBlock Fundamentals
The ResBlock (Residual Block) is the core building block of diffusion U-Nets. It combines several key innovations:
- Residual connections: Allow gradients to flow directly through the network
- Normalization: Stabilize training by normalizing intermediate activations
- Time conditioning: Inject timestep information to modulate network behavior
- Nonlinearity: SiLU activations for smooth, non-saturating gradients
The general structure of a ResBlock for diffusion models is:
$$\text{output} = x + \mathcal{F}(x, t_{\text{emb}})$$

where $\mathcal{F}$ is the learned transformation and $t_{\text{emb}}$ is the time embedding. The addition of $x$ is the residual (skip) connection.
Pre-activation vs Post-activation
Group Normalization
Normalization is critical for training deep networks. In diffusion models, we use Group Normalization (GroupNorm) instead of Batch Normalization.
Why Not BatchNorm?
BatchNorm computes statistics across the batch dimension, which has several problems for diffusion:
- Small batch sizes: GPU memory limits batch sizes, making batch statistics noisy
- Inference inconsistency: Running statistics differ from training, causing issues during sampling
- Sample independence: Each image should be processed independently during generation
GroupNorm Explained
GroupNorm divides channels into groups and normalizes within each group, independently for each sample:
$$\hat{x}_i = \frac{x_i - \mu_{g(i)}}{\sqrt{\sigma_{g(i)}^2 + \epsilon}}, \qquad y_i = \gamma \, \hat{x}_i + \beta$$

where $g(i)$ is the group index for channel $i$. The mean $\mu_{g(i)}$ and variance $\sigma_{g(i)}^2$ are computed over all channels in that group and all spatial positions, separately for each sample.
| Normalization | Computed Over | Best For | Batch Size Sensitivity |
|---|---|---|---|
| BatchNorm | N, H, W (per channel) | Classification, fixed batch | High |
| LayerNorm | C, H, W (per sample) | Transformers, NLP | None |
| InstanceNorm | H, W (per channel, sample) | Style transfer | None |
| GroupNorm | Groups of C, H, W (per sample) | Diffusion, small batch | None |
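As a quick illustration of the table above, `nn.GroupNorm` can emulate the other per-sample schemes depending on the number of groups (a sketch; the layer sizes here are arbitrary):

```python
import torch
import torch.nn as nn

# 8 groups over 64 channels: each group of 8 channels is
# normalized together, independently for each sample
gn = nn.GroupNorm(num_groups=8, num_channels=64)

x = torch.randn(2, 64, 32, 32)
y = gn(x)
print(y.shape)  # torch.Size([2, 64, 32, 32])

# Works even with batch size 1 -- statistics never cross samples
y1 = gn(torch.randn(1, 64, 32, 32))

# num_groups=1 normalizes all channels together (LayerNorm-style);
# num_groups=num_channels normalizes each channel alone (InstanceNorm-style)
ln_like = nn.GroupNorm(1, 64)
in_like = nn.GroupNorm(64, 64)
```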
Choosing Number of Groups
Activation Functions
Modern diffusion models use SiLU (Sigmoid Linear Unit), also known as Swish, instead of ReLU:

$$\text{SiLU}(x) = x \cdot \sigma(x)$$
Why SiLU Over ReLU?
| Property | ReLU | SiLU |
|---|---|---|
| Formula | max(0, x) | x * sigmoid(x) |
| Smoothness | Not smooth at 0 | Smooth everywhere |
| Gradient at x=0 | Undefined | 0.5 |
| Negative values | Always 0 | Slightly negative |
| Self-gating | No | Yes (sigmoid gates) |
The key advantages of SiLU for diffusion:
- Smooth gradients: No sharp transition at zero, leading to more stable training
- Non-monotonic: Can output small negative values, enabling richer representations
- Self-gating: The sigmoid modulates the linear part, similar to attention mechanisms
```python
import torch
import torch.nn.functional as F

# SiLU activation (built into PyTorch)
x = torch.randn(4, 64, 32, 32)

# Method 1: Using nn.SiLU module
silu = torch.nn.SiLU()
y = silu(x)

# Method 2: Using functional API
y = F.silu(x)

# Method 3: Manual implementation
y = x * torch.sigmoid(x)

# All three are equivalent!
```
GELU vs SiLU
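GELU and SiLU are close cousins: both gate $x$ with a smooth, CDF-like function (the Gaussian CDF for GELU, the sigmoid for SiLU), so they nearly coincide for small inputs and both approach ReLU for large positive inputs. A quick numerical comparison (illustrative only):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

# GELU gates with the Gaussian CDF, SiLU with the sigmoid
print(F.gelu(x))
print(F.silu(x))

# The two stay within a small gap of each other on this range
diff = (F.gelu(x) - F.silu(x)).abs().max()
print(f"max |GELU - SiLU| on [-3, 3]: {diff:.3f}")
```

In practice the choice between them is largely empirical; diffusion codebases overwhelmingly use SiLU.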
Convolution Layers
The convolutions in diffusion U-Nets are standard 2D convolutions with specific choices:
- Kernel size: 3x3 is the standard choice, providing a good balance between receptive field and computational cost
- Padding: padding=1 preserves spatial dimensions (for 3x3 kernels)
- Stride: stride=1 for feature processing, stride=2 for downsampling
- Bias: Often disabled when followed by normalization (which has its own bias)
```python
import torch.nn as nn

# Standard 3x3 convolution preserving spatial size
conv = nn.Conv2d(
    in_channels=128,
    out_channels=256,
    kernel_size=3,
    stride=1,
    padding=1,
    bias=False  # Disabled when using GroupNorm
)

# Spatial dimensions: H_out = H_in for padding=1, stride=1

# 1x1 convolution for channel mixing / projection
proj = nn.Conv2d(128, 256, kernel_size=1, bias=False)

# Strided convolution for downsampling (reduces size by 2)
downsample = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)
```
Complete ResBlock Implementation
Now let's implement a complete ResBlock for diffusion models with time conditioning:
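One possible implementation is sketched below. Details such as the scale-shift form of conditioning, dropout, and the exact number of groups vary between codebases; the parameter names here (`time_emb_dim`, `num_groups`) are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Pre-activation residual block with additive time conditioning."""

    def __init__(self, in_channels, out_channels, time_emb_dim, num_groups=32):
        super().__init__()
        self.norm1 = nn.GroupNorm(num_groups, in_channels)
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1, bias=False)

        # Projects the time embedding to a per-channel bias
        self.time_proj = nn.Linear(time_emb_dim, out_channels)

        self.norm2 = nn.GroupNorm(num_groups, out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)

        # 1x1 projection so the skip connection matches out_channels
        if in_channels != out_channels:
            self.skip = nn.Conv2d(in_channels, out_channels, 1)
        else:
            self.skip = nn.Identity()

    def forward(self, x, time_emb):
        # GroupNorm -> SiLU -> Conv (pre-activation ordering)
        h = self.conv1(F.silu(self.norm1(x)))
        # Broadcast the time embedding over the spatial dimensions
        h = h + self.time_proj(time_emb)[:, :, None, None]
        h = self.conv2(F.silu(self.norm2(h)))
        # Residual connection
        return h + self.skip(x)

block = ResBlock(64, 128, time_emb_dim=256)
x = torch.randn(2, 64, 32, 32)
t = torch.randn(2, 256)
out = block(x, t)
print(out.shape)  # torch.Size([2, 128, 32, 32])
```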
Time Embedding Broadcasting
time_emb[:, :, None, None] broadcasts to all spatial locations. This means the same time modulation applies uniformly across the image.
Downsampling Blocks
The encoder path of the U-Net reduces spatial resolution at each level. There are two common approaches:
Strided Conv vs Pooling
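Both approaches can be sketched as follows (the channel counts and input sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 128, 32, 32)

# Approach 1: strided 3x3 convolution -- a learnable downsampler
down_conv = nn.Conv2d(128, 128, kernel_size=3, stride=2, padding=1)
print(down_conv(x).shape)  # torch.Size([2, 128, 16, 16])

# Approach 2: average pooling followed by a convolution --
# the pooling step itself has no parameters
down_pool = nn.Sequential(
    nn.AvgPool2d(kernel_size=2),
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
)
print(down_pool(x).shape)  # torch.Size([2, 128, 16, 16])
```

Strided convolution lets the network learn how to summarize each 2x2 neighborhood, which is why many diffusion U-Nets prefer it over fixed pooling.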
Upsampling Blocks
The decoder path increases spatial resolution. Again, there are multiple approaches:
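The two most common options can be sketched as follows (illustrative; the interpolation mode and kernel sizes are typical choices, not mandated):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(2, 128, 16, 16)

# Approach 1: nearest-neighbor interpolation + 3x3 conv
# (the usual choice in diffusion U-Nets)
up_conv = nn.Conv2d(128, 128, kernel_size=3, padding=1)
y = up_conv(F.interpolate(x, scale_factor=2, mode="nearest"))
print(y.shape)  # torch.Size([2, 128, 32, 32])

# Approach 2: transposed convolution -- learnable upsampling,
# but prone to checkerboard artifacts when the kernel size
# is not divisible by the stride
up_tconv = nn.ConvTranspose2d(128, 128, kernel_size=4, stride=2, padding=1)
print(up_tconv(x).shape)  # torch.Size([2, 128, 32, 32])
```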
Checkerboard Artifacts
Summary
In this section, we implemented the fundamental building blocks of diffusion U-Nets:
- ResBlock: The core module combining normalization, activation, convolution, time conditioning, and residual connections
- GroupNorm: Normalization that works with any batch size, essential for diffusion training and inference
- SiLU activation: Smooth, self-gating activation that improves gradient flow compared to ReLU
- Downsampling: Strided convolutions or pooling to reduce spatial resolution in the encoder
- Upsampling: Interpolation + convolution to increase resolution in the decoder while avoiding artifacts