Learning Objectives
By the end of this section, you will be able to:
- Explain the motivation for ControlNet and why text prompts alone are insufficient for precise spatial control
- Describe the ControlNet architecture including the trainable copy and zero-convolution connections
- Explain why zero convolution is crucial for training stability and for preserving pretrained model quality
- List common condition types (edges, depth, pose, segmentation) and their use cases
- Implement ControlNet inference with adjustable control strength
Motivation: Beyond Text Prompts
Text-to-image models like Stable Diffusion are remarkably capable, but text has inherent limitations for spatial control:
- Spatial positioning: "cat on the left, dog on the right" often fails to place subjects correctly
- Precise pose control: describing a complex body pose in words is nearly impossible
- Architectural layouts: "L-shaped building" gives unpredictable results
- Edge preservation: keeping an exact outline while changing style cannot be expressed in text
The Challenge: How do we add precise spatial control to a pretrained diffusion model without retraining the entire thing? And without destroying what it already learned?
ControlNet, introduced by Zhang et al. in 2023, solves this elegantly by creating a trainable copy of the encoder that processes spatial conditions, while keeping the original model frozen.
ControlNet Architecture
ControlNet's architecture is deceptively simple but remarkably effective:
The Two-Network Design
- Frozen U-Net: The original Stable Diffusion U-Net, weights locked
- Trainable Copy: A copy of the encoder and middle blocks, initialized from the pretrained weights
- Zero Convolutions: Connect the trainable copy to the frozen network
Why Copy the Encoder?
The trainable copy starts with the same weights as the frozen encoder. This means:
- Inherited knowledge: Starts with understanding of image features
- Efficient training: Only needs to learn the delta for conditions
- Quality preservation: Frozen decoder maintains output quality
Connection Points
The trainable copy connects to the frozen U-Net at multiple resolutions:
| Resolution | Channels | Connection |
|---|---|---|
| 64x64 | 320 | Encoder block 1 output |
| 32x32 | 640 | Encoder block 2 output |
| 16x16 | 1280 | Encoder block 3 output |
| 8x8 | 1280 | Middle block output |
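As a shape-level sketch, the injection at these connection points can be illustrated with stand-in NumPy arrays. The resolutions and channel counts come from the table above; everything else is schematic, not the actual U-Net:

```python
import numpy as np

rng = np.random.default_rng(0)

# Connection points from the table above: (resolution, channels).
connections = [(64, 320), (32, 640), (16, 1280), (8, 1280)]

# Frozen U-Net features at each connection (random stand-ins).
frozen_feats = {res: rng.standard_normal((ch, res, res)) for res, ch in connections}

# ControlNet residuals arrive through zero convolutions, so at
# initialization they are exactly zero at every resolution.
residuals = {res: np.zeros((ch, res, res)) for res, ch in connections}

# Injection: each residual is added to the matching frozen feature map.
combined = {res: frozen_feats[res] + residuals[res] for res, _ in connections}

for res, ch in connections:
    assert combined[res].shape == (ch, res, res)
    assert np.allclose(combined[res], frozen_feats[res])  # zero influence at init
```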
Zero Convolution Trick
The key innovation enabling stable ControlNet training is the zero convolution:
Why Zero Initialization?
At the start of training:
- Zero-conv output is zero: The trainable copy adds nothing to the frozen network
- Frozen model works normally: Outputs are exactly as before ControlNet
- No catastrophic interference: The pretrained quality is preserved
As training progresses:
- Gradients flow back: Zero-conv weights gradually become non-zero
- Control signal emerges: The trainable copy learns to produce useful conditioning
- Smooth transition: The model smoothly learns to use conditions
Analogy: Imagine adding a new instrument to a symphony orchestra. Zero-conv is like having that instrument play at zero volume initially, then gradually increasing as the musicians learn to play together.
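A minimal NumPy sketch of a zero-initialized 1x1 convolution shows why the frozen model is untouched at the start of training. This is an illustration of the mechanism, not the paper's implementation:

```python
import numpy as np

def zero_conv(x, weight, bias):
    """1x1 convolution: a per-pixel linear map over channels."""
    # x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)
    out = np.einsum("oc,chw->ohw", weight, x)
    return out + bias[:, None, None]

rng = np.random.default_rng(0)
c_in, c_out, h, w = 4, 4, 8, 8

features = rng.standard_normal((c_in, h, w))     # output of the trainable copy
frozen_skip = rng.standard_normal((c_out, h, w)) # frozen U-Net skip connection

# Zero initialization: weights and bias start at exactly zero.
weight = np.zeros((c_out, c_in))
bias = np.zeros(c_out)

residual = zero_conv(features, weight, bias)
combined = frozen_skip + residual

# At step 0 the ControlNet branch contributes nothing:
assert np.allclose(combined, frozen_skip)
```

Because the inputs to the zero convolution are non-zero, its weights still receive non-zero gradients, which is why the layer can move away from zero as training progresses.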
Types of Spatial Conditions
ControlNet can be trained on various condition types, each offering different control capabilities:
Edge Detection (Canny)
- Input: Black/white edge image from Canny detector
- Control: Precise outline and shape preservation
- Use case: Style transfer while keeping structure
Depth Maps
- Input: Grayscale depth map (MiDaS, ZoeDepth)
- Control: 3D spatial arrangement and perspective
- Use case: Consistent scene geometry, interior design
Human Pose (OpenPose)
- Input: Skeleton keypoints visualization
- Control: Body position, limb angles, gesture
- Use case: Character poses, action shots
Semantic Segmentation
- Input: Color-coded region map
- Control: What goes where (sky, building, person, etc.)
- Use case: Scene composition, layout control
| Condition Type | Precision | Semantic Info | Best For |
|---|---|---|---|
| Canny Edge | Very High | Low | Style transfer |
| Depth | High | Medium | 3D scenes |
| OpenPose | Medium | High | Character poses |
| Segmentation | Medium | Very High | Scene layout |
| Normal Map | High | Medium | Surface details |
| Scribble | Low | Low | Quick sketches |
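As a rough illustration of the Canny-style condition, here is a crude gradient-threshold edge map in NumPy. Real pipelines use a proper detector such as OpenCV's cv2.Canny; this stand-in only shows the shape of the condition image:

```python
import numpy as np

def edge_map(img, threshold=0.25):
    """Gradient-magnitude edges -- a crude stand-in for cv2.Canny."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    # Black/white condition image: 255 on edges, 0 elsewhere.
    return (mag > threshold).astype(np.uint8) * 255

# Toy image: a bright square on a dark background.
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0

edges = edge_map(img)
assert edges.shape == img.shape
assert edges[16, 16] == 0    # flat interior: no edge
assert edges[8, 12] == 255   # square boundary: edge
```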
Training ControlNet
Training a ControlNet is surprisingly efficient because most of the network is frozen:
Training Setup
- Frozen: Original U-Net (encoder + middle + decoder)
- Trainable: Copy of the encoder and middle blocks, plus the zero convolutions
- Parameters: ~300M trainable out of ~1.2B total
Data Requirements
| Condition Type | Training Images | Training Time |
|---|---|---|
| Canny Edge | ~500K | ~3 days on 8 A100s |
| Depth | ~500K | ~3 days on 8 A100s |
| OpenPose | ~200K | ~2 days on 8 A100s |
| Custom | ~50K-100K | ~1-2 days on 8 A100s |
Loss Function
ControlNet uses the same diffusion loss as standard training:
$$\mathcal{L} = \mathbb{E}_{z_0,\, t,\, c_t,\, c_f,\, \epsilon \sim \mathcal{N}(0, I)}\left[\, \lVert \epsilon - \epsilon_\theta(z_t, t, c_t, c_f) \rVert_2^2 \,\right]$$

where $c_f$ is the spatial condition (edges, depth, etc.) and $c_t$ is the text prompt.
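Numerically, the loss is just the mean squared error between the sampled noise and the model's prediction. A sketch with stand-in arrays (no real model involved):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the quantities in the loss (shapes only, no real model).
eps = rng.standard_normal((4, 8, 8))       # true noise added to z_0
eps_pred = rng.standard_normal((4, 8, 8))  # eps_theta(z_t, t, c_t, c_f)

# Per-sample loss: mean squared error between true and predicted noise.
loss = np.mean((eps - eps_pred) ** 2)

assert loss >= 0.0
assert np.isclose(np.mean((eps - eps) ** 2), 0.0)  # perfect prediction gives zero loss
```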
Prompt Dropout
During training, the text prompt is randomly replaced with an empty string (50% of the time in the original paper). This forces the ControlNet to learn semantics directly from the spatial condition, so control still works when prompts are vague or absent.
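Prompt dropout itself is trivial to implement. A sketch, using the 50% rate from the original paper:

```python
import random

rng = random.Random(0)

def maybe_drop_prompt(prompt, rng, p_drop=0.5):
    """Replace the text prompt with an empty string with probability p_drop."""
    return "" if rng.random() < p_drop else prompt

prompts = ["a cat on a sofa"] * 1000
dropped = [maybe_drop_prompt(p, rng) for p in prompts]

frac_empty = sum(1 for p in dropped if p == "") / len(prompts)
assert 0.4 < frac_empty < 0.6  # roughly half the prompts are blanked
```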
Inference with ControlNet
At inference time, ControlNet injects its control signals into the generation process: the condition image is preprocessed, fed through the trainable copy, and the resulting residuals are added to the frozen U-Net at every denoising step.
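The mechanism can be sketched with toy stand-ins for the two networks. This is schematic only, not the diffusers API; the update rule and the 0.01 factor are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def controlnet(z, condition):
    """Toy stand-in for the trainable copy + zero convolutions.

    A real ControlNet returns one residual per connection point;
    here we return a single placeholder residual.
    """
    return [0.01 * condition]

def frozen_unet_step(z, residuals):
    """Toy stand-in for one frozen U-Net denoising step.

    A real ControlNet adds the residuals inside the decoder skip
    connections; here we just fold them into a dummy update.
    """
    return z - 0.1 * (z + sum(residuals))

z = rng.standard_normal((4, 8, 8))  # initial latent noise
condition = np.ones((4, 8, 8))      # preprocessed condition (e.g. edge map)
scale = 1.0                         # control strength

for _ in range(10):                 # denoising loop
    residuals = [scale * r for r in controlnet(z, condition)]
    z = frozen_unet_step(z, residuals)

assert z.shape == (4, 8, 8)
```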
Control Strength
The conditioning scale parameter adjusts how strongly the condition influences generation: the ControlNet residuals are multiplied by this scale before being added to the frozen U-Net.
| Scale | Effect | When to Use |
|---|---|---|
| 0.0 | No control, pure text-to-image | Debugging |
| 0.5 | Balanced control and creativity | Artistic freedom |
| 1.0 | Full control adherence | Precise matching |
| 1.5+ | Over-emphasized control | Very strict matching |
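The scale is applied multiplicatively to the ControlNet residuals before they are added to the frozen features; in the diffusers library this roughly corresponds to the `controlnet_conditioning_scale` argument. A minimal sketch:

```python
import numpy as np

def apply_control(frozen_feat, control_residual, scale):
    """Scale the ControlNet residual before adding it to the frozen features."""
    return frozen_feat + scale * control_residual

feat = np.ones((4, 8, 8))
residual = np.full((4, 8, 8), 2.0)

# scale = 0.0: pure text-to-image, the condition is ignored.
assert np.allclose(apply_control(feat, residual, 0.0), feat)

# scale = 1.0: the full residual is added.
assert np.allclose(apply_control(feat, residual, 1.0), feat + residual)

# scale = 0.5: halfway between the two.
assert np.allclose(apply_control(feat, residual, 0.5), feat + residual / 2)
```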
Multi-ControlNet
Multiple ControlNets can be combined for simultaneous control:
- Depth + Canny: 3D structure with precise edges
- Pose + Face: Full body with detailed face control
- Segmentation + Depth: What goes where with proper 3D
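Combining multiple ControlNets amounts to summing their independently scaled residuals. A sketch with stand-in arrays (the residual values are placeholders):

```python
import numpy as np

def combine_controls(frozen_feat, residuals, scales):
    """Sum independently scaled residuals from several ControlNets."""
    out = frozen_feat.copy()
    for r, s in zip(residuals, scales):
        out = out + s * r
    return out

feat = np.zeros((4, 8, 8))
depth_res = np.ones((4, 8, 8))         # residual from a depth ControlNet
canny_res = np.full((4, 8, 8), 3.0)    # residual from a Canny ControlNet

# Depth at full strength, Canny at half strength.
combined = combine_controls(feat, [depth_res, canny_res], [1.0, 0.5])
assert np.allclose(combined, depth_res + 0.5 * canny_res)
```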
Summary
ControlNet adds powerful spatial conditioning to pretrained diffusion models without destroying their quality:
- Problem solved: Text prompts alone cannot provide precise spatial control over generated images
- Architecture: Trainable copy of encoder connected to frozen U-Net via zero convolutions
- Zero convolution: Key innovation that enables stable training by starting with zero influence and gradually learning
- Condition types: Edges, depth, pose, segmentation, and more - each offering different control granularity
- Efficient training: Only ~300M parameters trained, requiring 50K-500K images and days instead of weeks
- Flexible inference: Control strength adjustable, multiple ControlNets can be combined
Looking Ahead: In the next section, we'll explore image-to-image generation using SDEdit, which provides a different form of image conditioning without training additional networks.