Chapter 14

Image-to-Image Generation

Advanced Conditioning Techniques

Learning Objectives

By the end of this section, you will be able to:

  1. Explain the SDEdit algorithm and how it enables image-to-image transformation without additional training
  2. Understand the noise injection process and its role in controlling the transformation
  3. Use the strength parameter to balance between input preservation and prompt adherence
  4. Implement img2img generation with Stable Diffusion
  5. Apply img2img to practical tasks like style transfer, sketch refinement, and image editing

SDEdit: Stochastic Differential Editing

SDEdit, introduced by Meng et al. (2021), provides a remarkably simple yet powerful approach to image editing with diffusion models. The key insight:

Core Idea: Add noise to an input image to a certain level, then denoise it with a new text prompt. The amount of noise controls how much of the original image is preserved.

Why This Works

Consider the diffusion process:

  • At t = 0: Clean image with all details
  • At t = 500: Image structure visible, but details are noisy
  • At t = 1000: Pure noise, no original image information

By adding noise to a specific timestep and then denoising:

  • Structure preserved: Large-scale features survive the noising
  • Details regenerated: Fine details are created fresh during denoising
  • Prompt guides regeneration: Text conditioning shapes the new details

SDEdit Algorithm
```python
def img2img(input_image, prompt, strength=0.75, num_steps=50):
    """SDEdit: transform an image guided by a text prompt."""

    # 1. Encode input image to latent space
    latents = vae.encode(input_image).latent_dist.sample() * 0.18215

    # 2. Add noise up to the timestep determined by strength.
    # The scheduler's timesteps run from high noise to low, so a
    # strength of 0.75 means running the last 37 of 50 steps.
    scheduler.set_timesteps(num_steps)
    init_step = num_steps - int(num_steps * strength)  # e.g., 13 for strength=0.75
    t_start = scheduler.timesteps[init_step]
    noise = torch.randn_like(latents)
    noisy_latents = scheduler.add_noise(latents, noise, t_start)

    # 3. Denoise from t_start to 0 with text conditioning
    text_embeddings = text_encoder(tokenizer(prompt))
    for t in scheduler.timesteps[init_step:]:
        noise_pred = unet(noisy_latents, t, text_embeddings)
        noisy_latents = scheduler.step(noise_pred, t, noisy_latents).prev_sample

    # 4. Decode to pixel space
    image = vae.decode(noisy_latents / 0.18215).sample
    return image
```

The Noise Injection Process

The noise injection follows the standard forward diffusion formula:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

Where:

  • $z_0$ is the encoded input image
  • $\epsilon \sim \mathcal{N}(0, I)$ is random noise
  • $\bar{\alpha}_t$ determines the signal-to-noise ratio
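
The formula is easy to check numerically. The sketch below uses the standard DDPM linear beta schedule (an illustrative choice; Stable Diffusion's trained schedule gives somewhat different values) to apply the forward-diffusion formula and print the signal weight $\sqrt{\bar{\alpha}_t}$ at a few timesteps:

```python
import torch

# Linear beta schedule as in DDPM (illustrative values only)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(z0, t):
    """Apply z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * eps."""
    eps = torch.randn_like(z0)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps

z0 = torch.randn(1, 4, 64, 64)  # stand-in for an encoded image latent
for t in (100, 500, 900):
    zt = add_noise(z0, t)
    print(f"t={t}: signal weight = {alphas_cumprod[t].sqrt():.3f}")
```

The signal weight falls monotonically with $t$: early timesteps keep most of the input, late timesteps almost none.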

Signal Preservation

The amount of original image signal preserved depends on αˉt\bar{\alpha}_t:

| Timestep | alpha_bar | Signal Ratio | What Survives |
|---|---|---|---|
| t = 100 | 0.98 | 98% | Almost everything |
| t = 300 | 0.85 | 85% | Overall structure, colors |
| t = 500 | 0.60 | 60% | Major shapes, composition |
| t = 700 | 0.30 | 30% | Rough layout only |
| t = 900 | 0.05 | 5% | Almost nothing |

Intuition: Noise injection is like progressively blurring an image. At low noise levels, you can still see details. At high noise levels, only the vague silhouette remains.

The Strength Parameter

The strength parameter (0.0 to 1.0) controls the transformation intensity:

```python
# Strength determines how many denoising steps actually run
num_inference_steps = 50

# Low strength: minimal change
num_denoise_steps = int(num_inference_steps * 0.3)  # run 15 of 50 steps
# Result: subtle style adjustment, original image mostly preserved

# Medium strength: balanced transformation
num_denoise_steps = int(num_inference_steps * 0.6)  # run 30 of 50 steps
# Result: style changed, structure preserved

# High strength: major transformation
num_denoise_steps = int(num_inference_steps * 0.9)  # run 45 of 50 steps
# Result: almost full regeneration, only hints of original
```

Choosing the Right Strength

| Strength | Use Case | Original Preserved |
|---|---|---|
| 0.2 - 0.3 | Subtle tweaks, color correction | Very high |
| 0.4 - 0.5 | Style transfer, texture changes | High |
| 0.5 - 0.7 | Significant style change, content preserved | Medium |
| 0.7 - 0.8 | Major transformation, composition kept | Low |
| 0.8 - 1.0 | Almost pure generation, slight hints | Minimal |

Common Mistake

Many users set strength too high (0.8+) when trying to preserve structure. For most editing tasks, 0.4-0.6 provides a better balance between prompt adherence and input preservation.
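
In practice, strength is converted into a number of denoising steps. A minimal sketch of that mapping (modeled on the clipping logic the Diffusers img2img pipeline applies internally; the helper name is my own):

```python
def steps_for_strength(num_inference_steps, strength):
    """Return (start_index, steps_run): how far into the noise
    schedule to begin, and how many denoising steps execute."""
    steps_run = min(int(num_inference_steps * strength), num_inference_steps)
    start_index = max(num_inference_steps - steps_run, 0)
    return start_index, steps_run

print(steps_for_strength(50, 0.3))   # (35, 15): only the last 15 steps run
print(steps_for_strength(50, 0.75))  # (13, 37)
print(steps_for_strength(50, 1.0))   # (0, 50): full text-to-image generation
```

Because the scheduler's timestep list is ordered from high noise to low, starting at index 35 means beginning from a only mildly noised latent, which is why low strengths preserve so much of the input.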

Implementation

Here's a complete implementation using the Diffusers library:

Basic img2img

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Load the pipeline
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Load and resize input image to the model resolution
input_image = Image.open("sketch.png").convert("RGB")
input_image = input_image.resize((512, 512))

# Generate transformed image
output = pipe(
    prompt="A detailed oil painting of a mountain landscape",
    image=input_image,
    strength=0.6,           # 60% regenerated, 40% preserved
    guidance_scale=7.5,     # prompt adherence (CFG scale)
    num_inference_steps=50
)

output.images[0].save("transformed.png")
```

Advanced: Inpainting

A variant of img2img that only regenerates masked regions:

  • Input: Image + mask (white = regenerate, black = keep)
  • Process: Only noisy regions are denoised; masked areas preserved
  • Use case: Object removal, adding elements, fixing artifacts
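
The per-step blend described above can be sketched as follows (a simplified illustration of latent-space inpainting, not the exact Diffusers pipeline code; names are my own):

```python
import torch

def inpaint_blend(denoised_latents, original_latents, noise, mask, a_bar_t):
    """Blend after each denoising step.
    mask: 1 where the region should be regenerated,
          0 where the original image must be kept."""
    # Re-noise the original to the current step's noise level...
    noised_original = (
        a_bar_t.sqrt() * original_latents + (1 - a_bar_t).sqrt() * noise
    )
    # ...then keep the model's output only inside the masked region
    return mask * denoised_latents + (1 - mask) * noised_original
```

With mask = 0 everywhere this reduces to plain forward diffusion of the original; with mask = 1 everywhere it reduces to ordinary img2img.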

Applications and Use Cases

1. Style Transfer

Transform images to different artistic styles:

  • Input: Photograph
  • Prompt: "An oil painting in the style of Van Gogh"
  • Strength: 0.5-0.7 (preserve structure, change style)

2. Sketch to Image

Refine rough sketches into detailed images:

  • Input: Hand-drawn sketch
  • Prompt: "Detailed digital art of [subject]"
  • Strength: 0.6-0.8 (add details, follow structure)

3. Photo Enhancement

Improve photo quality and add details:

  • Input: Low-quality or blurry photo
  • Prompt: "High quality, detailed photograph of [subject]"
  • Strength: 0.3-0.5 (enhance without changing content)

4. Scene Modifications

Change aspects of a scene while keeping composition:

  • Input: Daytime scene
  • Prompt: "Same scene at night with stars"
  • Strength: 0.5-0.7 (change lighting, keep structure)

Pro Tip: For best results, describe both what you want AND elements from the original image in your prompt. This helps guide the model to preserve desired features.

Summary

Image-to-image generation with SDEdit provides powerful transformation capabilities without additional training:

  1. SDEdit algorithm: Add noise to input, then denoise with new prompt - simple yet effective
  2. Noise injection: Forward diffusion formula determines how much original signal is preserved
  3. Strength parameter: 0.0 = no change, 1.0 = pure generation; typical range 0.4-0.7 for editing
  4. Applications: Style transfer, sketch refinement, photo enhancement, scene modifications
  5. Key insight: Structure survives noising better than details, enabling structural preservation with detail regeneration

Looking Ahead: In the next section, we'll explore IP-Adapter, which enables using images as prompts - providing a different form of image conditioning that captures style and content semantically.