Chapter 14

Image-to-Image Generation

Advanced Conditioning Techniques

Learning Objectives

By the end of this section, you will be able to:

  1. Explain the SDEdit algorithm and how it enables image-to-image transformation without additional training
  2. Understand the noise injection process and its role in controlling the transformation
  3. Use the strength parameter to balance between input preservation and prompt adherence
  4. Implement img2img generation with Stable Diffusion
  5. Apply img2img to practical tasks like style transfer, sketch refinement, and image editing

SDEdit: Stochastic Differential Editing

SDEdit, introduced by Meng et al. (2021), provides a remarkably simple yet powerful approach to image editing with diffusion models. The key insight:

Core Idea: Add noise to an input image to a certain level, then denoise it with a new text prompt. The amount of noise controls how much of the original image is preserved.

Why This Works

Consider the diffusion process:

  • At t = 0: Clean image with all details
  • At t = 500: Image structure visible, but details are noisy
  • At t = 1000: Pure noise, no original image information

By adding noise to a specific timestep and then denoising:

  • Structure preserved: Large-scale features survive the noising
  • Details regenerated: Fine details are created fresh during denoising
  • Prompt guides regeneration: Text conditioning shapes the new details

SDEdit Algorithm
```python
def img2img(input_image, prompt, strength=0.75, num_steps=50):
    """SDEdit: transform an image guided by a text prompt."""

    # 1. Encode input image to latent space
    latents = vae.encode(input_image).latent_dist.sample() * 0.18215

    # 2. Add noise up to the timestep determined by strength.
    # The scheduler's timesteps run from high noise to low, so a
    # strength of 0.75 means running the last 37 of 50 steps.
    scheduler.set_timesteps(num_steps)
    init_step = num_steps - int(num_steps * strength)  # e.g., 13 for strength=0.75
    t_start = scheduler.timesteps[init_step]
    noise = torch.randn_like(latents)
    noisy_latents = scheduler.add_noise(latents, noise, t_start)

    # 3. Denoise from t_start to 0 with text conditioning
    text_embeddings = text_encoder(tokenizer(prompt))
    for t in scheduler.timesteps[init_step:]:
        noise_pred = unet(noisy_latents, t, text_embeddings)
        noisy_latents = scheduler.step(noise_pred, t, noisy_latents).prev_sample

    # 4. Decode to pixel space
    image = vae.decode(noisy_latents / 0.18215).sample
    return image
```

The Noise Injection Process

The noise injection follows the standard forward diffusion formula:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

Where:

  • $z_0$ is the encoded input image
  • $\epsilon \sim \mathcal{N}(0, I)$ is random noise
  • $\bar{\alpha}_t$ determines the signal-to-noise ratio
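
The formula is easy to check numerically. The sketch below uses the standard DDPM linear beta schedule (an illustrative choice; Stable Diffusion's trained schedule gives somewhat different values) to apply the forward-diffusion formula and print the signal weight $\sqrt{\bar{\alpha}_t}$ at a few timesteps:

```python
import torch

# Linear beta schedule as in DDPM (illustrative values only)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(z0, t):
    """Apply z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * eps."""
    eps = torch.randn_like(z0)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps

z0 = torch.randn(1, 4, 64, 64)  # stand-in for an encoded image latent
for t in (100, 500, 900):
    zt = add_noise(z0, t)
    print(f"t={t}: signal weight = {alphas_cumprod[t].sqrt():.3f}")
```

The signal weight falls monotonically with $t$: early timesteps keep most of the input, late timesteps almost none.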

Signal Preservation

The amount of original image signal preserved depends on αˉt\bar{\alpha}_t:

| Timestep | alpha_bar | Signal Ratio | What Survives |
|---|---|---|---|
| t = 100 | 0.98 | 98% | Almost everything |
| t = 300 | 0.85 | 85% | Overall structure, colors |
| t = 500 | 0.60 | 60% | Major shapes, composition |
| t = 700 | 0.30 | 30% | Rough layout only |
| t = 900 | 0.05 | 5% | Almost nothing |

Intuition: Noise injection is like progressively blurring an image. At low noise levels, you can still see details. At high noise levels, only the vague silhouette remains.

The Strength Parameter

The strength parameter (0.0 to 1.0) controls the transformation intensity:

```python
# Strength determines how many denoising steps actually run
num_inference_steps = 50

# Low strength: minimal change
num_denoise_steps = int(num_inference_steps * 0.3)  # run 15 of 50 steps
# Result: subtle style adjustment, original image mostly preserved

# Medium strength: balanced transformation
num_denoise_steps = int(num_inference_steps * 0.6)  # run 30 of 50 steps
# Result: style changed, structure preserved

# High strength: major transformation
num_denoise_steps = int(num_inference_steps * 0.9)  # run 45 of 50 steps
# Result: almost full regeneration, only hints of original
```

Choosing the Right Strength

| Strength | Use Case | Original Preserved |
|---|---|---|
| 0.2 - 0.3 | Subtle tweaks, color correction | Very high |
| 0.4 - 0.5 | Style transfer, texture changes | High |
| 0.5 - 0.7 | Significant style change, content preserved | Medium |
| 0.7 - 0.8 | Major transformation, composition kept | Low |
| 0.8 - 1.0 | Almost pure generation, slight hints | Minimal |

Common Mistake

Many users set strength too high (0.8+) when trying to preserve structure. For most editing tasks, 0.4-0.6 provides a better balance between prompt adherence and input preservation.
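
In practice, strength is converted into a number of denoising steps. A minimal sketch of that mapping (modeled on the clipping logic the Diffusers img2img pipeline applies internally; the helper name is my own):

```python
def steps_for_strength(num_inference_steps, strength):
    """Return (start_index, steps_run): how far into the noise
    schedule to begin, and how many denoising steps execute."""
    steps_run = min(int(num_inference_steps * strength), num_inference_steps)
    start_index = max(num_inference_steps - steps_run, 0)
    return start_index, steps_run

print(steps_for_strength(50, 0.3))   # (35, 15): only the last 15 steps run
print(steps_for_strength(50, 0.75))  # (13, 37)
print(steps_for_strength(50, 1.0))   # (0, 50): full text-to-image generation
```

Because the scheduler's timestep list is ordered from high noise to low, starting at index 35 means beginning from a only mildly noised latent, which is why low strengths preserve so much of the input.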

Implementation

Here's a complete implementation using the Diffusers library:

Basic img2img

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Load the pipeline
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Load and resize input image to the model resolution
input_image = Image.open("sketch.png").convert("RGB")
input_image = input_image.resize((512, 512))

# Generate transformed image
output = pipe(
    prompt="A detailed oil painting of a mountain landscape",
    image=input_image,
    strength=0.6,           # 60% regenerated, 40% preserved
    guidance_scale=7.5,     # prompt adherence (CFG scale)
    num_inference_steps=50
)

output.images[0].save("transformed.png")
```

Advanced: Inpainting

A variant of img2img that only regenerates masked regions:

  • Input: Image + mask (white = regenerate, black = keep)
  • Process: Only noisy regions are denoised; masked areas preserved
  • Use case: Object removal, adding elements, fixing artifacts
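
The per-step blend described above can be sketched as follows (a simplified illustration of latent-space inpainting, not the exact Diffusers pipeline code; names are my own):

```python
import torch

def inpaint_blend(denoised_latents, original_latents, noise, mask, a_bar_t):
    """Blend after each denoising step.
    mask: 1 where the region should be regenerated,
          0 where the original image must be kept."""
    # Re-noise the original to the current step's noise level...
    noised_original = (
        a_bar_t.sqrt() * original_latents + (1 - a_bar_t).sqrt() * noise
    )
    # ...then keep the model's output only inside the masked region
    return mask * denoised_latents + (1 - mask) * noised_original
```

With mask = 0 everywhere this reduces to plain forward diffusion of the original; with mask = 1 everywhere it reduces to ordinary img2img.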

Applications and Use Cases

1. Style Transfer

Transform images to different artistic styles:

  • Input: Photograph
  • Prompt: "An oil painting in the style of Van Gogh"
  • Strength: 0.5-0.7 (preserve structure, change style)

2. Sketch to Image

Refine rough sketches into detailed images:

  • Input: Hand-drawn sketch
  • Prompt: "Detailed digital art of [subject]"
  • Strength: 0.6-0.8 (add details, follow structure)

3. Photo Enhancement

Improve photo quality and add details:

  • Input: Low-quality or blurry photo
  • Prompt: "High quality, detailed photograph of [subject]"
  • Strength: 0.3-0.5 (enhance without changing content)

4. Scene Modifications

Change aspects of a scene while keeping composition:

  • Input: Daytime scene
  • Prompt: "Same scene at night with stars"
  • Strength: 0.5-0.7 (change lighting, keep structure)

Pro Tip: For best results, describe both what you want AND elements from the original image in your prompt. This helps guide the model to preserve desired features.

Summary

Image-to-image generation with SDEdit provides powerful transformation capabilities without additional training:

  1. SDEdit algorithm: Add noise to input, then denoise with new prompt - simple yet effective
  2. Noise injection: Forward diffusion formula determines how much original signal is preserved
  3. Strength parameter: 0.0 = no change, 1.0 = pure generation; typical range 0.4-0.7 for editing
  4. Applications: Style transfer, sketch refinement, photo enhancement, scene modifications
  5. Key insight: Structure survives noising better than details, enabling structural preservation with detail regeneration

Looking Ahead: In the next section, we'll explore IP-Adapter, which enables using images as prompts - providing a different form of image conditioning that captures style and content semantically.