Introduction
While text prompts provide intuitive control over image generation, they struggle to convey specific visual details like exact styles, textures, or the appearance of particular objects. IP-Adapter (Image Prompt Adapter) solves this by enabling images themselves to serve as prompts, allowing users to reference visual content directly rather than describing it in words.
IP-Adapter represents a lightweight, modular approach to image conditioning that works alongside existing text conditioning without requiring fine-tuning of the base model. This section explores its architecture, training procedure, and the powerful capabilities that emerge from combining image and text prompts.
Limitations of Text Conditioning
Text descriptions, despite their expressiveness, have fundamental limitations when guiding image generation:
| Aspect | Text Limitation | Image Advantage |
|---|---|---|
| Style Transfer | "Impressionist style" is ambiguous | Directly reference a Monet painting |
| Character Consistency | Cannot describe exact facial features | Use reference image of the character |
| Texture Details | "Weathered wood" varies wildly | Show the exact texture desired |
| Composition Reference | Complex layouts hard to describe | Use composition as visual template |
| Brand Identity | Logo details impossible to verbalize | Reference brand imagery directly |
The fundamental issue is the description bottleneck: some visual information is inherently difficult or impossible to capture in language. An image contains millions of pixels of precise information, while even detailed text prompts encode far less specificity.
The Image Prompt Paradigm: Instead of asking "how do I describe this in words?", IP-Adapter lets us ask "what image captures what I want?" This shifts the creative workflow from linguistic description to visual reference.
IP-Adapter Architecture
IP-Adapter introduces image conditioning through a clever architectural addition that preserves the original model's text conditioning capabilities while adding parallel image conditioning pathways.
Decoupled Cross-Attention
The key innovation is decoupled cross-attention, which separates text and image conditioning into parallel attention mechanisms rather than mixing them in a single attention layer:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoupledCrossAttention(nn.Module):
    """
    Cross-attention with separate text and image branches.
    Original text cross-attention is frozen; the image branch is trainable.
    """
    def __init__(self, hidden_dim: int, num_heads: int = 8):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads

        # Original text cross-attention (FROZEN)
        self.text_q = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.text_k = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.text_v = nn.Linear(hidden_dim, hidden_dim, bias=False)

        # New image cross-attention (TRAINABLE)
        self.image_k = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.image_v = nn.Linear(hidden_dim, hidden_dim, bias=False)

        # Shared output projection
        self.out_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(
        self,
        hidden_states: torch.Tensor,     # [B, N, D] - U-Net features
        text_embeddings: torch.Tensor,   # [B, T, D] - CLIP text
        image_embeddings: torch.Tensor,  # [B, I, D] - projected image features
        image_scale: float = 1.0
    ) -> torch.Tensor:
        B, N, D = hidden_states.shape

        # Text cross-attention (original, frozen)
        q = self.text_q(hidden_states)
        k_text = self.text_k(text_embeddings)
        v_text = self.text_v(text_embeddings)

        text_attn = F.scaled_dot_product_attention(
            q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2),
            k_text.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2),
            v_text.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        ).transpose(1, 2).reshape(B, N, D)

        # Image cross-attention (new, trainable); note the queries are shared
        k_image = self.image_k(image_embeddings)
        v_image = self.image_v(image_embeddings)

        image_attn = F.scaled_dot_product_attention(
            q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2),
            k_image.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2),
            v_image.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        ).transpose(1, 2).reshape(B, N, D)

        # Combine: text + scaled image
        combined = text_attn + image_scale * image_attn

        return self.out_proj(combined)
```

Understanding Decoupled Cross-Attention:
- Text pathway (frozen): Original cross-attention weights remain untouched, preserving text understanding
- Image pathway (trainable): New K and V projections learn to process image embeddings
- Shared queries: Both pathways use the same queries from U-Net features
- Additive combination: Text and image attention outputs are summed with adjustable scaling
- image_scale: Controls the influence of image conditioning (typically 0.0 to 1.5)
This decoupled design is crucial: by keeping text and image processing separate, IP-Adapter avoids interference between the two modalities. The model learns when image features are relevant without forgetting how to use text.
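The key property of the additive combination can be sanity-checked in isolation: with `image_scale=0` the output reduces exactly to the text-only attention, which is why the frozen model's behavior is preserved. A minimal sketch with toy dimensions and random tensors (single-head attention for brevity):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, N, T, I, D = 2, 16, 77, 4, 64  # batch, query tokens, text tokens, image tokens, dim

q = torch.randn(B, N, D)                                     # shared queries
k_text, v_text = torch.randn(B, T, D), torch.randn(B, T, D)  # text K/V
k_img, v_img = torch.randn(B, I, D), torch.randn(B, I, D)    # image K/V

text_attn = F.scaled_dot_product_attention(q, k_text, v_text)
image_attn = F.scaled_dot_product_attention(q, k_img, v_img)

def combine(image_scale: float) -> torch.Tensor:
    # the decoupled design sums attention *outputs*, not prompt tokens
    return text_attn + image_scale * image_attn

# scale 0.0 recovers the frozen text-only pathway exactly
assert torch.allclose(combine(0.0), text_attn)
```

Because the two branches never mix inside the softmax, scaling the image branch cannot redistribute attention weights within the text branch.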
Image Encoder Design
IP-Adapter uses a pre-trained image encoder to extract features from reference images. The standard choice is CLIP ViT, which provides rich semantic features aligned with the text encoder used by Stable Diffusion:
```python
from transformers import CLIPVisionModel


class IPAdapterImageEncoder(nn.Module):
    """
    Extracts and projects image features for IP-Adapter.
    Uses CLIP ViT as the backbone.
    """
    def __init__(
        self,
        clip_model: str = "openai/clip-vit-large-patch14",
        output_dim: int = 768,
        num_tokens: int = 4
    ):
        super().__init__()

        # Pre-trained CLIP image encoder (frozen)
        self.image_encoder = CLIPVisionModel.from_pretrained(clip_model)
        self.image_encoder.requires_grad_(False)

        clip_dim = self.image_encoder.config.hidden_size  # 1024 for ViT-L

        # Options for which features to use
        self.use_cls_token = True      # Global image feature
        self.use_patch_tokens = False  # Spatial features (for Plus variants)

        # Projection to cross-attention dimension
        self.proj = nn.Linear(clip_dim, output_dim * num_tokens)
        self.num_tokens = num_tokens
        self.output_dim = output_dim

        # Layer normalization for stable training
        self.norm = nn.LayerNorm(output_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        """
        Args:
            images: [B, 3, 224, 224] - preprocessed images

        Returns:
            image_embeddings: [B, num_tokens, output_dim]
        """
        # Extract CLIP features (no gradient)
        with torch.no_grad():
            outputs = self.image_encoder(images)

        if self.use_cls_token:
            # Global feature from the [CLS] token
            features = outputs.pooler_output  # [B, 1024]
        else:
            # All patch tokens for spatial information
            # (Plus variants feed these through a resampler instead of self.proj)
            features = outputs.last_hidden_state[:, 1:]  # [B, 256, 1024]

        # Project to target dimension
        projected = self.proj(features)  # [B, output_dim * num_tokens]

        # Reshape to token sequence
        B = images.shape[0]
        image_embeddings = projected.view(B, self.num_tokens, self.output_dim)

        return self.norm(image_embeddings)
```

The num_tokens parameter controls how many "image tokens" are created from each reference image. More tokens can capture more detail but increase computation. The standard IP-Adapter uses 4 tokens, while IP-Adapter-Plus uses 16 or more.
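The cost tradeoff is easy to quantify: the image branch's attention scores form an [N_query x num_tokens] matrix per layer, so its compute grows linearly with the token count. A back-of-envelope sketch (counting only the QK^T and attention-weighted-V multiply-accumulates, ignoring projections and softmax):

```python
def image_attn_macs(n_query: int, num_tokens: int, dim: int) -> int:
    """Multiply-accumulates for QK^T plus the attention-weighted sum of V."""
    return 2 * n_query * num_tokens * dim

standard = image_attn_macs(n_query=4096, num_tokens=4, dim=768)   # IP-Adapter
plus = image_attn_macs(n_query=4096, num_tokens=16, dim=768)      # IP-Adapter-Plus

# 4x the tokens -> 4x the attention cost in the image branch
assert plus == 4 * standard
```

Since the image branch attends over only a handful of tokens (versus 77 text tokens in the text branch), even the Plus variant adds a modest fraction of the cross-attention cost.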
Training IP-Adapter
IP-Adapter training is remarkably efficient because most of the model remains frozen. Only the image projection network and the new cross-attention weights are trained.
Projection Network
The projection network transforms CLIP image features into a format suitable for cross-attention. Several architectures have been explored:
```python
class ImageProjectionMLP(nn.Module):
    """Simple MLP projection for the basic IP-Adapter."""
    def __init__(self, clip_dim: int, output_dim: int, num_tokens: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, clip_dim),
            nn.GELU(),
            nn.Linear(clip_dim, output_dim * num_tokens)
        )
        self.num_tokens = num_tokens
        self.output_dim = output_dim

    def forward(self, x):
        # x: [B, clip_dim]
        projected = self.proj(x)  # [B, output_dim * num_tokens]
        return projected.view(-1, self.num_tokens, self.output_dim)


class ImageProjectionResampler(nn.Module):
    """
    Perceiver Resampler for IP-Adapter-Plus.
    Uses cross-attention to extract relevant features from patch tokens.
    """
    def __init__(
        self,
        clip_dim: int = 1024,
        output_dim: int = 768,
        num_queries: int = 16,
        num_layers: int = 4,
        num_heads: int = 8
    ):
        super().__init__()

        # Learnable query tokens
        self.queries = nn.Parameter(torch.randn(1, num_queries, output_dim))

        # Cross-attention layers
        self.layers = nn.ModuleList([
            nn.TransformerDecoderLayer(
                d_model=output_dim,
                nhead=num_heads,
                dim_feedforward=output_dim * 4,
                batch_first=True
            )
            for _ in range(num_layers)
        ])

        # Project CLIP features to the output dimension
        self.input_proj = nn.Linear(clip_dim, output_dim)
        self.output_norm = nn.LayerNorm(output_dim)

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        """
        Args:
            clip_features: [B, num_patches, clip_dim] - CLIP patch tokens

        Returns:
            [B, num_queries, output_dim] - condensed image features
        """
        B = clip_features.shape[0]

        # Project to the output dimension
        memory = self.input_proj(clip_features)  # [B, num_patches, output_dim]

        # Expand queries for the batch
        queries = self.queries.expand(B, -1, -1)  # [B, num_queries, output_dim]

        # Cross-attend to image features
        for layer in self.layers:
            queries = layer(queries, memory)

        return self.output_norm(queries)
```

The Perceiver Resampler is more powerful than simple MLP projection because it can selectively attend to relevant image regions. This is especially important for style transfer, where global texture patterns matter more than specific object locations.
Training Objective
Training follows the standard diffusion objective, but with image conditioning added. The key is using image-caption pairs where the image serves as both the generation target and the conditioning signal:
```python
from torch.utils.data import DataLoader
from diffusers import DDPMScheduler, UNet2DConditionModel


def train_ip_adapter(
    unet: UNet2DConditionModel,
    ip_adapter: IPAdapter,
    image_encoder: IPAdapterImageEncoder,
    dataloader: DataLoader,
    optimizer: torch.optim.Optimizer,
    noise_scheduler: DDPMScheduler,
    num_epochs: int = 100
):
    """
    Training loop for IP-Adapter.
    Only ip_adapter parameters are updated.
    """
    # Freeze everything except the IP-Adapter
    unet.requires_grad_(False)
    image_encoder.requires_grad_(False)
    ip_adapter.requires_grad_(True)

    for epoch in range(num_epochs):
        for batch in dataloader:
            images = batch["image"]                     # Target images
            text_embeddings = batch["text_embeddings"]  # CLIP text features

            # Encode reference images
            # Key: use the target image itself as the reference!
            image_embeddings = image_encoder(images)
            image_embeddings = ip_adapter.project(image_embeddings)

            # Standard diffusion training
            noise = torch.randn_like(images)
            timesteps = torch.randint(0, 1000, (images.shape[0],))
            noisy_images = noise_scheduler.add_noise(images, noise, timesteps)

            # Predict noise with both text and image conditioning
            noise_pred = unet(
                noisy_images,
                timesteps,
                encoder_hidden_states=text_embeddings,
                added_cond_kwargs={"image_embeds": image_embeddings}
            ).sample

            # MSE loss on the noise prediction
            loss = F.mse_loss(noise_pred, noise)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    return ip_adapter
```

Self-Referential Training: During training, the model learns to reconstruct images given their own CLIP embeddings. At inference, we can use any image as a reference, and the model will generate new images capturing similar semantic content.
The trainable parameter count is remarkably small, typically around 22M parameters for the standard IP-Adapter, compared to ~860M for the full SD 1.5 model. Training can be completed on a single GPU in a day or less.
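A quick way to verify what is actually trainable in a setup like this is to count parameters with `requires_grad` set. The pattern below uses toy modules standing in for the frozen base model and the adapter's K/V projections:

```python
import torch.nn as nn

base_model = nn.Linear(768, 768)  # stand-in for the frozen U-Net
adapter = nn.ModuleDict({
    "to_k_ip": nn.Linear(768, 768, bias=False),
    "to_v_ip": nn.Linear(768, 768, bias=False),
})
base_model.requires_grad_(False)

def trainable_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

assert trainable_params(base_model) == 0
assert trainable_params(adapter) == 2 * 768 * 768  # only the adapter trains
```

Running the same check on a real IP-Adapter setup confirms that only the projection network and the new K/V weights contribute to the optimizer's parameter count.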
Combining Text and Image Prompts
The true power of IP-Adapter emerges when combining text and image conditioning. This enables precise control: images specify style or appearance, while text guides content and composition.
Weighted Conditioning
The balance between text and image influence is controlled by the image scale parameter, which multiplies the image attention output before adding it to the text attention:
```python
import PIL.Image
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image


def generate_with_ip_adapter(
    pipe: StableDiffusionPipeline,
    ip_adapter: IPAdapter,
    prompt: str,
    reference_image: PIL.Image.Image,
    image_scale: float = 0.6,
    num_inference_steps: int = 50,
    guidance_scale: float = 7.5
):
    """
    Generate images using both text and image prompts.

    Args:
        prompt: Text description of desired output
        reference_image: Image to use as style/content reference
        image_scale: Weight for image conditioning (0.0 to 1.5)
            0.0 = text only
            0.5 = balanced
            1.0 = strong image influence
            >1.0 = image dominates
    """
    # Encode the reference image
    image_embeddings = ip_adapter.encode_image(reference_image)

    # Set the image scale in cross-attention
    ip_adapter.set_scale(image_scale)

    # Generate with combined conditioning
    output = pipe(
        prompt=prompt,
        ip_adapter_image_embeds=image_embeddings,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale
    )

    return output.images[0]


# Example: style transfer with content guidance
result = generate_with_ip_adapter(
    pipe=pipe,
    ip_adapter=ip_adapter,
    prompt="a cozy cabin in the mountains at sunset",
    reference_image=load_image("van_gogh_starry_night.jpg"),
    image_scale=0.7  # Strong style influence from Van Gogh
)
```

Choosing Image Scale Values:
- 0.0-0.3: Subtle influence, mainly text-driven with hints of image style
- 0.4-0.6: Balanced blend, good for style transfer while maintaining text content
- 0.7-0.9: Strong image influence, output closely matches reference style/content
- 1.0+: Image dominates, useful for variations or style-locked generation
A powerful workflow involves using negative image prompts as well. By encoding an undesired reference image and subtracting its influence, you can steer generation away from specific styles or content.
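One way this can be realized (an illustrative sketch of the arithmetic only, not a specific library API) is to compute the image-attention output for both references and combine them with opposite signs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(2, 16, 64)          # U-Net queries
text_attn = torch.randn(2, 16, 64)  # stand-in for the text attention output

def image_attn(ref_tokens: torch.Tensor) -> torch.Tensor:
    # stand-in for the trainable image cross-attention branch
    return F.scaled_dot_product_attention(q, ref_tokens, ref_tokens)

pos_ref = torch.randn(2, 4, 64)  # tokens from the desired reference
neg_ref = torch.randn(2, 4, 64)  # tokens from the undesired reference

# positive reference pulls generation toward its style, negative pushes away
combined = text_attn + 0.7 * image_attn(pos_ref) - 0.3 * image_attn(neg_ref)
assert combined.shape == text_attn.shape
```

Because the decoupled design keeps the image contribution as a separate additive term, negating or down-weighting it is straightforward; the hypothetical weights 0.7 and 0.3 here would be tuned per use case.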
IP-Adapter Variants
Several variants of IP-Adapter have been developed for specific use cases, each with architectural modifications tailored to different applications.
IP-Adapter-FaceID
For consistent face generation, IP-Adapter-FaceID uses face recognition embeddings instead of CLIP features. This preserves identity more accurately than generic image features:
```python
class IPAdapterFaceID(nn.Module):
    """
    IP-Adapter specialized for face identity preservation.
    Uses face recognition embeddings for stronger identity consistency.
    """
    def __init__(self, face_dim: int = 512, output_dim: int = 768):
        super().__init__()
        self.output_dim = output_dim

        # Face recognition model (InsightFace/ArcFace); loader elided here
        self.face_encoder = load_face_recognition_model()
        self.face_encoder.requires_grad_(False)

        # Project face embeddings to the cross-attention dimension
        self.face_proj = nn.Sequential(
            nn.Linear(face_dim, output_dim),
            nn.LayerNorm(output_dim),
            nn.Linear(output_dim, output_dim * 4),  # 4 tokens
        )

        # Optional: additional CLIP features for context
        self.use_clip_context = True
        self.clip_encoder = CLIPVisionModel.from_pretrained("...")
        self.clip_proj = nn.Linear(1024, output_dim * 4)

    def encode_face(self, face_image: torch.Tensor) -> torch.Tensor:
        """Extract and project face identity features."""
        # Get the face recognition embedding
        with torch.no_grad():
            face_emb = self.face_encoder(face_image)  # [B, 512]

        # Project to IP-Adapter format
        face_tokens = self.face_proj(face_emb)  # [B, output_dim * 4]
        face_tokens = face_tokens.view(-1, 4, self.output_dim)

        if self.use_clip_context:
            # Add CLIP features for context (clothing, background hints)
            clip_features = self.clip_encoder(face_image).pooler_output
            clip_tokens = self.clip_proj(clip_features).view(-1, 4, self.output_dim)
            return torch.cat([face_tokens, clip_tokens], dim=1)  # [B, 8, D]

        return face_tokens
```

IP-Adapter-FaceID excels at tasks like:
- Generating consistent characters across multiple images
- Placing specific people in new scenarios or styles
- Creating variations while preserving identity
- Combining face identity with style transfer from another image
Plus Variants
IP-Adapter-Plus uses patch-level CLIP features instead of just the global CLS token, capturing more spatial detail:
| Variant | Features Used | Tokens | Best For |
|---|---|---|---|
| IP-Adapter | CLIP CLS token | 4 | General style transfer, simple references |
| IP-Adapter-Plus | CLIP patch tokens | 16 | Detailed style matching, textures |
| IP-Adapter-Full-Face | Face + CLIP patches | 16+ | Face + expression + context |
| IP-Adapter-Composition | Spatial features | 16 | Layout and composition transfer |
The Plus variants use the Perceiver Resampler to condense patch tokens into a fixed number of image tokens. This provides much richer spatial information while keeping the cross-attention cost manageable.
Implementation Details
Here's a complete IP-Adapter implementation showing how to integrate it with a Stable Diffusion pipeline:
```python
from transformers import CLIPImageProcessor


class IPAdapter:
    """
    Full IP-Adapter implementation for Stable Diffusion.
    """
    def __init__(
        self,
        pipe: StableDiffusionPipeline,
        image_encoder_path: str = "openai/clip-vit-large-patch14",
        ip_adapter_weights: str = "ip_adapter.bin",
        device: str = "cuda"
    ):
        self.pipe = pipe
        self.device = device

        # Load the image encoder
        self.image_encoder = CLIPVisionModel.from_pretrained(
            image_encoder_path
        ).to(device)
        self.image_encoder.requires_grad_(False)

        # Load the image processor
        self.image_processor = CLIPImageProcessor.from_pretrained(
            image_encoder_path
        )

        # Initialize projection layers
        self.image_proj = ImageProjectionMLP(
            clip_dim=1024,
            output_dim=768,
            num_tokens=4
        ).to(device)

        # Initialize cross-attention adapters
        self.setup_ip_adapter_attention()

        # Load pre-trained weights if provided
        if ip_adapter_weights:
            self.load_weights(ip_adapter_weights)

    def setup_ip_adapter_attention(self):
        """Inject IP-Adapter attention processors into the U-Net."""
        unet = self.pipe.unet
        attn_procs = {}

        for name, attn_proc in unet.attn_processors.items():
            if "attn2" in name:  # Cross-attention layers
                # Derive this block's hidden size from the U-Net config
                if name.startswith("mid_block"):
                    hidden_size = unet.config.block_out_channels[-1]
                elif name.startswith("up_blocks"):
                    block_id = int(name[len("up_blocks.")])
                    hidden_size = list(reversed(unet.config.block_out_channels))[block_id]
                else:  # down_blocks
                    block_id = int(name[len("down_blocks.")])
                    hidden_size = unet.config.block_out_channels[block_id]

                attn_procs[name] = IPAttnProcessor(
                    hidden_size=hidden_size,
                    cross_attention_dim=768,
                    num_tokens=4
                )
            else:
                # Keep self-attention processors unchanged
                attn_procs[name] = attn_proc

        unet.set_attn_processor(attn_procs)

    def load_weights(self, path: str):
        """Load pre-trained adapter weights (projection + attention K/V)."""
        state = torch.load(path, map_location=self.device)
        self.image_proj.load_state_dict(state["image_proj"])
        # ...load the per-layer to_k_ip / to_v_ip weights analogously

    @torch.no_grad()
    def encode_image(self, image: PIL.Image.Image) -> torch.Tensor:
        """Encode a PIL image to IP-Adapter embeddings."""
        # Preprocess
        pixel_values = self.image_processor(
            images=image,
            return_tensors="pt"
        ).pixel_values.to(self.device)

        # Encode with CLIP
        clip_output = self.image_encoder(pixel_values)
        image_features = clip_output.pooler_output  # [1, 1024]

        # Project to IP-Adapter format
        image_embeddings = self.image_proj(image_features)  # [1, 4, 768]

        return image_embeddings

    def set_scale(self, scale: float):
        """Set the image conditioning scale."""
        for attn_proc in self.pipe.unet.attn_processors.values():
            if hasattr(attn_proc, "scale"):
                attn_proc.scale = scale

    def generate(
        self,
        prompt: str,
        image: PIL.Image.Image,
        negative_prompt: str = "",
        scale: float = 0.6,
        num_inference_steps: int = 50,
        guidance_scale: float = 7.5,
        **kwargs
    ) -> PIL.Image.Image:
        """Generate with text + image conditioning."""
        # Encode the reference image
        image_embeddings = self.encode_image(image)

        # Set the conditioning scale
        self.set_scale(scale)

        # Generate
        output = self.pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            ip_adapter_image_embeds=image_embeddings,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            **kwargs
        )

        return output.images[0]


class IPAttnProcessor(nn.Module):
    """Cross-attention processor with an IP-Adapter branch."""
    def __init__(self, hidden_size: int, cross_attention_dim: int, num_tokens: int):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_tokens = num_tokens
        self.scale = 1.0

        # Image K and V projections
        self.to_k_ip = nn.Linear(cross_attention_dim, hidden_size, bias=False)
        self.to_v_ip = nn.Linear(cross_attention_dim, hidden_size, bias=False)

    def __call__(
        self,
        attn,
        hidden_states: torch.Tensor,
        encoder_hidden_states: torch.Tensor,
        attention_mask=None,
        ip_adapter_image_embeds=None
    ):
        # Original text cross-attention (multi-head reshaping omitted for clarity)
        query = attn.to_q(hidden_states)
        key = attn.to_k(encoder_hidden_states)
        value = attn.to_v(encoder_hidden_states)

        text_attn_output = F.scaled_dot_product_attention(query, key, value)

        # IP-Adapter image cross-attention
        if ip_adapter_image_embeds is not None:
            ip_key = self.to_k_ip(ip_adapter_image_embeds)
            ip_value = self.to_v_ip(ip_adapter_image_embeds)

            ip_attn_output = F.scaled_dot_product_attention(query, ip_key, ip_value)

            # Combine with scaling
            output = text_attn_output + self.scale * ip_attn_output
        else:
            output = text_attn_output

        # to_out is [Linear, Dropout] in diffusers' Attention module
        return attn.to_out[0](output)
```

Practical Applications
IP-Adapter enables numerous practical workflows that are difficult or impossible with text prompts alone:
- Style Transfer with Content Control: Use a painting as the image prompt and text for the subject. "A portrait of a young scientist" + Van Gogh reference = Van Gogh-style scientist portrait.
- Product Photography Variations: Reference an existing product photo to maintain brand aesthetics while generating new scenes or angles through text.
- Character Consistency: With IP-Adapter-FaceID, generate the same character across multiple scenarios while varying text prompts for different actions and settings.
- Mood Boards to Images: Use multiple reference images (averaged or concatenated) to capture a complex aesthetic, then refine with text.
- Texture Transfer: Extract detailed textures from photographs and apply them to generated content through image prompts.
Workflow Tip: Start with image_scale around 0.5, then adjust based on results. Too low and the image influence disappears; too high and the text prompt is ignored. The sweet spot depends on how specific your image and text requirements are.
Summary
IP-Adapter represents a significant advancement in diffusion model conditioning, enabling intuitive image-based prompting alongside traditional text. Key takeaways:
- Decoupled cross-attention allows parallel text and image conditioning without interference, keeping text capabilities intact while adding image understanding.
- Lightweight training (22M parameters) makes IP-Adapter accessible for customization while maintaining the base model frozen.
- CLIP image features provide rich semantic embeddings that translate well across different visual domains.
- Variants like FaceID and Plus specialize the approach for faces, detailed styles, and spatial composition.
- Image scale control allows precise balancing between text and image influence for different creative requirements.
In the next section, we'll explore multi-modal conditioning more broadly, examining how multiple conditioning signals can be combined and weighted for fine-grained control over generation.