Chapter 14

IP-Adapter and Image Prompts

Advanced Conditioning Techniques

Introduction

While text prompts provide intuitive control over image generation, they struggle to convey specific visual details like exact styles, textures, or the appearance of particular objects. IP-Adapter (Image Prompt Adapter) solves this by enabling images themselves to serve as prompts, allowing users to reference visual content directly rather than describing it in words.

IP-Adapter represents a lightweight, modular approach to image conditioning that works alongside existing text conditioning without requiring fine-tuning of the base model. This section explores its architecture, training procedure, and the powerful capabilities that emerge from combining image and text prompts.


Limitations of Text Conditioning

Text descriptions, despite their expressiveness, have fundamental limitations when guiding image generation:

| Aspect | Text Limitation | Image Advantage |
|---|---|---|
| Style Transfer | "Impressionist style" is ambiguous | Directly reference a Monet painting |
| Character Consistency | Cannot describe exact facial features | Use reference image of the character |
| Texture Details | "Weathered wood" varies wildly | Show the exact texture desired |
| Composition Reference | Complex layouts hard to describe | Use composition as visual template |
| Brand Identity | Logo details impossible to verbalize | Reference brand imagery directly |

The fundamental issue is the description bottleneck: some visual information is inherently difficult or impossible to capture in language. An image contains millions of pixels of precise information, while even detailed text prompts encode far less specificity.

The Image Prompt Paradigm: Instead of asking "how do I describe this in words?", IP-Adapter lets us ask "what image captures what I want?" This shifts the creative workflow from linguistic description to visual reference.

IP-Adapter Architecture

IP-Adapter introduces image conditioning through a clever architectural addition that preserves the original model's text conditioning capabilities while adding parallel image conditioning pathways.

Decoupled Cross-Attention

The key innovation is decoupled cross-attention, which separates text and image conditioning into parallel attention mechanisms rather than mixing them in a single attention layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoupledCrossAttention(nn.Module):
    """
    Cross-attention with separate text and image branches.
    Original text cross-attention is frozen, image branch is trainable.
    For simplicity, text and image embeddings are assumed to already be
    projected to hidden_dim.
    """
    def __init__(self, hidden_dim: int, num_heads: int = 8):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads

        # Original text cross-attention (FROZEN)
        self.text_q = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.text_k = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.text_v = nn.Linear(hidden_dim, hidden_dim, bias=False)

        # New image cross-attention (TRAINABLE)
        self.image_k = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.image_v = nn.Linear(hidden_dim, hidden_dim, bias=False)

        # Shared output projection
        self.out_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(
        self,
        hidden_states: torch.Tensor,      # [B, N, D] - U-Net features
        text_embeddings: torch.Tensor,    # [B, T, D] - CLIP text
        image_embeddings: torch.Tensor,   # [B, I, D] - Projected image features
        image_scale: float = 1.0
    ) -> torch.Tensor:
        B, N, D = hidden_states.shape

        # Text cross-attention (original, frozen)
        q = self.text_q(hidden_states)
        k_text = self.text_k(text_embeddings)
        v_text = self.text_v(text_embeddings)

        text_attn = F.scaled_dot_product_attention(
            q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2),
            k_text.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2),
            v_text.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        ).transpose(1, 2).reshape(B, N, D)

        # Image cross-attention (new, trainable)
        k_image = self.image_k(image_embeddings)
        v_image = self.image_v(image_embeddings)

        image_attn = F.scaled_dot_product_attention(
            q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2),
            k_image.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2),
            v_image.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        ).transpose(1, 2).reshape(B, N, D)

        # Combine: text + scaled image
        combined = text_attn + image_scale * image_attn

        return self.out_proj(combined)
```
Understanding Decoupled Cross-Attention:
  • Text pathway (frozen): Original cross-attention weights remain untouched, preserving text understanding
  • Image pathway (trainable): New K and V projections learn to process image embeddings
  • Shared queries: Both pathways use the same queries from U-Net features
  • Additive combination: Text and image attention outputs are summed with adjustable scaling
  • image_scale: Controls the influence of image conditioning (typically 0.0 to 1.5)

This decoupled design is crucial: by keeping text and image processing separate, IP-Adapter avoids interference between the two modalities. The model learns when image features are relevant without forgetting how to use text.
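To make the additive combination concrete, here is a tiny self-contained sketch (random tensors, single-head attention, all shapes illustrative): shared queries attend to text and image keys separately, and setting image_scale to zero recovers pure text conditioning.

```python
# Illustrative sketch of decoupled cross-attention (not the official
# implementation): random single-head tensors with shared queries.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, N, T, I, D = 2, 64, 77, 4, 32   # batch, query len, text len, image tokens, dim

q = torch.randn(B, N, D)           # shared queries from U-Net features
k_text, v_text = torch.randn(B, T, D), torch.randn(B, T, D)
k_img, v_img = torch.randn(B, I, D), torch.randn(B, I, D)

def decoupled_attn(image_scale: float) -> torch.Tensor:
    text_out = F.scaled_dot_product_attention(q, k_text, v_text)
    image_out = F.scaled_dot_product_attention(q, k_img, v_img)
    return text_out + image_scale * image_out

# With image_scale = 0, the output is exactly text-only attention
text_only = F.scaled_dot_product_attention(q, k_text, v_text)
print(torch.allclose(decoupled_attn(0.0), text_only))  # True
```

Because the image branch only ever adds to the text branch, the scale can be dialed continuously at inference time without retraining anything.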

Image Encoder Design

IP-Adapter uses a pre-trained image encoder to extract features from reference images. The standard choice is CLIP ViT, which provides rich semantic features aligned with the text encoder used by Stable Diffusion:

```python
from transformers import CLIPVisionModel


class IPAdapterImageEncoder(nn.Module):
    """
    Extracts and projects image features for IP-Adapter.
    Uses CLIP ViT as the backbone.
    """
    def __init__(
        self,
        clip_model: str = "openai/clip-vit-large-patch14",
        output_dim: int = 768,
        num_tokens: int = 4
    ):
        super().__init__()

        # Pre-trained CLIP image encoder (frozen)
        self.image_encoder = CLIPVisionModel.from_pretrained(clip_model)
        self.image_encoder.requires_grad_(False)

        clip_dim = self.image_encoder.config.hidden_size  # 1024 for ViT-L

        # Options for which features to use
        self.use_cls_token = True      # Global image feature
        self.use_patch_tokens = False  # Spatial features (for Plus variants)

        # Projection to cross-attention dimension
        # (this simple linear projection assumes the global [CLS] feature;
        # patch tokens are handled by the Perceiver Resampler shown later)
        self.proj = nn.Linear(clip_dim, output_dim * num_tokens)
        self.num_tokens = num_tokens
        self.output_dim = output_dim

        # Layer normalization for stable training
        self.norm = nn.LayerNorm(output_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        """
        Args:
            images: [B, 3, 224, 224] - Preprocessed images

        Returns:
            image_embeddings: [B, num_tokens, output_dim]
        """
        # Extract CLIP features (no gradient)
        with torch.no_grad():
            outputs = self.image_encoder(images)

            if self.use_cls_token:
                # Global feature from [CLS] token
                features = outputs.pooler_output  # [B, 1024]
            else:
                # All patch tokens for spatial information
                features = outputs.last_hidden_state[:, 1:]  # [B, 256, 1024]

        # Project to target dimension
        projected = self.proj(features)  # [B, output_dim * num_tokens]

        # Reshape to token sequence
        B = images.shape[0]
        image_embeddings = projected.view(B, self.num_tokens, self.output_dim)

        return self.norm(image_embeddings)
```

The `num_tokens` parameter controls how many "image tokens" are created from each reference image. More tokens can capture more detail but increase computation. The standard IP-Adapter uses 4 tokens, while IP-Adapter-Plus uses 16 or more.


Training IP-Adapter

IP-Adapter training is remarkably efficient because most of the model remains frozen. Only the image projection network and the new cross-attention weights are trained.

Projection Network

The projection network transforms CLIP image features into a format suitable for cross-attention. Several architectures have been explored:

```python
class ImageProjectionMLP(nn.Module):
    """Simple MLP projection for basic IP-Adapter."""
    def __init__(self, clip_dim: int, output_dim: int, num_tokens: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, clip_dim),
            nn.GELU(),
            nn.Linear(clip_dim, output_dim * num_tokens)
        )
        self.num_tokens = num_tokens
        self.output_dim = output_dim

    def forward(self, x):
        # x: [B, clip_dim]
        projected = self.proj(x)  # [B, output_dim * num_tokens]
        return projected.view(-1, self.num_tokens, self.output_dim)


class ImageProjectionResampler(nn.Module):
    """
    Perceiver Resampler for IP-Adapter-Plus.
    Uses cross-attention to extract relevant features from patch tokens.
    """
    def __init__(
        self,
        clip_dim: int = 1024,
        output_dim: int = 768,
        num_queries: int = 16,
        num_layers: int = 4,
        num_heads: int = 8
    ):
        super().__init__()

        # Learnable query tokens
        self.queries = nn.Parameter(torch.randn(1, num_queries, output_dim))

        # Cross-attention layers
        self.layers = nn.ModuleList([
            nn.TransformerDecoderLayer(
                d_model=output_dim,
                nhead=num_heads,
                dim_feedforward=output_dim * 4,
                batch_first=True
            )
            for _ in range(num_layers)
        ])

        # Project CLIP features to output dimension
        self.input_proj = nn.Linear(clip_dim, output_dim)
        self.output_norm = nn.LayerNorm(output_dim)

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        """
        Args:
            clip_features: [B, num_patches, clip_dim] - CLIP patch tokens

        Returns:
            [B, num_queries, output_dim] - Condensed image features
        """
        B = clip_features.shape[0]

        # Project to output dimension
        memory = self.input_proj(clip_features)  # [B, num_patches, output_dim]

        # Expand queries for batch
        queries = self.queries.expand(B, -1, -1)  # [B, num_queries, output_dim]

        # Cross-attend to image features
        for layer in self.layers:
            queries = layer(queries, memory)

        return self.output_norm(queries)
```

The Perceiver Resampler is more powerful than simple MLP projection because it can selectively attend to relevant image regions. This is especially important for style transfer where global texture patterns matter more than specific object locations.

Training Objective

Training follows the standard diffusion objective, but with image conditioning added. The key is using image-caption pairs where the image serves as both the generation target and the conditioning signal:

```python
from torch.utils.data import DataLoader
from diffusers import DDPMScheduler, UNet2DConditionModel


def train_ip_adapter(
    unet: UNet2DConditionModel,
    ip_adapter: IPAdapter,
    image_encoder: IPAdapterImageEncoder,
    dataloader: DataLoader,
    optimizer: torch.optim.Optimizer,
    noise_scheduler: DDPMScheduler,
    num_epochs: int = 100
):
    """
    Training loop for IP-Adapter.
    Only ip_adapter parameters are updated.
    (Shown in pixel space for clarity; Stable Diffusion actually trains
    on VAE latents.)
    """
    # Freeze everything except IP-Adapter
    unet.requires_grad_(False)
    image_encoder.requires_grad_(False)
    ip_adapter.requires_grad_(True)

    for epoch in range(num_epochs):
        for batch in dataloader:
            images = batch["image"]                     # Target images
            text_embeddings = batch["text_embeddings"]  # CLIP text features

            # Encode reference images
            # Key: Use the target image itself as the reference!
            image_embeddings = image_encoder(images)
            image_embeddings = ip_adapter.project(image_embeddings)

            # Standard diffusion training
            noise = torch.randn_like(images)
            timesteps = torch.randint(0, 1000, (images.shape[0],), device=images.device)
            noisy_images = noise_scheduler.add_noise(images, noise, timesteps)

            # Predict noise with both text and image conditioning
            noise_pred = unet(
                noisy_images,
                timesteps,
                encoder_hidden_states=text_embeddings,
                added_cond_kwargs={"image_embeds": image_embeddings}
            ).sample

            # MSE loss on noise prediction
            loss = F.mse_loss(noise_pred, noise)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    return ip_adapter
```
Self-Referential Training: During training, the model learns to reconstruct images given their own CLIP embeddings. At inference, we can use any image as a reference, and the model will generate new images capturing similar semantic content.

The trainable parameter count is remarkably small: typically around 22M parameters for the standard IP-Adapter, compared to roughly 860M for the full SD 1.5 model. Training can be completed on a single GPU in a day or less.
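One way to sanity-check the frozen/trainable split is to count parameters by their `requires_grad` flag. A minimal sketch with stand-in modules (the real iteration would run over the U-Net and the adapter layers):

```python
import torch.nn as nn

# Stand-ins only: a frozen "base model" and a trainable adapter layer.
base_model = nn.Linear(768, 768)           # stands in for the frozen U-Net
adapter = nn.Linear(768, 768, bias=False)  # stands in for the new K/V projections

base_model.requires_grad_(False)
adapter.requires_grad_(True)

def count_params(module: nn.Module, trainable_only: bool = False) -> int:
    # Sum parameter element counts, optionally restricted to trainable ones
    return sum(p.numel() for p in module.parameters()
               if p.requires_grad or not trainable_only)

total = count_params(base_model) + count_params(adapter)
trainable = (count_params(base_model, trainable_only=True)
             + count_params(adapter, trainable_only=True))
print(f"trainable: {trainable:,} of {total:,} parameters")
```

Running the same count over a real pipeline is a quick way to confirm that only the adapter weights will receive gradients before launching a long training run.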


Combining Text and Image Prompts

The true power of IP-Adapter emerges when combining text and image conditioning. This enables precise control: images specify style or appearance, while text guides content and composition.

Weighted Conditioning

The balance between text and image influence is controlled by the image scale parameter, which multiplies the image attention output before it is added to the text attention output:

```python
import PIL.Image
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image


def generate_with_ip_adapter(
    pipe: StableDiffusionPipeline,
    ip_adapter: IPAdapter,
    prompt: str,
    reference_image: PIL.Image.Image,
    image_scale: float = 0.6,
    num_inference_steps: int = 50,
    guidance_scale: float = 7.5
):
    """
    Generate images using both text and image prompts.

    Args:
        prompt: Text description of desired output
        reference_image: Image to use as style/content reference
        image_scale: Weight for image conditioning (0.0 to 1.5)
                     0.0 = text only
                     0.5 = balanced
                     1.0 = strong image influence
                     >1.0 = image dominates
    """
    # Encode reference image
    image_embeddings = ip_adapter.encode_image(reference_image)

    # Set the image scale in cross-attention
    ip_adapter.set_scale(image_scale)

    # Generate with combined conditioning
    output = pipe(
        prompt=prompt,
        ip_adapter_image_embeds=image_embeddings,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale
    )

    return output.images[0]


# Example: Style transfer with content guidance
result = generate_with_ip_adapter(
    pipe=pipe,
    ip_adapter=ip_adapter,
    prompt="a cozy cabin in the mountains at sunset",
    reference_image=load_image("van_gogh_starry_night.jpg"),
    image_scale=0.7  # Strong style influence from Van Gogh
)
```
Choosing Image Scale Values:
  • 0.0-0.3: Subtle influence, mainly text-driven with hints of image style
  • 0.4-0.6: Balanced blend, good for style transfer while maintaining text content
  • 0.7-0.9: Strong image influence, output closely matches reference style/content
  • 1.0+: Image dominates, useful for variations or style-locked generation

A powerful workflow involves using negative image prompts as well. By encoding an undesired reference image and subtracting its influence, you can steer generation away from specific styles or content.
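The exact mechanics of negative image prompting vary between implementations, but the core idea can be sketched as arithmetic in embedding space. In this hedged sketch, all tensors are random stand-ins for precomputed IP-Adapter embeddings:

```python
# Hedged sketch of a negative image prompt: subtract a scaled "undesired"
# embedding from the positive one before conditioning. Tensors here are
# random stand-ins for real IP-Adapter image embeddings.
import torch

torch.manual_seed(0)
pos_embeds = torch.randn(1, 4, 768)  # stand-in: embedding of the desired reference
neg_embeds = torch.randn(1, 4, 768)  # stand-in: embedding of the undesired reference
neg_weight = 0.3                     # strength of the push away from the negative

# Steer the conditioning away from the undesired reference
steered = pos_embeds - neg_weight * neg_embeds
print(steered.shape)  # torch.Size([1, 4, 768])
```

The steered tensor then takes the place of the positive embedding in the pipeline call; `neg_weight` plays a role analogous to the negative-prompt strength in text conditioning.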


IP-Adapter Variants

Several variants of IP-Adapter have been developed for specific use cases, each with architectural modifications tailored to different applications.

IP-Adapter-FaceID

For consistent face generation, IP-Adapter-FaceID uses face recognition embeddings instead of CLIP features. This preserves identity more accurately than generic image features:

```python
class IPAdapterFaceID(nn.Module):
    """
    IP-Adapter specialized for face identity preservation.
    Uses face recognition embeddings for stronger identity consistency.
    """
    def __init__(self, face_dim: int = 512, output_dim: int = 768):
        super().__init__()
        self.output_dim = output_dim

        # Face recognition model (InsightFace/ArcFace)
        self.face_encoder = load_face_recognition_model()
        self.face_encoder.requires_grad_(False)

        # Project face embeddings to cross-attention dimension
        self.face_proj = nn.Sequential(
            nn.Linear(face_dim, output_dim),
            nn.LayerNorm(output_dim),
            nn.Linear(output_dim, output_dim * 4),  # 4 tokens
        )

        # Optional: Additional CLIP features for context
        self.use_clip_context = True
        self.clip_encoder = CLIPVisionModel.from_pretrained("...")
        self.clip_proj = nn.Linear(1024, output_dim * 4)

    def encode_face(self, face_image: torch.Tensor) -> torch.Tensor:
        """Extract and project face identity features."""
        # Get face recognition embedding
        with torch.no_grad():
            face_emb = self.face_encoder(face_image)  # [B, 512]

        # Project to IP-Adapter format
        face_tokens = self.face_proj(face_emb)  # [B, output_dim * 4]
        face_tokens = face_tokens.view(-1, 4, self.output_dim)

        if self.use_clip_context:
            # Add CLIP features for context (clothing, background hints)
            clip_features = self.clip_encoder(face_image).pooler_output
            clip_tokens = self.clip_proj(clip_features).view(-1, 4, self.output_dim)
            return torch.cat([face_tokens, clip_tokens], dim=1)  # [B, 8, D]

        return face_tokens
```

IP-Adapter-FaceID excels at tasks like:

  • Generating consistent characters across multiple images
  • Placing specific people in new scenarios or styles
  • Creating variations while preserving identity
  • Combining face identity with style transfer from another image

Plus Variants

IP-Adapter-Plus uses patch-level CLIP features instead of just the global CLS token, capturing more spatial detail:

| Variant | Features Used | Tokens | Best For |
|---|---|---|---|
| IP-Adapter | CLIP CLS token | 4 | General style transfer, simple references |
| IP-Adapter-Plus | CLIP patch tokens | 16 | Detailed style matching, textures |
| IP-Adapter-Full-Face | Face + CLIP patches | 16+ | Face + expression + context |
| IP-Adapter-Composition | Spatial features | 16 | Layout and composition transfer |

The Plus variants use the Perceiver Resampler to condense patch tokens into a fixed number of image tokens. This provides much richer spatial information while keeping the cross-attention cost manageable.
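A back-of-envelope calculation shows why condensing tokens matters: the attention score matrix has N_query x N_kv entries, so the image branch's cost grows linearly with the number of image tokens. The figures below are illustrative, assuming a 64x64 U-Net feature map:

```python
# Illustrative arithmetic only: multiply-accumulate count for the Q @ K^T
# score matrix in the image branch of cross-attention.
n_query = 4096  # assumed: a 64x64 U-Net feature map flattened to queries
dim = 768       # assumed attention feature dimension

def attn_score_macs(n_kv: int) -> int:
    # One multiply-accumulate per (query, key, channel) triple
    return n_query * n_kv * dim

# 4 tokens (base), 16 tokens (Plus), 257 tokens (raw CLIP: 256 patches + CLS)
for n_kv in (4, 16, 257):
    print(f"{n_kv:>3} image tokens -> {attn_score_macs(n_kv):,} MACs")
```

Resampling 257 raw CLIP tokens down to 16 queries cuts the image branch's score computation by roughly 16x at every cross-attention layer and every denoising step.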


Implementation Details

Here's a complete IP-Adapter implementation showing how to integrate it with a Stable Diffusion pipeline:

```python
import PIL.Image
import torch
import torch.nn as nn
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline
from transformers import CLIPImageProcessor, CLIPVisionModel


class IPAdapter:
    """
    Full IP-Adapter implementation for Stable Diffusion.
    """
    def __init__(
        self,
        pipe: StableDiffusionPipeline,
        image_encoder_path: str = "openai/clip-vit-large-patch14",
        ip_adapter_weights: str = "ip_adapter.bin",
        device: str = "cuda"
    ):
        self.pipe = pipe
        self.device = device

        # Load image encoder
        self.image_encoder = CLIPVisionModel.from_pretrained(
            image_encoder_path
        ).to(device)
        self.image_encoder.requires_grad_(False)

        # Load image processor
        self.image_processor = CLIPImageProcessor.from_pretrained(
            image_encoder_path
        )

        # Initialize projection layers
        self.image_proj = ImageProjectionMLP(
            clip_dim=1024,
            output_dim=768,
            num_tokens=4
        ).to(device)

        # Initialize cross-attention adapters
        self.setup_ip_adapter_attention()

        # Load pre-trained weights if provided
        if ip_adapter_weights:
            self.load_weights(ip_adapter_weights)

    def setup_ip_adapter_attention(self):
        """Inject IP-Adapter attention into the U-Net."""
        attn_procs = {}

        for name, attn_module in self.pipe.unet.attn_processors.items():
            if "attn2" in name:  # Cross-attention layers
                # Create IP-Adapter cross-attention
                hidden_size = attn_module.to_q.weight.shape[0]
                attn_procs[name] = IPAttnProcessor(
                    hidden_size=hidden_size,
                    cross_attention_dim=768,
                    num_tokens=4
                )
            else:
                # Keep self-attention unchanged
                attn_procs[name] = attn_module

        self.pipe.unet.set_attn_processor(attn_procs)

    @torch.no_grad()
    def encode_image(self, image: PIL.Image.Image) -> torch.Tensor:
        """Encode PIL image to IP-Adapter embeddings."""
        # Preprocess
        pixel_values = self.image_processor(
            images=image,
            return_tensors="pt"
        ).pixel_values.to(self.device)

        # Encode with CLIP
        clip_output = self.image_encoder(pixel_values)
        image_features = clip_output.pooler_output  # [1, 1024]

        # Project to IP-Adapter format
        image_embeddings = self.image_proj(image_features)  # [1, 4, 768]

        return image_embeddings

    def set_scale(self, scale: float):
        """Set the image conditioning scale."""
        for attn_proc in self.pipe.unet.attn_processors.values():
            if hasattr(attn_proc, 'scale'):
                attn_proc.scale = scale

    def generate(
        self,
        prompt: str,
        image: PIL.Image.Image,
        negative_prompt: str = "",
        scale: float = 0.6,
        num_inference_steps: int = 50,
        guidance_scale: float = 7.5,
        **kwargs
    ) -> PIL.Image.Image:
        """Generate with text + image conditioning."""
        # Encode reference image
        image_embeddings = self.encode_image(image)

        # Set conditioning scale
        self.set_scale(scale)

        # Generate
        output = self.pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            ip_adapter_image_embeds=image_embeddings,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            **kwargs
        )

        return output.images[0]


class IPAttnProcessor(nn.Module):
    """Cross-attention processor with IP-Adapter."""
    def __init__(self, hidden_size: int, cross_attention_dim: int, num_tokens: int):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_tokens = num_tokens
        self.scale = 1.0

        # Image K and V projections
        self.to_k_ip = nn.Linear(cross_attention_dim, hidden_size, bias=False)
        self.to_v_ip = nn.Linear(cross_attention_dim, hidden_size, bias=False)

    def __call__(
        self,
        attn,
        hidden_states: torch.Tensor,
        encoder_hidden_states: torch.Tensor,
        attention_mask=None,
        ip_adapter_image_embeds=None
    ):
        # Original text cross-attention
        query = attn.to_q(hidden_states)
        key = attn.to_k(encoder_hidden_states)
        value = attn.to_v(encoder_hidden_states)

        text_attn_output = F.scaled_dot_product_attention(query, key, value)

        # IP-Adapter image cross-attention
        if ip_adapter_image_embeds is not None:
            ip_key = self.to_k_ip(ip_adapter_image_embeds)
            ip_value = self.to_v_ip(ip_adapter_image_embeds)

            ip_attn_output = F.scaled_dot_product_attention(query, ip_key, ip_value)

            # Combine with scaling
            output = text_attn_output + self.scale * ip_attn_output
        else:
            output = text_attn_output

        # diffusers stores the output projection Linear at to_out[0]
        return attn.to_out[0](output)
```

Practical Applications

IP-Adapter enables numerous practical workflows that are difficult or impossible with text prompts alone:

  1. Style Transfer with Content Control: Use a painting as the image prompt and text for the subject. "A portrait of a young scientist" + Van Gogh reference = Van Gogh-style scientist portrait.
  2. Product Photography Variations: Reference an existing product photo to maintain brand aesthetics while generating new scenes or angles through text.
  3. Character Consistency: With IP-Adapter-FaceID, generate the same character across multiple scenarios while varying text prompts for different actions and settings.
  4. Mood Boards to Images: Use multiple reference images (averaged or concatenated) to capture a complex aesthetic, then refine with text.
  5. Texture Transfer: Extract detailed textures from photographs and apply them to generated content through image prompts.
Workflow Tip: Start with image_scale around 0.5, then adjust based on results. Too low and the image influence disappears; too high and the text prompt is ignored. The sweet spot depends on how specific your image and text requirements are.
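For the mood-board workflow in particular, mixing several references can be sketched as a weighted average in embedding space. This is a hedged sketch assuming precomputed IP-Adapter embeddings for each reference (random stand-ins here):

```python
# Hedged sketch of a "mood board" image prompt: weighted average of
# per-reference embeddings into a single conditioning signal. Tensors are
# random stand-ins for real IP-Adapter image embeddings.
import torch

torch.manual_seed(0)
refs = [torch.randn(1, 4, 768) for _ in range(3)]  # one embedding per reference image
weights = torch.tensor([0.5, 0.3, 0.2])            # relative importance of each

stacked = torch.stack(refs)                        # [3, 1, 4, 768]
mixed = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)
print(mixed.shape)  # torch.Size([1, 4, 768])
```

The mixed tensor drops into the pipeline wherever a single reference embedding would go; unequal weights let one reference dominate the aesthetic while others contribute accents.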

Summary

IP-Adapter represents a significant advancement in diffusion model conditioning, enabling intuitive image-based prompting alongside traditional text. Key takeaways:

  • Decoupled cross-attention allows parallel text and image conditioning without interference, keeping text capabilities intact while adding image understanding.
  • Lightweight training (~22M parameters) makes IP-Adapter accessible for customization while keeping the base model frozen.
  • CLIP image features provide rich semantic embeddings that translate well across different visual domains.
  • Variants like FaceID and Plus specialize the approach for faces, detailed styles, and spatial composition.
  • Image scale control allows precise balancing between text and image influence for different creative requirements.

In the next section, we'll explore multi-modal conditioning more broadly, examining how multiple conditioning signals can be combined and weighted for fine-grained control over generation.