Chapter 5

Native Multimodality Design

How Gemini Agents Work

Introduction

Gemini was designed from the ground up to understand multiple modalities together. This "native multimodality" approach differs fundamentally from systems that add vision or audio capabilities to a text-based model.

Native vs Bolted-On: When multimodality is native, the model doesn't translate images to text internally; it reasons about images as images and audio as audio, producing a richer understanding.

Native vs Added Multimodality

Aspect         | Native Multimodal (Gemini)    | Added Multimodal
Training       | All modalities from start     | Vision/audio added later
Representation | Unified embedding space       | Separate encoders combined
Reasoning      | Direct cross-modal            | Translation to text first
Performance    | Better on complex cross-modal | Good on single modality
Latency        | Single forward pass           | Multiple passes possible

Why Native Multimodality Matters for Agents

  • Richer perception: Understand visual context naturally
  • Better grounding: Connect text to visual elements
  • Seamless transitions: Switch between modalities fluently
  • Reduced errors: No lossy translation between modalities
📝multimodal_comparison.txt
Task: "What's wrong with this UI screenshot?"

Added Multimodal Approach:
1. Image encoder extracts features
2. Features converted to text description
3. Text model reasons about description
4. Some visual details lost in translation

Native Multimodal Approach:
1. Image and text processed together
2. Direct attention between query and image regions
3. Model "sees" and reasons simultaneously
4. Subtle visual issues detected

Handling Different Modalities

Image Processing

🐍image_processing.py
import google.generativeai as genai


class GeminiImageProcessor:
    """Process images for Gemini agents."""

    def __init__(self, model_name: str = "gemini-2.5-flash"):
        self.model = genai.GenerativeModel(model_name)

    def analyze_screenshot(self, image_path: str) -> dict:
        """Analyze a UI screenshot."""
        image = genai.upload_file(image_path)

        response = self.model.generate_content([
            """Analyze this UI screenshot. Provide:
1. UI elements present
2. Layout structure
3. Any usability issues
4. Accessibility concerns
5. Design suggestions""",
            image,
        ])

        return self._parse_analysis(response.text)

    def extract_text_from_image(self, image_path: str) -> str:
        """OCR with context understanding."""
        image = genai.upload_file(image_path)

        response = self.model.generate_content([
            "Extract all text from this image, preserving structure.",
            image,
        ])

        return response.text

    def compare_designs(
        self,
        before_path: str,
        after_path: str,
    ) -> str:
        """Compare two design versions."""
        before = genai.upload_file(before_path)
        after = genai.upload_file(after_path)

        response = self.model.generate_content([
            "Compare these two UI designs. What changed? Is the new version better?",
            "Before:",
            before,
            "After:",
            after,
        ])

        return response.text
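The `_parse_analysis` helper referenced above is left undefined. One possible sketch, assuming the model roughly follows the numbered format the prompt requests, splits the response into its numbered sections (the function name and return shape are illustrative assumptions):

```python
import re


def parse_numbered_sections(text: str) -> dict[int, str]:
    """Split a model response into numbered sections keyed by item number.

    Assumes the response follows the "1. ... 2. ..." format requested in
    the prompt; text before the first numbered item is ignored.
    """
    sections: dict[int, str] = {}
    # Each match runs from "N." up to the next "N." at a line start (or the end).
    for match in re.finditer(r"^(\d+)\.\s*(.*?)(?=^\d+\.|\Z)", text, re.M | re.S):
        sections[int(match.group(1))] = match.group(2).strip()
    return sections


example = """1. UI elements present: button, nav bar
2. Layout structure: two-column
3. Any usability issues: low-contrast CTA"""
print(parse_numbered_sections(example)[3])  # prints: Any usability issues: low-contrast CTA
```

Because model output is free-form, a production parser would also need a fallback for responses that ignore the requested structure.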

Video Processing

🐍video_processing.py
import time

import google.generativeai as genai


class GeminiVideoProcessor:
    """Process video for Gemini agents."""

    def __init__(self):
        self.model = genai.GenerativeModel("gemini-2.5-pro")

    def analyze_screen_recording(self, video_path: str) -> dict:
        """Analyze a screen recording for bugs or issues."""
        video = self._upload_and_wait(video_path)

        response = self.model.generate_content([
            """Analyze this screen recording:
1. What is the user trying to do?
2. Are there any errors or bugs visible?
3. Where does the user seem confused?
4. What UI improvements would help?
5. Timestamp any issues you find.""",
            video,
        ])

        return self._parse_video_analysis(response.text)

    def extract_tutorial_steps(self, video_path: str) -> list[dict]:
        """Extract step-by-step instructions from a tutorial video."""
        video = self._upload_and_wait(video_path)

        response = self.model.generate_content([
            """Watch this tutorial and extract:
1. Each step the user performs
2. Timestamp for each step
3. Any keyboard shortcuts or commands used
4. Common mistakes to avoid

Format as a numbered list with timestamps.""",
            video,
        ])

        return self._parse_steps(response.text)

    def _upload_and_wait(self, video_path: str):
        """Upload video and wait for processing."""
        video = genai.upload_file(video_path)

        while video.state.name == "PROCESSING":
            time.sleep(5)
            video = genai.get_file(video.name)

        if video.state.name == "FAILED":
            raise ValueError(f"Video processing failed: {video.name}")

        return video
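`_parse_steps` is likewise referenced but not defined. A minimal sketch, assuming the model returns the requested numbered list with `[mm:ss]` timestamps (the exact format is an assumption, so non-matching lines are skipped rather than treated as errors):

```python
import re


def parse_steps(text: str) -> list[dict]:
    """Parse 'N. [mm:ss] description' lines from a tutorial summary."""
    pattern = re.compile(
        r"^\s*(\d+)\.\s*\[?(\d{1,2}:\d{2}(?::\d{2})?)\]?\s*(.+)$"
    )
    steps = []
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            steps.append({
                "step": int(match.group(1)),
                "timestamp": match.group(2),
                "description": match.group(3).strip(),
            })
    return steps
```

Usage: `parse_steps("1. [00:05] Open settings")` yields a single dict with `step`, `timestamp`, and `description` keys.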

Audio Processing

🐍audio_processing.py
import google.generativeai as genai


class GeminiAudioProcessor:
    """Process audio for Gemini agents."""

    def __init__(self):
        self.model = genai.GenerativeModel("gemini-2.5-flash")

    def transcribe_with_context(self, audio_path: str) -> dict:
        """Transcribe audio with speaker identification and context."""
        audio = genai.upload_file(audio_path)

        response = self.model.generate_content([
            """Transcribe this audio:
1. Identify different speakers
2. Note any background sounds
3. Capture tone and emotion
4. Provide timestamps for key points""",
            audio,
        ])

        return self._parse_transcription(response.text)

    def analyze_meeting_audio(
        self,
        audio_path: str,
        context: str,
    ) -> dict:
        """Analyze a meeting recording."""
        audio = genai.upload_file(audio_path)

        response = self.model.generate_content([
            f"Context: {context}",
            """Analyze this meeting recording:
1. Summary of discussion
2. Key decisions made
3. Action items identified
4. Questions raised but not answered
5. Follow-up needed""",
            audio,
        ])

        return self._parse_meeting_analysis(response.text)
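For `_parse_transcription`, one hedged sketch groups the transcript into per-speaker turns, assuming the model labels lines as "Speaker N:" the way the prompt asks (the helper name and dict shape are illustrative, not part of the Gemini API):

```python
import re


def parse_speaker_turns(transcript: str) -> list[dict]:
    """Group a transcript into per-speaker turns.

    Assumes 'Speaker N: ...' labels as requested in the prompt;
    unlabeled continuation lines are appended to the current turn.
    """
    turns: list[dict] = []
    for line in transcript.splitlines():
        match = re.match(r"^(Speaker \d+):\s*(.*)$", line)
        if match:
            turns.append({"speaker": match.group(1), "text": match.group(2)})
        elif turns and line.strip():
            turns[-1]["text"] += " " + line.strip()
    return turns
```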

Multimodal Agent Patterns

Pattern 1: Visual Code Review Agent

🐍visual_code_review.py
import google.generativeai as genai


class VisualCodeReviewAgent:
    """Review code with visual context."""

    def __init__(self):
        self.model = genai.GenerativeModel("gemini-2.5-pro")

    def review_with_screenshots(
        self,
        code_diff: str,
        before_screenshot: str,
        after_screenshot: str,
    ) -> str:
        """Review code changes with visual before/after."""

        response = self.model.generate_content([
            "Review this code change with visual context:",
            "\n## Code Diff:\n" + code_diff,
            "\n## Before (UI):",
            genai.upload_file(before_screenshot),
            "\n## After (UI):",
            genai.upload_file(after_screenshot),
            """\n\nProvide:
1. Does the code change match the visual change?
2. Any visual regressions?
3. Accessibility impact?
4. Code quality assessment""",
        ])

        return response.text

    def review_component_implementation(
        self,
        design_mockup: str,
        implementation_screenshot: str,
        component_code: str,
    ) -> str:
        """Compare implementation to design."""

        response = self.model.generate_content([
            "Compare this component implementation to its design:",
            "\n## Design Mockup:",
            genai.upload_file(design_mockup),
            "\n## Implementation:",
            genai.upload_file(implementation_screenshot),
            "\n## Component Code:\n" + component_code,
            """\n\nAssess:
1. Pixel accuracy to design
2. Missing elements
3. Spacing/margin differences
4. Color matching
5. Responsive considerations""",
        ])

        return response.text

Pattern 2: Documentation Agent

🐍documentation_agent.py
import time

import google.generativeai as genai


class MultimodalDocumentationAgent:
    """Generate documentation from multiple sources."""

    def __init__(self):
        self.model = genai.GenerativeModel("gemini-2.5-pro")

    def generate_from_demo(
        self,
        screen_recording: str,
        code_files: list[str],
        existing_docs: str | None = None,
    ) -> str:
        """Generate docs from demo video and code."""

        content = [
            "Generate comprehensive documentation based on:\n",
            "## Demo Video:",
            self._upload_video(screen_recording),
            "\n## Source Code:\n",
        ]

        for file_path in code_files:
            with open(file_path) as f:
                content.append(f"\n### {file_path}\n{f.read()}")

        if existing_docs:
            content.append(f"\n## Existing Documentation:\n{existing_docs}")

        content.append("""

Generate documentation including:
1. Overview (from demo observation)
2. Installation steps
3. Usage examples (from video)
4. API reference (from code)
5. Screenshots with annotations (describe key frames)
""")

        response = self.model.generate_content(content)
        return response.text

    def _upload_video(self, path: str):
        video = genai.upload_file(path)
        while video.state.name == "PROCESSING":
            time.sleep(2)
            video = genai.get_file(video.name)
        return video
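The file-reading loop above raises if any path is missing. A slightly more defensive sketch of that step, using the same `### path` heading convention (the function name is an assumption for illustration):

```python
from pathlib import Path


def collect_code_sections(paths: list[str]) -> list[str]:
    """Read each source file and wrap it in a '### path' heading.

    Missing files are noted inline rather than raised, so one bad
    path doesn't abort the whole documentation request.
    """
    sections = []
    for path in paths:
        p = Path(path)
        if p.is_file():
            sections.append(f"\n### {path}\n{p.read_text()}")
        else:
            sections.append(f"\n### {path}\n(file not found)")
    return sections
```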

Pattern 3: Bug Report Analysis Agent

🐍bug_analysis_agent.py
import time

import google.generativeai as genai


class BugAnalysisAgent:
    """Analyze bug reports with multimodal context."""

    def __init__(self):
        self.model = genai.GenerativeModel("gemini-2.5-pro")

    def analyze_bug(
        self,
        description: str,
        screenshots: list[str],
        console_logs: str,
        reproduction_video: str | None = None,
        relevant_code: dict[str, str] | None = None,
    ) -> dict:
        """Comprehensive bug analysis."""

        content = [f"Bug Report: {description}\n\n"]

        # Add screenshots
        content.append("Screenshots:\n")
        for i, screenshot in enumerate(screenshots):
            content.append(f"Screenshot {i + 1}:")
            content.append(genai.upload_file(screenshot))

        # Add console logs
        content.append(f"\n\nConsole Logs:\n{console_logs}")

        # Add reproduction video if available
        if reproduction_video:
            content.append("\n\nReproduction Video:")
            content.append(self._upload_video(reproduction_video))

        # Add relevant code
        if relevant_code:
            content.append("\n\nRelevant Code:")
            for file, code in relevant_code.items():
                content.append(f"\n### {file}\n{code}")

        content.append("""

Analyze this bug:
1. Root cause hypothesis
2. Steps to reproduce (from video if available)
3. Affected code areas
4. Suggested fix
5. Testing approach
6. Risk assessment
""")

        response = self.model.generate_content(content)
        return self._parse_bug_analysis(response.text)

    def _upload_video(self, path: str):
        """Upload a video and poll until the File API finishes processing."""
        video = genai.upload_file(path)
        while video.state.name == "PROCESSING":
            time.sleep(2)
            video = genai.get_file(video.name)
        return video

Implementation Examples

Complete Multimodal Agent

🐍multimodal_agent.py
import time
from dataclasses import dataclass

import google.generativeai as genai


@dataclass
class MultimodalInput:
    text: str
    images: list[str] | None = None
    videos: list[str] | None = None
    audio: list[str] | None = None

    def to_content(self) -> list:
        """Convert to Gemini content format."""
        content = [self.text]

        if self.images:
            for img in self.images:
                content.append(genai.upload_file(img))

        if self.videos:
            for vid in self.videos:
                content.append(self._upload_video(vid))

        if self.audio:
            for aud in self.audio:
                content.append(genai.upload_file(aud))

        return content

    def _upload_video(self, path: str):
        video = genai.upload_file(path)
        while video.state.name == "PROCESSING":
            time.sleep(2)
            video = genai.get_file(video.name)
        return video


class MultimodalAgent:
    """Agent that processes multimodal inputs."""

    def __init__(self, model_name: str = "gemini-2.5-pro"):
        self.model = genai.GenerativeModel(
            model_name=model_name,
            tools=[self.search_codebase, self.execute_command],
        )
        self.chat = None

    def start_session(self, system_prompt: str) -> None:
        """Start a new conversation session."""
        self.chat = self.model.start_chat(history=[])
        # Send system context as first message
        self.chat.send_message(system_prompt)

    def process(self, input: MultimodalInput) -> str:
        """Process multimodal input and return response."""
        if not self.chat:
            self.start_session("You are a helpful coding assistant.")

        content = input.to_content()
        response = self.chat.send_message(content)

        return response.text

    def search_codebase(self, query: str) -> str:
        """Search the codebase for relevant files."""
        # Implementation
        return "Found files..."

    def execute_command(self, command: str) -> str:
        """Execute a shell command."""
        # Implementation with safety checks
        return "Command output..."


# Usage
agent = MultimodalAgent()
agent.start_session("""
You are a senior developer helping debug issues.
Analyze all provided context - text, images, videos.
Be thorough and specific in your analysis.
""")

result = agent.process(MultimodalInput(
    text="Why is this button not working?",
    images=["screenshot.png", "console_error.png"],
    videos=["reproduction.mp4"],
))

Modality Selection

Not every task needs all modalities. Use text for simple queries, add images for UI issues, add video for complex interactions. Match modality to task requirements.
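That guidance can be sketched as a small routing helper. This is a heuristic sketch only; the keyword lists are illustrative assumptions, and a real agent would look at the artifacts actually attached to the task rather than its wording:

```python
def select_modalities(task: str) -> set[str]:
    """Heuristic: choose input modalities to attach based on task wording."""
    task_lower = task.lower()
    modalities = {"text"}  # every task carries a text query
    # Illustrative keyword lists; tune for your own task mix.
    if any(kw in task_lower for kw in ("ui", "screenshot", "layout", "design")):
        modalities.add("image")
    if any(kw in task_lower for kw in ("recording", "reproduce", "interaction", "tutorial")):
        modalities.add("video")
    if any(kw in task_lower for kw in ("meeting", "call", "transcribe")):
        modalities.add("audio")
    return modalities
```

Attaching only what a task needs keeps uploads, latency, and token usage down, which matters most for video.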

Summary

Native multimodality in Gemini:

  1. Native design: All modalities trained together
  2. Richer understanding: Direct cross-modal reasoning
  3. Multiple inputs: Images, video, audio, text combined
  4. Agent patterns: Visual code review, documentation, bug analysis
  5. Implementation: Unified API for all modalities
Next: Let's explore Gemini's controllable reasoning depth and how thinking models enable complex problem-solving.