Chapter 5

Gemini Multimodal Architecture

How Gemini Agents Work

Introduction

Google's Gemini represents a fundamentally different approach to AI agents. Built from the ground up as a multimodal system, Gemini natively understands text, images, audio, video, and code, making it uniquely suited for agents that need to perceive and act in a rich, multimodal world.

The Multimodal Advantage: While many models bolt on vision or audio as separate capabilities, Gemini was trained with multimodal understanding from the start. This enables more natural cross-modal reasoning and generation.

The Gemini Model Family

Gemini offers models optimized for different use cases:

Model                      Context      Strengths                      Best For
Gemini 2.5 Pro             1M tokens    Deep reasoning, long context   Complex analysis, research
Gemini 2.5 Flash           1M tokens    Speed, efficiency, cost        High-volume, real-time
Gemini 2.0 Flash           1M tokens    Balanced, multimodal           General purpose
Gemini 2.0 Flash Thinking  32K output   Extended reasoning             Complex problem-solving
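
You can check which model variants your API key can actually reach by enumerating them at runtime. A minimal sketch (the generateContent filter is just one reasonable choice):

🐍 list_models.py
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Print every model that supports text generation
for m in genai.list_models():
    if "generateContent" in m.supported_generation_methods:
        print(m.name, "-", m.display_name)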

Key Differentiators

  • Massive context: 1 million token context window
  • Native multimodal: Text, image, audio, video in one model
  • Thinking models: Extended reasoning like o3
  • Grounding: Integration with Google Search
  • Code execution: Native code running capability

Architecture Overview

πŸ“gemini_architecture.txt
1β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
2β”‚                    GEMINI AGENT ARCHITECTURE                   β”‚
3β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
4β”‚                                                                β”‚
5β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
6β”‚  β”‚                    INPUT LAYER                            β”‚  β”‚
7β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚  β”‚
8β”‚  β”‚  β”‚  Text    β”‚ β”‚  Image   β”‚ β”‚  Audio   β”‚ β”‚  Video   β”‚     β”‚  β”‚
9β”‚  β”‚  β”‚  Prompt  β”‚ β”‚  Files   β”‚ β”‚  Clips   β”‚ β”‚  Streams β”‚     β”‚  β”‚
10β”‚  β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜     β”‚  β”‚
11β”‚  β”‚       β”‚            β”‚            β”‚            β”‚            β”‚  β”‚
12β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
13β”‚          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β”‚
14β”‚                              β”‚                                  β”‚
15β”‚                              β–Ό                                  β”‚
16β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
17β”‚  β”‚              MULTIMODAL ENCODER                          β”‚  β”‚
18β”‚  β”‚  Unified representation across all modalities            β”‚  β”‚
19β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
20β”‚                              β”‚                                  β”‚
21β”‚                              β–Ό                                  β”‚
22β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
23β”‚  β”‚              GEMINI CORE MODEL                            β”‚  β”‚
24β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚  β”‚
25β”‚  β”‚  β”‚  Transformer with Multimodal Attention            β”‚   β”‚  β”‚
26β”‚  β”‚  β”‚  - Cross-modal reasoning                          β”‚   β”‚  β”‚
27β”‚  β”‚  β”‚  - Extended context (1M tokens)                   β”‚   β”‚  β”‚
28β”‚  β”‚  β”‚  - Native tool calling                            β”‚   β”‚  β”‚
29β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚  β”‚
30β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
31β”‚                              β”‚                                  β”‚
32β”‚                              β–Ό                                  β”‚
33β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
34β”‚  β”‚                    OUTPUT LAYER                           β”‚  β”‚
35β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚  β”‚
36β”‚  β”‚  β”‚  Text    β”‚ β”‚  Code    β”‚ β”‚  Tool    β”‚ β”‚  Media   β”‚     β”‚  β”‚
37β”‚  β”‚  β”‚  Responseβ”‚ β”‚  Exec    β”‚ β”‚  Calls   β”‚ β”‚  Gen     β”‚     β”‚  β”‚
38β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚  β”‚
39β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
40β”‚                                                                β”‚
41β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Core Components

Component              Purpose                    Key Feature
Multimodal Encoder     Unified input processing   Single representation for all modalities
Cross-Modal Attention  Relate across modalities   Reason about an image in the context of text
Context Manager        Handle long sequences      1M token context window
Tool Interface         External actions           Function calling, code execution
Grounding System       Real-world knowledge       Google Search integration
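
To make the Context Manager row concrete: before sending a very large input, you can measure how much of the 1M-token window it will consume with the SDK's token counter. A short sketch (large_document.txt is a placeholder file):

🐍 count_tokens.py
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-flash")

# Measure a document against the 1M-token window
with open("large_document.txt") as f:  # placeholder file
    text = f.read()

count = model.count_tokens(text).total_tokens
print(f"{count:,} tokens ({count / 1_000_000:.1%} of a 1M window)")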

Multimodal Core

Gemini's multimodal capabilities enable powerful agent behaviors:

🐍 gemini_multimodal.py
import time

import google.generativeai as genai

# Configure the API
genai.configure(api_key="YOUR_API_KEY")

# Initialize the model
model = genai.GenerativeModel("gemini-2.5-flash")


def analyze_image(image_path: str, question: str) -> str:
    """Analyze an image and answer questions about it."""
    image = genai.upload_file(image_path)

    response = model.generate_content([
        question,
        image,
    ])

    return response.text


def analyze_video(video_path: str, question: str) -> str:
    """Analyze a video and answer questions about it."""
    video = genai.upload_file(video_path)

    # Videos are processed asynchronously; poll until ready
    while video.state.name == "PROCESSING":
        time.sleep(2)
        video = genai.get_file(video.name)

    response = model.generate_content([
        question,
        video,
    ])

    return response.text


def analyze_audio(audio_path: str, question: str) -> str:
    """Analyze audio and answer questions about it."""
    audio = genai.upload_file(audio_path)

    response = model.generate_content([
        question,
        audio,
    ])

    return response.text


# Example: multimodal agent task
def multimodal_analysis(
    text_context: str,
    image_paths: list[str],
    audio_path: str | None = None,
) -> str:
    """Combine multiple modalities in a single request."""
    content = [text_context]

    # Add images
    for path in image_paths:
        content.append(genai.upload_file(path))

    # Add audio if provided
    if audio_path:
        content.append(genai.upload_file(audio_path))

    response = model.generate_content(content)
    return response.text
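
A quick usage sketch for multimodal_analysis; every file name below is a placeholder:

🐍 multimodal_usage.py
# Placeholder inputs -- substitute your own files
report = multimodal_analysis(
    text_context="Summarize the quarterly results shown in these charts.",
    image_paths=["q1_revenue.png", "q2_revenue.png"],
    audio_path="earnings_call.mp3",
)
print(report)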

Cross-Modal Reasoning

🐍 cross_modal_reasoning.py
import google.generativeai as genai


class CrossModalAgent:
    """Agent that reasons across modalities."""

    def __init__(self):
        self.model = genai.GenerativeModel("gemini-2.5-pro")

    def analyze_codebase_with_diagrams(
        self,
        code_files: list[str],
        architecture_diagram: str,
        question: str,
    ) -> str:
        """Analyze code in the context of an architecture diagram."""
        content = [
            f"Question: {question}\n\n",
            "Architecture diagram:\n",
            genai.upload_file(architecture_diagram),
            "\n\nCode files:\n",
        ]

        for file_path in code_files:
            with open(file_path) as f:
                content.append(f"\n--- {file_path} ---\n{f.read()}")

        response = self.model.generate_content(content)
        return response.text

    def analyze_bug_report(
        self,
        bug_description: str,
        screenshots: list[str],
        logs: str,
        code_context: str,
    ) -> str:
        """Analyze a bug using text, images, and code together."""
        content = [
            f"Bug report: {bug_description}\n\n",
            "Screenshots showing the issue:\n",
        ]

        for screenshot in screenshots:
            content.append(genai.upload_file(screenshot))

        content.append(f"\n\nLogs:\n{logs}")
        content.append(f"\n\nRelevant code:\n{code_context}")
        content.append("\n\nAnalyze this bug and suggest fixes.")

        response = self.model.generate_content(content)
        return response.text
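
Usage looks like this; every artifact below is a hypothetical placeholder:

🐍 cross_modal_usage.py
agent = CrossModalAgent()

# Hypothetical bug artifacts -- substitute real files from your tracker
diagnosis = agent.analyze_bug_report(
    bug_description="Checkout button unresponsive on mobile Safari",
    screenshots=["bug_screen_1.png", "bug_screen_2.png"],
    logs="TypeError: undefined is not a function (checkout.js:42)",
    code_context=open("checkout.js").read(),
)
print(diagnosis)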

Agent Capabilities

Function Calling

🐍 gemini_function_calling.py
import google.generativeai as genai

# Define tools as plain Python functions; the SDK builds the
# function declarations from their signatures and docstrings
def get_weather(location: str) -> dict:
    """Get weather for a location."""
    # Implementation
    return {"temp": 72, "condition": "sunny"}

def search_web(query: str) -> list[dict]:
    """Search the web for information."""
    # Implementation
    return [{"title": "Result 1", "url": "..."}]

def execute_code(code: str, language: str) -> dict:
    """Execute code and return the result."""
    # Implementation
    return {"output": "Hello World", "error": None}

# Configure the model with tools
model = genai.GenerativeModel(
    model_name="gemini-2.5-flash",
    tools=[get_weather, search_web, execute_code],
)

# Start a chat; with automatic function calling enabled, the SDK
# runs your functions and feeds the results back to the model
chat = model.start_chat(enable_automatic_function_calling=True)

response = chat.send_message(
    "What's the weather in San Francisco? "
    "Also search for the best coffee shops there."
)

# The model calls the functions and folds the results into its answer
print(response.text)
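
If you would rather control the tool loop yourself, skip automatic calling and dispatch the model's function calls manually. A minimal sketch under the same model and tools as above (single round, error handling omitted):

🐍 manual_function_calling.py
# Manual tool loop -- a sketch; assumes the model and tool
# functions defined in gemini_function_calling.py
tool_registry = {
    "get_weather": get_weather,
    "search_web": search_web,
    "execute_code": execute_code,
}

chat = model.start_chat()  # automatic calling disabled
response = chat.send_message("What's the weather in San Francisco?")

for part in response.candidates[0].content.parts:
    if fn := part.function_call:
        # Run the requested tool with the model-provided arguments
        result = tool_registry[fn.name](**dict(fn.args))
        # Return the result so the model can finish its answer
        response = chat.send_message(
            genai.protos.Content(parts=[genai.protos.Part(
                function_response=genai.protos.FunctionResponse(
                    name=fn.name,
                    response={"result": result},
                )
            )])
        )

print(response.text)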

Code Execution

🐍 gemini_code_execution.py
import google.generativeai as genai

# Enable the built-in code execution tool
model = genai.GenerativeModel(
    model_name="gemini-2.5-flash",
    tools="code_execution",
)

# Ask it to solve a problem with code
response = model.generate_content("""
Calculate the first 20 Fibonacci numbers and plot them.
Then find which Fibonacci number is closest to 1000.
""")

# Gemini will:
# 1. Write Python code
# 2. Execute it in a sandbox
# 3. Return the result with any generated plots

print(response.text)

# Inspect the generated code and its output
for part in response.parts:
    if part.executable_code:
        print(f"Code: {part.executable_code.code}")
    if part.code_execution_result:
        print(f"Output: {part.code_execution_result.output}")
Search Grounding

🐍 gemini_grounding.py
import google.generativeai as genai

# Enable Google Search grounding. (The tool spelling below matches
# the google-generativeai SDK; newer SDKs may name it differently.)
model = genai.GenerativeModel(
    model_name="gemini-2.5-flash",
    tools="google_search_retrieval",
)

# Query that needs real-time information
response = model.generate_content(
    "What are the latest developments in AI agents? "
    "Include specific announcements from the past week."
)

# The response includes grounded information from Google Search
print(response.text)

# Access the grounding metadata to see the sources
metadata = response.candidates[0].grounding_metadata
if metadata:
    for chunk in metadata.grounding_chunks:
        print(f"Source: {chunk.web.uri}")

Combine Capabilities

Gemini shines when you combine capabilities: pairing multimodal input with code execution and grounding unlocks powerful agent workflows, as in the sketch below.
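
As one illustrative sketch of such a combination (the chart file is a placeholder), a single request can pair an uploaded image with the code-execution tool:

🐍 combined_capabilities.py
import google.generativeai as genai

# Multimodal input + code execution in one request
model = genai.GenerativeModel(
    model_name="gemini-2.5-flash",
    tools="code_execution",
)

chart = genai.upload_file("sales_chart.png")  # placeholder file
response = model.generate_content([
    "Read the data points off this chart, then write and run Python "
    "code to fit a trend line and forecast the next quarter.",
    chart,
])
print(response.text)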

Summary

Gemini's multimodal architecture:

  1. Native multimodal: Text, image, audio, video in one model
  2. Massive context: 1 million token context window
  3. Cross-modal reasoning: Relate information across modalities
  4. Built-in tools: Code execution and Google Search grounding
  5. Flexible models: Pro for depth, Flash for speed

Next: Let's explore Gemini's native multimodal design and how it enables unique agent capabilities.