Chapter 5

Gemini Multimodal Architecture

How Gemini Agents Work

Introduction

Google's Gemini represents a fundamentally different approach to AI agents. Built from the ground up as a multimodal system, Gemini natively understands text, images, audio, video, and code, making it uniquely suited for agents that need to perceive and act in a rich, multimodal world.

The Multimodal Advantage: While many models bolt on vision or audio as separate capabilities, Gemini was trained with multimodal understanding from the start. This enables more natural cross-modal reasoning and generation.

The Gemini Model Family

Gemini offers models optimized for different use cases:

Model                      Context      Strengths                      Best For
Gemini 2.5 Pro             1M tokens    Deep reasoning, long context   Complex analysis, research
Gemini 2.5 Flash           1M tokens    Speed, efficiency, cost        High-volume, real-time
Gemini 2.0 Flash           1M tokens    Balanced, multimodal           General purpose
Gemini 2.0 Flash Thinking  32K output   Extended reasoning             Complex problem-solving
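
You can check which model variants your API key can actually reach by enumerating them at runtime. A minimal sketch (the generateContent filter is just one reasonable choice):

🐍 list_models.py
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Print every model that supports text generation
for m in genai.list_models():
    if "generateContent" in m.supported_generation_methods:
        print(m.name, "-", m.display_name)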

Key Differentiators

  • Massive context: 1 million token context window
  • Native multimodal: Text, image, audio, video in one model
  • Thinking models: Extended reasoning like o3
  • Grounding: Integration with Google Search
  • Code execution: Native code running capability

Architecture Overview

πŸ“gemini_architecture.txt
1β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
2β”‚                    GEMINI AGENT ARCHITECTURE                   β”‚
3β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
4β”‚                                                                β”‚
5β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
6β”‚  β”‚                    INPUT LAYER                            β”‚  β”‚
7β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚  β”‚
8β”‚  β”‚  β”‚  Text    β”‚ β”‚  Image   β”‚ β”‚  Audio   β”‚ β”‚  Video   β”‚     β”‚  β”‚
9β”‚  β”‚  β”‚  Prompt  β”‚ β”‚  Files   β”‚ β”‚  Clips   β”‚ β”‚  Streams β”‚     β”‚  β”‚
10β”‚  β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜     β”‚  β”‚
11β”‚  β”‚       β”‚            β”‚            β”‚            β”‚            β”‚  β”‚
12β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
13β”‚          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β”‚
14β”‚                              β”‚                                  β”‚
15β”‚                              β–Ό                                  β”‚
16β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
17β”‚  β”‚              MULTIMODAL ENCODER                          β”‚  β”‚
18β”‚  β”‚  Unified representation across all modalities            β”‚  β”‚
19β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
20β”‚                              β”‚                                  β”‚
21β”‚                              β–Ό                                  β”‚
22β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
23β”‚  β”‚              GEMINI CORE MODEL                            β”‚  β”‚
24β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚  β”‚
25β”‚  β”‚  β”‚  Transformer with Multimodal Attention            β”‚   β”‚  β”‚
26β”‚  β”‚  β”‚  - Cross-modal reasoning                          β”‚   β”‚  β”‚
27β”‚  β”‚  β”‚  - Extended context (1M tokens)                   β”‚   β”‚  β”‚
28β”‚  β”‚  β”‚  - Native tool calling                            β”‚   β”‚  β”‚
29β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚  β”‚
30β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
31β”‚                              β”‚                                  β”‚
32β”‚                              β–Ό                                  β”‚
33β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
34β”‚  β”‚                    OUTPUT LAYER                           β”‚  β”‚
35β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚  β”‚
36β”‚  β”‚  β”‚  Text    β”‚ β”‚  Code    β”‚ β”‚  Tool    β”‚ β”‚  Media   β”‚     β”‚  β”‚
37β”‚  β”‚  β”‚  Responseβ”‚ β”‚  Exec    β”‚ β”‚  Calls   β”‚ β”‚  Gen     β”‚     β”‚  β”‚
38β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚  β”‚
39β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
40β”‚                                                                β”‚
41β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Core Components

Component              Purpose                    Key Feature
Multimodal Encoder     Unified input processing   Single representation for all modalities
Cross-Modal Attention  Relate across modalities   Reason about an image in the context of text
Context Manager        Handle long sequences      1M token context window
Tool Interface         External actions           Function calling, code execution
Grounding System       Real-world knowledge       Google Search integration
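
To make the Context Manager row concrete: before sending a very large input, you can measure how much of the 1M-token window it will consume with the SDK's token counter. A short sketch (large_document.txt is a placeholder file):

🐍 count_tokens.py
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-flash")

# Measure a document against the 1M-token window
with open("large_document.txt") as f:  # placeholder file
    text = f.read()

count = model.count_tokens(text).total_tokens
print(f"{count:,} tokens ({count / 1_000_000:.1%} of a 1M window)")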

Multimodal Core

Gemini's multimodal capabilities enable powerful agent behaviors:

🐍 gemini_multimodal.py
import time

import google.generativeai as genai

# Configure the API
genai.configure(api_key="YOUR_API_KEY")

# Initialize the model
model = genai.GenerativeModel("gemini-2.5-flash")


def analyze_image(image_path: str, question: str) -> str:
    """Analyze an image and answer questions about it."""
    image = genai.upload_file(image_path)

    response = model.generate_content([
        question,
        image,
    ])

    return response.text


def analyze_video(video_path: str, question: str) -> str:
    """Analyze a video and answer questions about it."""
    video = genai.upload_file(video_path)

    # Videos are processed asynchronously; poll until ready
    while video.state.name == "PROCESSING":
        time.sleep(2)
        video = genai.get_file(video.name)

    response = model.generate_content([
        question,
        video,
    ])

    return response.text


def analyze_audio(audio_path: str, question: str) -> str:
    """Analyze audio and answer questions about it."""
    audio = genai.upload_file(audio_path)

    response = model.generate_content([
        question,
        audio,
    ])

    return response.text


# Example: multimodal agent task
def multimodal_analysis(
    text_context: str,
    image_paths: list[str],
    audio_path: str | None = None,
) -> str:
    """Combine multiple modalities in a single request."""
    content = [text_context]

    # Add images
    for path in image_paths:
        content.append(genai.upload_file(path))

    # Add audio if provided
    if audio_path:
        content.append(genai.upload_file(audio_path))

    response = model.generate_content(content)
    return response.text
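
A quick usage sketch for multimodal_analysis; every file name below is a placeholder:

🐍 multimodal_usage.py
# Placeholder inputs -- substitute your own files
report = multimodal_analysis(
    text_context="Summarize the quarterly results shown in these charts.",
    image_paths=["q1_revenue.png", "q2_revenue.png"],
    audio_path="earnings_call.mp3",
)
print(report)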

Cross-Modal Reasoning

🐍 cross_modal_reasoning.py
import google.generativeai as genai


class CrossModalAgent:
    """Agent that reasons across modalities."""

    def __init__(self):
        self.model = genai.GenerativeModel("gemini-2.5-pro")

    def analyze_codebase_with_diagrams(
        self,
        code_files: list[str],
        architecture_diagram: str,
        question: str,
    ) -> str:
        """Analyze code in the context of an architecture diagram."""
        content = [
            f"Question: {question}\n\n",
            "Architecture diagram:\n",
            genai.upload_file(architecture_diagram),
            "\n\nCode files:\n",
        ]

        for file_path in code_files:
            with open(file_path) as f:
                content.append(f"\n--- {file_path} ---\n{f.read()}")

        response = self.model.generate_content(content)
        return response.text

    def analyze_bug_report(
        self,
        bug_description: str,
        screenshots: list[str],
        logs: str,
        code_context: str,
    ) -> str:
        """Analyze a bug using text, images, and code together."""
        content = [
            f"Bug report: {bug_description}\n\n",
            "Screenshots showing the issue:\n",
        ]

        for screenshot in screenshots:
            content.append(genai.upload_file(screenshot))

        content.append(f"\n\nLogs:\n{logs}")
        content.append(f"\n\nRelevant code:\n{code_context}")
        content.append("\n\nAnalyze this bug and suggest fixes.")

        response = self.model.generate_content(content)
        return response.text
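
Usage looks like this; every artifact below is a hypothetical placeholder:

🐍 cross_modal_usage.py
agent = CrossModalAgent()

# Hypothetical bug artifacts -- substitute real files from your tracker
diagnosis = agent.analyze_bug_report(
    bug_description="Checkout button unresponsive on mobile Safari",
    screenshots=["bug_screen_1.png", "bug_screen_2.png"],
    logs="TypeError: undefined is not a function (checkout.js:42)",
    code_context=open("checkout.js").read(),
)
print(diagnosis)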

Agent Capabilities

Function Calling

🐍 gemini_function_calling.py
import google.generativeai as genai

# Define tools as plain Python functions; the SDK builds the
# function declarations from their signatures and docstrings
def get_weather(location: str) -> dict:
    """Get weather for a location."""
    # Implementation
    return {"temp": 72, "condition": "sunny"}

def search_web(query: str) -> list[dict]:
    """Search the web for information."""
    # Implementation
    return [{"title": "Result 1", "url": "..."}]

def execute_code(code: str, language: str) -> dict:
    """Execute code and return the result."""
    # Implementation
    return {"output": "Hello World", "error": None}

# Configure the model with tools
model = genai.GenerativeModel(
    model_name="gemini-2.5-flash",
    tools=[get_weather, search_web, execute_code],
)

# Start a chat; with automatic function calling enabled, the SDK
# runs your functions and feeds the results back to the model
chat = model.start_chat(enable_automatic_function_calling=True)

response = chat.send_message(
    "What's the weather in San Francisco? "
    "Also search for the best coffee shops there."
)

# The model calls the functions and folds the results into its answer
print(response.text)
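
If you would rather control the tool loop yourself, skip automatic calling and dispatch the model's function calls manually. A minimal sketch under the same model and tools as above (single round, error handling omitted):

🐍 manual_function_calling.py
# Manual tool loop -- a sketch; assumes the model and tool
# functions defined in gemini_function_calling.py
tool_registry = {
    "get_weather": get_weather,
    "search_web": search_web,
    "execute_code": execute_code,
}

chat = model.start_chat()  # automatic calling disabled
response = chat.send_message("What's the weather in San Francisco?")

for part in response.candidates[0].content.parts:
    if fn := part.function_call:
        # Run the requested tool with the model-provided arguments
        result = tool_registry[fn.name](**dict(fn.args))
        # Return the result so the model can finish its answer
        response = chat.send_message(
            genai.protos.Content(parts=[genai.protos.Part(
                function_response=genai.protos.FunctionResponse(
                    name=fn.name,
                    response={"result": result},
                )
            )])
        )

print(response.text)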

Code Execution

🐍 gemini_code_execution.py
import google.generativeai as genai

# Enable the built-in code execution tool
model = genai.GenerativeModel(
    model_name="gemini-2.5-flash",
    tools="code_execution",
)

# Ask it to solve a problem with code
response = model.generate_content("""
Calculate the first 20 Fibonacci numbers and plot them.
Then find which Fibonacci number is closest to 1000.
""")

# Gemini will:
# 1. Write Python code
# 2. Execute it in a sandbox
# 3. Return the result with any generated plots

print(response.text)

# Inspect the generated code and its output
for part in response.parts:
    if part.executable_code:
        print(f"Code: {part.executable_code.code}")
    if part.code_execution_result:
        print(f"Output: {part.code_execution_result.output}")
Search Grounding

🐍 gemini_grounding.py
import google.generativeai as genai

# Enable Google Search grounding. (The tool spelling below matches
# the google-generativeai SDK; newer SDKs may name it differently.)
model = genai.GenerativeModel(
    model_name="gemini-2.5-flash",
    tools="google_search_retrieval",
)

# Query that needs real-time information
response = model.generate_content(
    "What are the latest developments in AI agents? "
    "Include specific announcements from the past week."
)

# The response includes grounded information from Google Search
print(response.text)

# Access the grounding metadata to see the sources
metadata = response.candidates[0].grounding_metadata
if metadata:
    for chunk in metadata.grounding_chunks:
        print(f"Source: {chunk.web.uri}")

Combine Capabilities

Gemini shines when you combine capabilities: pairing multimodal input with code execution and grounding unlocks powerful agent workflows, as in the sketch below.
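
As one illustrative sketch of such a combination (the chart file is a placeholder), a single request can pair an uploaded image with the code-execution tool:

🐍 combined_capabilities.py
import google.generativeai as genai

# Multimodal input + code execution in one request
model = genai.GenerativeModel(
    model_name="gemini-2.5-flash",
    tools="code_execution",
)

chart = genai.upload_file("sales_chart.png")  # placeholder file
response = model.generate_content([
    "Read the data points off this chart, then write and run Python "
    "code to fit a trend line and forecast the next quarter.",
    chart,
])
print(response.text)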

Summary

Gemini's multimodal architecture:

  1. Native multimodal: Text, image, audio, video in one model
  2. Massive context: 1 million token context window
  3. Cross-modal reasoning: Relate information across modalities
  4. Built-in tools: Code execution and Google Search grounding
  5. Flexible models: Pro for depth, Flash for speed

Next: Let's explore Gemini's native multimodal design and how it enables unique agent capabilities.