Chapter 13

Research Agent Architecture

Building a Research Agent

Introduction

Research agents are AI systems designed to autonomously search, gather, and synthesize information from multiple sources. Unlike simple chatbots that rely solely on their training data, research agents actively explore the web, read documents, and compile findings into coherent reports.

Key Insight: A research agent transforms the tedious process of manual research into an automated workflow that can explore dozens of sources, extract key information, and synthesize findings in a fraction of the time.

Research Agent Overview

A research agent differs from other agent types in its focus on information gathering rather than action execution. While a coding agent modifies files, a research agent reads, analyzes, and summarizes.

| Aspect | Coding Agent | Research Agent |
| --- | --- | --- |
| Primary Goal | Modify code and files | Gather and synthesize information |
| Main Tools | File read/write, execute | Search, scrape, summarize |
| Output | Working code changes | Reports and insights |
| Verification | Tests pass | Sources cited and verified |
| Iteration | Fix errors until complete | Explore until comprehensive |

Research Agent Capabilities

  • Web Search: Query search engines to find relevant sources on any topic
  • Content Extraction: Scrape and parse web pages, PDFs, and other documents
  • Information Synthesis: Combine information from multiple sources into coherent summaries
  • Source Verification: Cross-reference facts and verify claims against multiple sources
  • Citation Management: Track and cite all sources used in research

Core Components

The research agent architecture consists of several specialized components that work together:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List
from enum import Enum
from datetime import datetime


class SourceType(Enum):
    """Types of information sources."""
    WEB_PAGE = "web_page"
    PDF = "pdf"
    ACADEMIC_PAPER = "academic_paper"
    NEWS_ARTICLE = "news_article"
    DOCUMENTATION = "documentation"
    SOCIAL_MEDIA = "social_media"
    DATABASE = "database"


class ResearchPhase(Enum):
    """Phases of the research process."""
    PLANNING = "planning"
    SEARCHING = "searching"
    GATHERING = "gathering"
    ANALYZING = "analyzing"
    SYNTHESIZING = "synthesizing"
    VERIFYING = "verifying"
    REPORTING = "reporting"
    COMPLETE = "complete"


@dataclass
class Source:
    """Represents an information source."""
    url: str
    title: str
    source_type: SourceType
    content: str = ""
    extracted_at: datetime = field(default_factory=datetime.now)
    reliability_score: float = 0.0
    metadata: Dict[str, Any] = field(default_factory=dict)

    def summary(self) -> str:
        """Get a brief summary of this source."""
        return f"[{self.source_type.value}] {self.title} ({self.url})"


@dataclass
class Finding:
    """A piece of information extracted from sources."""
    content: str
    source_urls: List[str]
    confidence: float
    category: str
    verified: bool = False
    verification_sources: List[str] = field(default_factory=list)

    def cite(self) -> str:
        """Generate a citation for this finding."""
        sources = ", ".join(self.source_urls[:3])
        if len(self.source_urls) > 3:
            sources += f" (+{len(self.source_urls) - 3} more)"
        return f"{self.content} (Sources: {sources})"


@dataclass
class ResearchQuery:
    """A research query or question to investigate."""
    question: str
    keywords: List[str] = field(default_factory=list)
    scope: str = "comprehensive"  # comprehensive, quick, deep
    max_sources: int = 10
    required_source_types: List[SourceType] = field(default_factory=list)
    exclude_domains: List[str] = field(default_factory=list)


@dataclass
class ResearchState:
    """Complete state of a research session."""
    query: ResearchQuery
    phase: ResearchPhase = ResearchPhase.PLANNING
    sources: List[Source] = field(default_factory=list)
    findings: List[Finding] = field(default_factory=list)
    search_queries_used: List[str] = field(default_factory=list)
    urls_visited: set = field(default_factory=set)
    errors: List[Dict[str, str]] = field(default_factory=list)
    started_at: datetime = field(default_factory=datetime.now)
    iteration: int = 0

    def add_source(self, source: Source) -> None:
        """Add a source, skipping URLs we have already visited."""
        if source.url not in self.urls_visited:
            self.sources.append(source)
            self.urls_visited.add(source.url)

    def add_finding(self, finding: Finding) -> None:
        """Add a finding from the research."""
        self.findings.append(finding)

    def get_unverified_findings(self) -> List[Finding]:
        """Get findings that haven't been verified yet."""
        return [f for f in self.findings if not f.verified]

    def summary(self) -> Dict[str, Any]:
        """Get a summary of the current research state."""
        return {
            "phase": self.phase.value,
            "sources_count": len(self.sources),
            "findings_count": len(self.findings),
            "verified_findings": sum(1 for f in self.findings if f.verified),
            "iterations": self.iteration,
            "search_queries": len(self.search_queries_used),
        }
```

Component Responsibilities

| Component | Responsibility |
| --- | --- |
| Source | Represents a single information source with content and metadata |
| Finding | A piece of verified information with citations |
| ResearchQuery | Defines what to research and constraints |
| ResearchState | Tracks all state throughout the research process |
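
To make the citation format concrete, here is the `Finding.cite` logic exercised in isolation. This is a condensed copy of the class above, and the content and URLs are placeholders:

```python
from dataclasses import dataclass
from typing import List


# Condensed copy of the Finding class above, just enough to show
# how citations render when a finding has more than three sources.
@dataclass
class Finding:
    content: str
    source_urls: List[str]

    def cite(self) -> str:
        sources = ", ".join(self.source_urls[:3])
        if len(self.source_urls) > 3:
            sources += f" (+{len(self.source_urls) - 3} more)"
        return f"{self.content} (Sources: {sources})"


finding = Finding(
    content="Most sources agree on the core claim",
    source_urls=["https://a.example", "https://b.example",
                 "https://c.example", "https://d.example"],
)
print(finding.cite())
# → Most sources agree on the core claim (Sources: https://a.example, https://b.example, https://c.example (+1 more))
```

Only the first three URLs are listed; the rest are collapsed into a count, which keeps citations readable even for heavily cross-referenced findings.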

Information Flow

Information flows through the research agent in a structured pipeline:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class ResearchTool(ABC):
    """Base class for research tools."""

    @property
    @abstractmethod
    def name(self) -> str:
        """Tool name."""
        pass

    @property
    @abstractmethod
    def description(self) -> str:
        """Tool description for the LLM."""
        pass

    @abstractmethod
    async def execute(self, **kwargs) -> Dict[str, Any]:
        """Execute the tool."""
        pass


class SearchTool(ResearchTool):
    """Search the web for information."""

    @property
    def name(self) -> str:
        return "web_search"

    @property
    def description(self) -> str:
        return """Search the web for information on a topic.
        Args:
            query: The search query
            num_results: Number of results to return (default 10)
        Returns search results with titles, URLs, and snippets."""

    async def execute(self, query: str, num_results: int = 10) -> Dict[str, Any]:
        # Implementation covered in next section
        pass


class ScrapeTool(ResearchTool):
    """Extract content from a web page."""

    @property
    def name(self) -> str:
        return "scrape_page"

    @property
    def description(self) -> str:
        return """Extract and parse content from a web page.
        Args:
            url: The URL to scrape
            extract_type: What to extract (text, links, structured)
        Returns the extracted content."""

    async def execute(self, url: str, extract_type: str = "text") -> Dict[str, Any]:
        # Implementation covered in scraping section
        pass


class SynthesizeTool(ResearchTool):
    """Synthesize information from multiple sources."""

    @property
    def name(self) -> str:
        return "synthesize"

    @property
    def description(self) -> str:
        return """Combine and synthesize information from gathered sources.
        Args:
            sources: List of source contents to synthesize
            focus: What aspect to focus on
        Returns synthesized summary."""

    async def execute(self, sources: List[str], focus: str) -> Dict[str, Any]:
        # Implementation covered in synthesis section
        pass


class ResearchToolRegistry:
    """Registry of all research tools."""

    def __init__(self):
        self.tools: Dict[str, ResearchTool] = {}

    def register(self, tool: ResearchTool) -> None:
        """Register a tool."""
        self.tools[tool.name] = tool

    def get(self, name: str) -> Optional[ResearchTool]:
        """Get a tool by name."""
        return self.tools.get(name)

    def list_tools(self) -> List[Dict[str, str]]:
        """List all available tools."""
        return [
            {"name": t.name, "description": t.description}
            for t in self.tools.values()
        ]

    def get_tools_prompt(self) -> str:
        """Get tools description for LLM prompt."""
        tools_desc = []
        for tool in self.tools.values():
            tools_desc.append(f"- {tool.name}: {tool.description}")
        return "\n".join(tools_desc)
```

The tool registry pattern allows easy extension of the research agent with new capabilities without modifying core logic.
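
As a sketch of that extension point, here is a hypothetical fact-check tool registered alongside the others. The registry and base class are condensed copies of the ones above, and the tool's name, description, and stub behavior are made up for the example:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


# Condensed copies of ResearchTool and ResearchToolRegistry from above.
class ResearchTool(ABC):
    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def description(self) -> str: ...

    @abstractmethod
    async def execute(self, **kwargs) -> Dict[str, Any]: ...


class FactCheckTool(ResearchTool):
    """Hypothetical new capability: cross-check a claim."""

    @property
    def name(self) -> str:
        return "fact_check"

    @property
    def description(self) -> str:
        return "Cross-check a claim against already-gathered sources."

    async def execute(self, claim: str) -> Dict[str, Any]:
        return {"claim": claim, "status": "unchecked"}  # stub


class ResearchToolRegistry:
    def __init__(self):
        self.tools: Dict[str, ResearchTool] = {}

    def register(self, tool: ResearchTool) -> None:
        self.tools[tool.name] = tool

    def get_tools_prompt(self) -> str:
        return "\n".join(
            f"- {t.name}: {t.description}" for t in self.tools.values()
        )


registry = ResearchToolRegistry()
registry.register(FactCheckTool())
print(registry.get_tools_prompt())
# → - fact_check: Cross-check a claim against already-gathered sources.
```

Nothing in the agent's core loop has to change: once registered, the new tool appears in the prompt the LLM sees.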

Research Strategies

Different research questions require different strategies. The agent should adapt its approach based on the query type:

```python
from abc import ABC, abstractmethod
from typing import List


class ResearchStrategy(ABC):
    """Base class for research strategies."""

    @property
    @abstractmethod
    def name(self) -> str:
        pass

    @abstractmethod
    def plan_searches(self, query: ResearchQuery) -> List[str]:
        """Generate search queries for this research."""
        pass

    @abstractmethod
    def should_continue(self, state: ResearchState) -> bool:
        """Determine if more research is needed."""
        pass

    @abstractmethod
    def prioritize_sources(self, sources: List[Source]) -> List[Source]:
        """Prioritize which sources to examine first."""
        pass


class ComprehensiveStrategy(ResearchStrategy):
    """Strategy for thorough, comprehensive research."""

    @property
    def name(self) -> str:
        return "comprehensive"

    def plan_searches(self, query: ResearchQuery) -> List[str]:
        """Generate multiple search queries for comprehensive coverage."""
        base_question = query.question
        keywords = query.keywords

        searches = [
            base_question,  # Direct question
            f"{base_question} overview",
            f"{base_question} detailed explanation",
            f"{base_question} research papers",
            f"{base_question} expert opinions",
        ]

        # Add keyword variations
        for keyword in keywords[:3]:
            searches.append(f"{keyword} {base_question}")

        return searches

    def should_continue(self, state: ResearchState) -> bool:
        """Continue until we have comprehensive coverage."""
        # Minimum requirements
        if len(state.sources) < 5:
            return True
        if len(state.findings) < 3:
            return True

        # Check for unverified findings
        unverified = state.get_unverified_findings()
        if len(unverified) > len(state.findings) * 0.5:
            return True

        # Max iterations check
        return state.iteration < 10

    def prioritize_sources(self, sources: List[Source]) -> List[Source]:
        """Prioritize by reliability and relevance."""
        return sorted(sources, key=lambda s: s.reliability_score, reverse=True)


class QuickStrategy(ResearchStrategy):
    """Strategy for quick, focused research."""

    @property
    def name(self) -> str:
        return "quick"

    def plan_searches(self, query: ResearchQuery) -> List[str]:
        """Generate focused search queries."""
        return [
            query.question,
            f"{query.question} quick answer",
        ]

    def should_continue(self, state: ResearchState) -> bool:
        """Stop early with sufficient information."""
        if len(state.sources) >= 3 and len(state.findings) >= 1:
            return False
        return state.iteration < 3

    def prioritize_sources(self, sources: List[Source]) -> List[Source]:
        """Prioritize official and well-known sources."""
        priority_domains = ["wikipedia.org", "gov", "edu", "official"]

        def score(source: Source) -> int:
            for i, domain in enumerate(priority_domains):
                if domain in source.url:
                    return len(priority_domains) - i
            return 0

        return sorted(sources, key=score, reverse=True)


class DeepDiveStrategy(ResearchStrategy):
    """Strategy for deep, academic-style research."""

    @property
    def name(self) -> str:
        return "deep"

    def plan_searches(self, query: ResearchQuery) -> List[str]:
        """Generate academic-focused search queries."""
        base = query.question

        return [
            f"{base} research paper",
            f"{base} academic study",
            f"{base} peer reviewed",
            f"{base} systematic review",
            f"{base} methodology",
            f"{base} empirical evidence",
        ]

    def should_continue(self, state: ResearchState) -> bool:
        """Continue until deep understanding is achieved."""
        # Need multiple verified findings
        verified = [f for f in state.findings if f.verified]
        if len(verified) < 5:
            return True

        # Check for academic sources
        academic = [s for s in state.sources
                    if s.source_type == SourceType.ACADEMIC_PAPER]
        if len(academic) < 2:
            return True

        return state.iteration < 15

    def prioritize_sources(self, sources: List[Source]) -> List[Source]:
        """Prioritize academic and research sources."""
        type_priority = {
            SourceType.ACADEMIC_PAPER: 4,
            SourceType.DOCUMENTATION: 3,
            SourceType.NEWS_ARTICLE: 2,
            SourceType.WEB_PAGE: 1,
        }

        return sorted(
            sources,
            key=lambda s: type_priority.get(s.source_type, 0),
            reverse=True
        )


def get_strategy(name: str) -> ResearchStrategy:
    """Get a research strategy by name."""
    strategies = {
        "comprehensive": ComprehensiveStrategy(),
        "quick": QuickStrategy(),
        "deep": DeepDiveStrategy(),
    }
    return strategies.get(name, ComprehensiveStrategy())
```

Choose the right strategy based on the user's needs: quick research for simple questions, comprehensive for general topics, and a deep dive for academic or technical research.
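
The domain-priority ranking inside `QuickStrategy.prioritize_sources` is easy to see in isolation. This is a condensed copy of its scoring function applied to a few made-up URLs:

```python
# Condensed copy of QuickStrategy's scoring: earlier entries in
# priority_domains outrank later ones; unknown domains score 0.
priority_domains = ["wikipedia.org", "gov", "edu", "official"]


def score(url: str) -> int:
    for i, domain in enumerate(priority_domains):
        if domain in url:
            return len(priority_domains) - i
    return 0


urls = [
    "https://someblog.example/post",
    "https://en.wikipedia.org/wiki/Research",
    "https://www.nasa.gov/report",
]
ranked = sorted(urls, key=score, reverse=True)
print(ranked)
# → Wikipedia first (score 4), then the .gov page (score 3), then the blog (score 0)
```

Note that substring matching like `"gov" in url` is deliberately crude; a production implementation would parse the hostname to avoid false matches.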

Agent State Design

The research agent needs to track complex state across multiple phases. Here's the core agent class structure; several of the phase handlers it dispatches to are implemented in later sections:

```python
class ResearchAgent:
    """
    An agent that searches, gathers, and synthesizes information.
    """

    def __init__(
        self,
        llm_client,
        tool_registry: ResearchToolRegistry,
        max_iterations: int = 15
    ):
        self.llm = llm_client
        self.tools = tool_registry
        self.max_iterations = max_iterations
        self.strategies = {
            "comprehensive": ComprehensiveStrategy(),
            "quick": QuickStrategy(),
            "deep": DeepDiveStrategy(),
        }

    async def research(
        self,
        question: str,
        scope: str = "comprehensive",
        **kwargs
    ) -> Dict[str, Any]:
        """
        Conduct research on a question.

        Args:
            question: The research question
            scope: Research scope (comprehensive, quick, deep)
            **kwargs: Additional query parameters

        Returns:
            Research results with findings and sources
        """
        # Create research query
        query = ResearchQuery(
            question=question,
            scope=scope,
            **kwargs
        )

        # Initialize state
        state = ResearchState(query=query)

        # Get appropriate strategy
        strategy = self.strategies.get(scope, self.strategies["comprehensive"])

        # Run research loop
        while strategy.should_continue(state) and state.iteration < self.max_iterations:
            state.iteration += 1

            # Determine next action based on phase
            if state.phase == ResearchPhase.PLANNING:
                await self._plan_research(state, strategy)
            elif state.phase == ResearchPhase.SEARCHING:
                await self._search_phase(state, strategy)
            elif state.phase == ResearchPhase.GATHERING:
                await self._gather_phase(state, strategy)
            elif state.phase == ResearchPhase.ANALYZING:
                await self._analyze_phase(state)
            elif state.phase == ResearchPhase.SYNTHESIZING:
                await self._synthesize_phase(state)
            elif state.phase == ResearchPhase.VERIFYING:
                await self._verify_phase(state)
            elif state.phase == ResearchPhase.REPORTING:
                break

        # Generate final report
        state.phase = ResearchPhase.COMPLETE
        report = await self._generate_report(state)

        return {
            "question": question,
            "report": report,
            "sources": [s.summary() for s in state.sources],
            "findings": [f.cite() for f in state.findings],
            "metadata": state.summary()
        }

    async def _plan_research(
        self,
        state: ResearchState,
        strategy: ResearchStrategy
    ) -> None:
        """Plan the research approach."""
        # Generate search queries
        searches = strategy.plan_searches(state.query)
        state.search_queries_used.extend(searches)

        # Move to searching phase
        state.phase = ResearchPhase.SEARCHING

    async def _search_phase(
        self,
        state: ResearchState,
        strategy: ResearchStrategy
    ) -> None:
        """Execute searches to find sources."""
        search_tool = self.tools.get("web_search")
        if not search_tool:
            state.phase = ResearchPhase.GATHERING
            return

        # Execute pending searches
        for query in state.search_queries_used:
            try:
                results = await search_tool.execute(
                    query=query,
                    num_results=5
                )

                for result in results.get("results", []):
                    source = Source(
                        url=result["url"],
                        title=result["title"],
                        source_type=self._classify_source(result["url"]),
                    )
                    state.add_source(source)

            except Exception as e:
                state.errors.append({
                    "phase": "searching",
                    "query": query,
                    "error": str(e)
                })

        # Move to gathering phase
        state.phase = ResearchPhase.GATHERING

    async def _gather_phase(
        self,
        state: ResearchState,
        strategy: ResearchStrategy
    ) -> None:
        """Gather content from sources."""
        scrape_tool = self.tools.get("scrape_page")
        if not scrape_tool:
            state.phase = ResearchPhase.ANALYZING
            return

        # Prioritize sources
        prioritized = strategy.prioritize_sources(state.sources)

        # Gather content from top sources
        for source in prioritized[:state.query.max_sources]:
            if source.content:  # Already gathered
                continue

            try:
                result = await scrape_tool.execute(url=source.url)
                source.content = result.get("content", "")
                source.reliability_score = self._score_reliability(source)
            except Exception as e:
                state.errors.append({
                    "phase": "gathering",
                    "url": source.url,
                    "error": str(e)
                })

        # Move to analyzing phase
        state.phase = ResearchPhase.ANALYZING

    def _classify_source(self, url: str) -> SourceType:
        """Classify a source based on its URL."""
        url_lower = url.lower()

        if "arxiv.org" in url_lower or "doi.org" in url_lower:
            return SourceType.ACADEMIC_PAPER
        elif ".pdf" in url_lower:
            return SourceType.PDF
        elif "news" in url_lower or "bbc" in url_lower or "cnn" in url_lower:
            return SourceType.NEWS_ARTICLE
        elif "docs." in url_lower or "documentation" in url_lower:
            return SourceType.DOCUMENTATION
        elif "twitter" in url_lower or "reddit" in url_lower:
            return SourceType.SOCIAL_MEDIA
        else:
            return SourceType.WEB_PAGE

    def _score_reliability(self, source: Source) -> float:
        """Score the reliability of a source."""
        score = 0.5  # Base score

        # Adjust by source type
        type_scores = {
            SourceType.ACADEMIC_PAPER: 0.9,
            SourceType.DOCUMENTATION: 0.8,
            SourceType.NEWS_ARTICLE: 0.6,
            SourceType.WEB_PAGE: 0.5,
            SourceType.SOCIAL_MEDIA: 0.3,
        }
        score = type_scores.get(source.source_type, score)

        # Boost for known reliable domains
        reliable_domains = [".gov", ".edu", "wikipedia.org"]
        for domain in reliable_domains:
            if domain in source.url:
                score = min(1.0, score + 0.1)

        return score
```

Research agents can consume significant API resources. Always implement rate limiting and cost controls for production deployments.

Summary

In this section, we established the foundation for building a research agent:

  • Core Data Structures: Source, Finding, ResearchQuery, and ResearchState classes to track all research information
  • Tool Registry: Extensible system for search, scrape, and synthesis tools
  • Research Strategies: Different approaches for quick, comprehensive, and deep research
  • Agent State Machine: Phased approach moving through planning, searching, gathering, analyzing, synthesizing, and reporting
  • Source Classification: Automatic categorization and reliability scoring of sources

In the next section, we'll implement the web search integration that allows our agent to find relevant sources across the internet.