Introduction
Research agents are AI systems designed to autonomously search, gather, and synthesize information from multiple sources. Unlike simple chatbots that rely solely on their training data, research agents actively explore the web, read documents, and compile their findings into coherent reports.
Key Insight: A research agent transforms the tedious process of manual research into an automated workflow that can explore dozens of sources, extract key information, and synthesize findings in a fraction of the time.
Research Agent Overview
A research agent differs from other agent types in its focus on information gathering rather than action execution. While a coding agent modifies files, a research agent reads, analyzes, and summarizes.
| Aspect | Coding Agent | Research Agent |
|---|---|---|
| Primary Goal | Modify code and files | Gather and synthesize information |
| Main Tools | File read/write, execute | Search, scrape, summarize |
| Output | Working code changes | Reports and insights |
| Verification | Tests pass | Sources cited and verified |
| Iteration | Fix errors until complete | Explore until comprehensive |
Research Agent Capabilities
- Web Search: Query search engines to find relevant sources on any topic
- Content Extraction: Scrape and parse web pages, PDFs, and other documents
- Information Synthesis: Combine information from multiple sources into coherent summaries
- Source Verification: Cross-reference facts and verify claims against multiple sources
- Citation Management: Track and cite all sources used in research
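These capabilities chain together into a single round of research. The toy sketch below shows the shape of that pipeline with hard-coded stand-ins for search and extraction; the function names and data here are illustrative only, not part of any real API:

```python
# A toy, end-to-end sketch of one research round: search -> extract -> synthesize.
# All data is hard-coded; the real tools are developed later in this chapter.

def search(query: str) -> list[dict]:
    """Stand-in for a web search: returns canned results for any query."""
    return [
        {"url": "https://example.org/a", "snippet": "Research agents gather sources."},
        {"url": "https://example.org/b", "snippet": "Findings should cite sources."},
    ]

def extract(result: dict) -> str:
    """Stand-in for content extraction: here, just the snippet."""
    return result["snippet"]

def synthesize(texts: list[str], citations: list[str]) -> str:
    """Combine extracted text into a summary with a citation trail."""
    body = " ".join(texts)
    return f"{body} (Sources: {', '.join(citations)})"

results = search("what is a research agent")
texts = [extract(r) for r in results]
summary = synthesize(texts, [r["url"] for r in results])
print(summary)
```

Even in this toy form, the citation trail is threaded through every step — a pattern the real components below preserve.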
Core Components
The research agent architecture consists of several specialized components that work together:
```python
from dataclasses import dataclass, field
from typing import List, Dict, Any, Optional
from enum import Enum
from datetime import datetime
from pathlib import Path


class SourceType(Enum):
    """Types of information sources."""
    WEB_PAGE = "web_page"
    PDF = "pdf"
    ACADEMIC_PAPER = "academic_paper"
    NEWS_ARTICLE = "news_article"
    DOCUMENTATION = "documentation"
    SOCIAL_MEDIA = "social_media"
    DATABASE = "database"


class ResearchPhase(Enum):
    """Phases of the research process."""
    PLANNING = "planning"
    SEARCHING = "searching"
    GATHERING = "gathering"
    ANALYZING = "analyzing"
    SYNTHESIZING = "synthesizing"
    VERIFYING = "verifying"
    REPORTING = "reporting"
    COMPLETE = "complete"


@dataclass
class Source:
    """Represents an information source."""
    url: str
    title: str
    source_type: SourceType
    content: str = ""
    extracted_at: datetime = field(default_factory=datetime.now)
    reliability_score: float = 0.0
    metadata: Dict[str, Any] = field(default_factory=dict)

    def summary(self) -> str:
        """Get a brief summary of this source."""
        return f"[{self.source_type.value}] {self.title} ({self.url})"


@dataclass
class Finding:
    """A piece of information extracted from sources."""
    content: str
    source_urls: List[str]
    confidence: float
    category: str
    verified: bool = False
    verification_sources: List[str] = field(default_factory=list)

    def cite(self) -> str:
        """Generate a citation for this finding."""
        sources = ", ".join(self.source_urls[:3])
        if len(self.source_urls) > 3:
            sources += f" (+{len(self.source_urls) - 3} more)"
        return f"{self.content} (Sources: {sources})"


@dataclass
class ResearchQuery:
    """A research query or question to investigate."""
    question: str
    keywords: List[str] = field(default_factory=list)
    scope: str = "comprehensive"  # comprehensive, quick, deep
    max_sources: int = 10
    required_source_types: List[SourceType] = field(default_factory=list)
    exclude_domains: List[str] = field(default_factory=list)


@dataclass
class ResearchState:
    """Complete state of a research session."""
    query: ResearchQuery
    phase: ResearchPhase = ResearchPhase.PLANNING
    sources: List[Source] = field(default_factory=list)
    findings: List[Finding] = field(default_factory=list)
    search_queries_used: List[str] = field(default_factory=list)
    urls_visited: set = field(default_factory=set)
    errors: List[Dict[str, str]] = field(default_factory=list)
    started_at: datetime = field(default_factory=datetime.now)
    iteration: int = 0

    def add_source(self, source: Source) -> None:
        """Add a source to the research, skipping URLs already visited."""
        if source.url not in self.urls_visited:
            self.sources.append(source)
            self.urls_visited.add(source.url)

    def add_finding(self, finding: Finding) -> None:
        """Add a finding from the research."""
        self.findings.append(finding)

    def get_unverified_findings(self) -> List[Finding]:
        """Get findings that haven't been verified yet."""
        return [f for f in self.findings if not f.verified]

    def summary(self) -> Dict[str, Any]:
        """Get a summary of the current research state."""
        return {
            "phase": self.phase.value,
            "sources_count": len(self.sources),
            "findings_count": len(self.findings),
            "verified_findings": sum(1 for f in self.findings if f.verified),
            "iterations": self.iteration,
            "search_queries": len(self.search_queries_used),
        }
```

Component Responsibilities
| Component | Responsibility |
|---|---|
| Source | Represents a single information source with content and metadata |
| Finding | A piece of verified information with citations |
| ResearchQuery | Defines what to research and constraints |
| ResearchState | Tracks all state throughout the research process |
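As a quick sanity check of how these pieces interact, here is a condensed, self-contained version of `Source` and `ResearchState` (only the fields needed for the demo) showing that `add_source` deduplicates by URL and that `summary()` reports counts rather than raw content:

```python
from dataclasses import dataclass, field

# Condensed versions of Source and ResearchState, reduced to the fields
# needed to demonstrate the state-tracking behavior.

@dataclass
class Source:
    url: str
    title: str

@dataclass
class ResearchState:
    sources: list = field(default_factory=list)
    urls_visited: set = field(default_factory=set)
    findings: list = field(default_factory=list)

    def add_source(self, source: Source) -> None:
        # Duplicate URLs are silently skipped
        if source.url not in self.urls_visited:
            self.sources.append(source)
            self.urls_visited.add(source.url)

    def summary(self) -> dict:
        return {"sources_count": len(self.sources),
                "findings_count": len(self.findings)}

state = ResearchState()
state.add_source(Source("https://example.org/a", "Page A"))
state.add_source(Source("https://example.org/a", "Page A again"))  # duplicate, skipped
state.add_source(Source("https://example.org/b", "Page B"))
print(state.summary())  # → {'sources_count': 2, 'findings_count': 0}
```

Tracking visited URLs in a set keeps the dedup check O(1) even when a long research session accumulates hundreds of sources.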
Information Flow
Information flows through the research agent in a structured pipeline:
```python
from abc import ABC, abstractmethod


class ResearchTool(ABC):
    """Base class for research tools."""

    @property
    @abstractmethod
    def name(self) -> str:
        """Tool name."""
        pass

    @property
    @abstractmethod
    def description(self) -> str:
        """Tool description for the LLM."""
        pass

    @abstractmethod
    async def execute(self, **kwargs) -> Dict[str, Any]:
        """Execute the tool."""
        pass


class SearchTool(ResearchTool):
    """Search the web for information."""

    @property
    def name(self) -> str:
        return "web_search"

    @property
    def description(self) -> str:
        return """Search the web for information on a topic.
        Args:
            query: The search query
            num_results: Number of results to return (default 10)
        Returns search results with titles, URLs, and snippets."""

    async def execute(self, query: str, num_results: int = 10) -> Dict[str, Any]:
        # Implementation covered in the next section
        pass


class ScrapeTool(ResearchTool):
    """Extract content from a web page."""

    @property
    def name(self) -> str:
        return "scrape_page"

    @property
    def description(self) -> str:
        return """Extract and parse content from a web page.
        Args:
            url: The URL to scrape
            extract_type: What to extract (text, links, structured)
        Returns the extracted content."""

    async def execute(self, url: str, extract_type: str = "text") -> Dict[str, Any]:
        # Implementation covered in the scraping section
        pass


class SynthesizeTool(ResearchTool):
    """Synthesize information from multiple sources."""

    @property
    def name(self) -> str:
        return "synthesize"

    @property
    def description(self) -> str:
        return """Combine and synthesize information from gathered sources.
        Args:
            sources: List of source contents to synthesize
            focus: What aspect to focus on
        Returns synthesized summary."""

    async def execute(self, sources: List[str], focus: str) -> Dict[str, Any]:
        # Implementation covered in the synthesis section
        pass


class ResearchToolRegistry:
    """Registry of all research tools."""

    def __init__(self):
        self.tools: Dict[str, ResearchTool] = {}

    def register(self, tool: ResearchTool) -> None:
        """Register a tool."""
        self.tools[tool.name] = tool

    def get(self, name: str) -> Optional[ResearchTool]:
        """Get a tool by name."""
        return self.tools.get(name)

    def list_tools(self) -> List[Dict[str, str]]:
        """List all available tools."""
        return [
            {"name": t.name, "description": t.description}
            for t in self.tools.values()
        ]

    def get_tools_prompt(self) -> str:
        """Get the tools description for the LLM prompt."""
        tools_desc = []
        for tool in self.tools.values():
            tools_desc.append(f"- {tool.name}: {tool.description}")
        return "\n".join(tools_desc)
```

Research Strategies
Different research questions require different strategies. The agent should adapt its approach based on the query type:
```python
from abc import ABC, abstractmethod


class ResearchStrategy(ABC):
    """Base class for research strategies."""

    @property
    @abstractmethod
    def name(self) -> str:
        pass

    @abstractmethod
    def plan_searches(self, query: ResearchQuery) -> List[str]:
        """Generate search queries for this research."""
        pass

    @abstractmethod
    def should_continue(self, state: ResearchState) -> bool:
        """Determine if more research is needed."""
        pass

    @abstractmethod
    def prioritize_sources(self, sources: List[Source]) -> List[Source]:
        """Prioritize which sources to examine first."""
        pass


class ComprehensiveStrategy(ResearchStrategy):
    """Strategy for thorough, comprehensive research."""

    @property
    def name(self) -> str:
        return "comprehensive"

    def plan_searches(self, query: ResearchQuery) -> List[str]:
        """Generate multiple search queries for comprehensive coverage."""
        base_question = query.question
        keywords = query.keywords

        searches = [
            base_question,  # Direct question
            f"{base_question} overview",
            f"{base_question} detailed explanation",
            f"{base_question} research papers",
            f"{base_question} expert opinions",
        ]

        # Add keyword variations
        for keyword in keywords[:3]:
            searches.append(f"{keyword} {base_question}")

        return searches

    def should_continue(self, state: ResearchState) -> bool:
        """Continue until we have comprehensive coverage."""
        # Minimum requirements
        if len(state.sources) < 5:
            return True
        if len(state.findings) < 3:
            return True

        # Check for unverified findings
        unverified = state.get_unverified_findings()
        if len(unverified) > len(state.findings) * 0.5:
            return True

        # Max iterations check
        return state.iteration < 10

    def prioritize_sources(self, sources: List[Source]) -> List[Source]:
        """Prioritize by reliability and relevance."""
        return sorted(sources, key=lambda s: s.reliability_score, reverse=True)


class QuickStrategy(ResearchStrategy):
    """Strategy for quick, focused research."""

    @property
    def name(self) -> str:
        return "quick"

    def plan_searches(self, query: ResearchQuery) -> List[str]:
        """Generate focused search queries."""
        return [
            query.question,
            f"{query.question} quick answer",
        ]

    def should_continue(self, state: ResearchState) -> bool:
        """Stop early with sufficient information."""
        if len(state.sources) >= 3 and len(state.findings) >= 1:
            return False
        return state.iteration < 3

    def prioritize_sources(self, sources: List[Source]) -> List[Source]:
        """Prioritize official and well-known sources."""
        priority_domains = ["wikipedia.org", "gov", "edu", "official"]

        def score(source: Source) -> int:
            # Loose substring match against the URL; earlier entries rank higher
            for i, domain in enumerate(priority_domains):
                if domain in source.url:
                    return len(priority_domains) - i
            return 0

        return sorted(sources, key=score, reverse=True)


class DeepDiveStrategy(ResearchStrategy):
    """Strategy for deep, academic-style research."""

    @property
    def name(self) -> str:
        return "deep"

    def plan_searches(self, query: ResearchQuery) -> List[str]:
        """Generate academic-focused search queries."""
        base = query.question

        return [
            f"{base} research paper",
            f"{base} academic study",
            f"{base} peer reviewed",
            f"{base} systematic review",
            f"{base} methodology",
            f"{base} empirical evidence",
        ]

    def should_continue(self, state: ResearchState) -> bool:
        """Continue until deep understanding is achieved."""
        # Need multiple verified findings
        verified = [f for f in state.findings if f.verified]
        if len(verified) < 5:
            return True

        # Check for academic sources
        academic = [s for s in state.sources
                    if s.source_type == SourceType.ACADEMIC_PAPER]
        if len(academic) < 2:
            return True

        return state.iteration < 15

    def prioritize_sources(self, sources: List[Source]) -> List[Source]:
        """Prioritize academic and research sources."""
        type_priority = {
            SourceType.ACADEMIC_PAPER: 4,
            SourceType.DOCUMENTATION: 3,
            SourceType.NEWS_ARTICLE: 2,
            SourceType.WEB_PAGE: 1,
        }

        return sorted(
            sources,
            key=lambda s: type_priority.get(s.source_type, 0),
            reverse=True
        )


def get_strategy(name: str) -> ResearchStrategy:
    """Get a research strategy by name."""
    strategies = {
        "comprehensive": ComprehensiveStrategy(),
        "quick": QuickStrategy(),
        "deep": DeepDiveStrategy(),
    }
    return strategies.get(name, ComprehensiveStrategy())
```

Agent State Design
The research agent needs to track complex state across multiple phases. Here's the core agent class structure; the analysis, synthesis, verification, and reporting helpers it calls are implemented in later sections:
```python
class ResearchAgent:
    """
    An agent that searches, gathers, and synthesizes information.
    """

    def __init__(
        self,
        llm_client,
        tool_registry: ResearchToolRegistry,
        max_iterations: int = 15
    ):
        self.llm = llm_client
        self.tools = tool_registry
        self.max_iterations = max_iterations
        self.strategies = {
            "comprehensive": ComprehensiveStrategy(),
            "quick": QuickStrategy(),
            "deep": DeepDiveStrategy(),
        }

    async def research(
        self,
        question: str,
        scope: str = "comprehensive",
        **kwargs
    ) -> Dict[str, Any]:
        """
        Conduct research on a question.

        Args:
            question: The research question
            scope: Research scope (comprehensive, quick, deep)
            **kwargs: Additional query parameters

        Returns:
            Research results with findings and sources
        """
        # Create the research query
        query = ResearchQuery(
            question=question,
            scope=scope,
            **kwargs
        )

        # Initialize state
        state = ResearchState(query=query)

        # Get the appropriate strategy
        strategy = self.strategies.get(scope, self.strategies["comprehensive"])

        # Run the research loop
        while strategy.should_continue(state) and state.iteration < self.max_iterations:
            state.iteration += 1

            # Determine the next action based on phase
            if state.phase == ResearchPhase.PLANNING:
                await self._plan_research(state, strategy)
            elif state.phase == ResearchPhase.SEARCHING:
                await self._search_phase(state, strategy)
            elif state.phase == ResearchPhase.GATHERING:
                await self._gather_phase(state, strategy)
            elif state.phase == ResearchPhase.ANALYZING:
                await self._analyze_phase(state)
            elif state.phase == ResearchPhase.SYNTHESIZING:
                await self._synthesize_phase(state)
            elif state.phase == ResearchPhase.VERIFYING:
                await self._verify_phase(state)
            elif state.phase == ResearchPhase.REPORTING:
                break

        # Generate the final report
        state.phase = ResearchPhase.COMPLETE
        report = await self._generate_report(state)

        return {
            "question": question,
            "report": report,
            "sources": [s.summary() for s in state.sources],
            "findings": [f.cite() for f in state.findings],
            "metadata": state.summary()
        }

    async def _plan_research(
        self,
        state: ResearchState,
        strategy: ResearchStrategy
    ) -> None:
        """Plan the research approach."""
        # Generate search queries
        searches = strategy.plan_searches(state.query)
        state.search_queries_used.extend(searches)

        # Move to the searching phase
        state.phase = ResearchPhase.SEARCHING

    async def _search_phase(
        self,
        state: ResearchState,
        strategy: ResearchStrategy
    ) -> None:
        """Execute searches to find sources."""
        search_tool = self.tools.get("web_search")
        if not search_tool:
            state.phase = ResearchPhase.GATHERING
            return

        # Execute pending searches
        for query in state.search_queries_used:
            try:
                results = await search_tool.execute(
                    query=query,
                    num_results=5
                )

                for result in results.get("results", []):
                    source = Source(
                        url=result["url"],
                        title=result["title"],
                        source_type=self._classify_source(result["url"]),
                    )
                    state.add_source(source)

            except Exception as e:
                state.errors.append({
                    "phase": "searching",
                    "query": query,
                    "error": str(e)
                })

        # Move to the gathering phase
        state.phase = ResearchPhase.GATHERING

    async def _gather_phase(
        self,
        state: ResearchState,
        strategy: ResearchStrategy
    ) -> None:
        """Gather content from sources."""
        scrape_tool = self.tools.get("scrape_page")
        if not scrape_tool:
            state.phase = ResearchPhase.ANALYZING
            return

        # Prioritize sources
        prioritized = strategy.prioritize_sources(state.sources)

        # Gather content from the top sources
        for source in prioritized[:state.query.max_sources]:
            if source.content:  # Already gathered
                continue

            try:
                result = await scrape_tool.execute(url=source.url)
                source.content = result.get("content", "")
                source.reliability_score = self._score_reliability(source)
            except Exception as e:
                state.errors.append({
                    "phase": "gathering",
                    "url": source.url,
                    "error": str(e)
                })

        # Move to the analyzing phase
        state.phase = ResearchPhase.ANALYZING

    def _classify_source(self, url: str) -> SourceType:
        """Classify a source based on its URL."""
        url_lower = url.lower()

        if "arxiv.org" in url_lower or "doi.org" in url_lower:
            return SourceType.ACADEMIC_PAPER
        elif ".pdf" in url_lower:
            return SourceType.PDF
        elif "news" in url_lower or "bbc" in url_lower or "cnn" in url_lower:
            return SourceType.NEWS_ARTICLE
        elif "docs." in url_lower or "documentation" in url_lower:
            return SourceType.DOCUMENTATION
        elif "twitter" in url_lower or "reddit" in url_lower:
            return SourceType.SOCIAL_MEDIA
        else:
            return SourceType.WEB_PAGE

    def _score_reliability(self, source: Source) -> float:
        """Score the reliability of a source."""
        score = 0.5  # Base score

        # Adjust by source type
        type_scores = {
            SourceType.ACADEMIC_PAPER: 0.9,
            SourceType.DOCUMENTATION: 0.8,
            SourceType.NEWS_ARTICLE: 0.6,
            SourceType.WEB_PAGE: 0.5,
            SourceType.SOCIAL_MEDIA: 0.3,
        }
        score = type_scores.get(source.source_type, score)

        # Boost for known reliable domains
        reliable_domains = [".gov", ".edu", "wikipedia.org"]
        for domain in reliable_domains:
            if domain in source.url:
                score = min(1.0, score + 0.1)

        return score
```

Summary
In this section, we established the foundation for building a research agent:
- Core Data Structures: Source, Finding, ResearchQuery, and ResearchState classes to track all research information
- Tool Registry: Extensible system for search, scrape, and synthesis tools
- Research Strategies: Different approaches for quick, comprehensive, and deep research
- Agent State Machine: Phased approach moving through planning, searching, gathering, analyzing, synthesizing, and reporting
- Source Classification: Automatic categorization and reliability scoring of sources
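To see the strategy layer in action without any network access, the dispatch pattern from this section can be exercised in miniature. The sketch below is a condensed, self-contained re-statement (only `plan_searches` is reproduced, and the question string is illustrative):

```python
# Condensed sketch of strategy dispatch: a scope string selects a strategy
# object, which plans the search queries. Names mirror the classes above.

class QuickStrategy:
    def plan_searches(self, question: str) -> list[str]:
        return [question, f"{question} quick answer"]

class DeepDiveStrategy:
    def plan_searches(self, question: str) -> list[str]:
        return [f"{question} {suffix}"
                for suffix in ("research paper", "peer reviewed", "systematic review")]

def get_strategy(name: str):
    """Resolve a scope string to a strategy, defaulting to quick here."""
    strategies = {"quick": QuickStrategy(), "deep": DeepDiveStrategy()}
    return strategies.get(name, QuickStrategy())

question = "effects of caffeine on memory"
print(get_strategy("quick").plan_searches(question))
print(get_strategy("deep").plan_searches(question))
```

The same question yields two focused queries under the quick strategy and three academically slanted ones under the deep strategy, which is exactly the adaptivity the agent relies on.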
In the next section, we'll implement the web search integration that allows our agent to find relevant sources across the internet.