Introduction
Traditional software testing assumes deterministic behavior: given input X, expect output Y. Agents are different: they make decisions, use tools dynamically, and their outputs vary. Testing agents requires new strategies that embrace this non-determinism while still ensuring quality and reliability.
This section covers testing strategies from unit tests for components to evaluation frameworks for end-to-end behavior.
Core Principle: Test behavior, not exact outputs. An agent that achieves the goal through different paths is still correct. Focus on outcomes, constraints, and safety invariants.
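As a sketch of this principle, a behavioral check validates properties of the result rather than matching an exact string. The `result` dict shape below is illustrative, not from any specific agent implementation:

```python
def check_refund_outcome(result: dict) -> bool:
    """Pass if the goal was reached and constraints held, regardless of
    the exact wording or the path the agent took."""
    answer = result.get("answer", "").lower()
    goal_met = result.get("success", False) and "refund" in answer
    within_budget = result.get("steps", 0) <= 10                     # constraint
    safe = "delete_account" not in result.get("tools_used", [])      # safety invariant
    return goal_met and within_budget and safe

# Two differently-worded answers, same verdict
assert check_refund_outcome({"answer": "Your refund has been issued.",
                             "success": True, "steps": 4,
                             "tools_used": ["lookup_order"]})
assert check_refund_outcome({"answer": "I processed the refund for your order.",
                             "success": True, "steps": 7,
                             "tools_used": ["lookup_order", "issue_refund"]})
```

Any agent whose output satisfies these predicates passes, no matter which path it took to get there.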
Agent Testing Challenges
Agent testing differs from traditional testing in several ways:
| Challenge | Traditional Testing | Agent Testing |
|---|---|---|
| Determinism | Same input → same output | Same input → varied outputs |
| Dependencies | Mock external services | Mock LLM responses + tools |
| Correctness | Exact output matching | Behavioral & goal checking |
| Coverage | Code paths | Decision paths + edge cases |
| Performance | Speed, memory | Token usage, API calls, steps |
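The first row has a practical consequence: a single passing run proves little. A common pattern, sketched here with a stand-in for the real agent call, is to run the same task several times and assert a success-rate floor instead of per-run equality:

```python
import random

def run_agent(task: str) -> str:
    """Stand-in for a real (non-deterministic) agent call; a real agent
    would phrase its answer differently on each run."""
    return random.choice([
        "The capital of France is Paris.",
        "Paris is France's capital city.",
    ])

def success_rate(task: str, check, n: int = 20) -> float:
    """Fraction of n runs whose output passes the behavioral check."""
    return sum(check(run_agent(task)) for _ in range(n)) / n

rate = success_rate("What is the capital of France?", lambda ans: "Paris" in ans)
assert rate >= 0.8  # tolerate occasional misses, fail on systematic errors
```

The threshold trades flakiness against sensitivity: values near 1.0 catch regressions earlier but make the test suite flakier.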
Testing Pyramid for Agents
```
    ┌───────────────┐
    │      E2E      │   Few: Full scenarios
    │     Tests     │
  ┌─┴───────────────┴─┐
  │    Integration    │   Some: Component combos
  │       Tests       │
┌─┴───────────────────┴─┐
│      Unit Tests       │   Many: Individual parts
└───────────────────────┘
```

Unit Testing Components
Test individual components in isolation with mocked dependencies:
Testing Tools
```python
import pytest
from unittest.mock import AsyncMock, patch

# Test tool execution
@pytest.mark.asyncio
async def test_web_search_tool():
    """Test web search returns results."""
    tool = WebSearchTool(api_key="test_key")

    with patch("httpx.AsyncClient.get", new_callable=AsyncMock) as mock_get:
        mock_get.return_value.json.return_value = {
            "organic_results": [
                {"title": "Result 1", "link": "http://example.com", "snippet": "..."}
            ]
        }

        result = await tool.search("test query")

        assert "Result 1" in result
        assert "http://example.com" in result
        mock_get.assert_called_once()

@pytest.mark.asyncio
async def test_tool_handles_api_error():
    """Test tool handles API failures gracefully."""
    tool = WebSearchTool(api_key="test_key")

    with patch("httpx.AsyncClient.get", new_callable=AsyncMock) as mock_get:
        mock_get.side_effect = Exception("API Error")

        result = await tool.search("test query")

        assert "error" in result.lower()

@pytest.mark.asyncio
async def test_code_execution_timeout():
    """Test code execution respects timeout."""
    tool = CodeExecutionTool(timeout=1)

    result = await tool.execute(
        code="import time; time.sleep(10)",
        language="python"
    )

    assert "timeout" in result.lower()
```

Testing Memory
```python
@pytest.mark.asyncio
async def test_working_memory_stores_results():
    """Test working memory stores and retrieves results."""
    memory = WorkingMemory()
    memory.set_task("Test task")

    memory.add_result("step1", "result1")
    memory.add_result("step2", "result2")

    assert memory.get_result("step1") == "result1"
    assert memory.get_result("step2") == "result2"
    assert "step1" in memory.get_context_string()

@pytest.mark.asyncio
async def test_conversation_memory_summarizes():
    """Test conversation memory summarizes long history."""
    memory = ConversationMemory(summarize_threshold=5)

    for i in range(10):
        memory.add_message("user", f"Message {i}")

    messages = memory.get_messages_for_llm()

    # Should have summary + recent messages, not all 10
    assert len(messages) < 10
    assert memory.summary is not None

@pytest.mark.asyncio
async def test_semantic_memory_search():
    """Test semantic memory finds similar content."""
    # Mock embedding function
    async def mock_embed(text: str) -> list[float]:
        # Simple mock: hash-based embedding
        import hashlib
        h = hashlib.md5(text.encode()).hexdigest()
        return [int(c, 16) / 15.0 for c in h[:10]]

    memory = SemanticMemory(embedding_fn=mock_embed)

    await memory.add("Python is a programming language")
    await memory.add("JavaScript runs in browsers")
    await memory.add("Python is great for data science")

    results = await memory.search("Python programming")

    assert len(results) > 0
    assert any("Python" in r.content for r in results)
```

Testing Error Handling
```python
def test_error_classification():
    """Test errors are classified correctly."""
    timeout_error = Exception("Connection timed out")
    auth_error = Exception("401 Unauthorized")
    generic_error = Exception("Something went wrong")

    assert classify_error(timeout_error).category == ErrorCategory.TRANSIENT
    assert classify_error(auth_error).category == ErrorCategory.FATAL
    assert classify_error(generic_error).recoverable

@pytest.mark.asyncio
async def test_circuit_breaker_opens():
    """Test circuit breaker opens after failures."""
    breaker = CircuitBreaker(failure_threshold=3)

    for _ in range(3):
        breaker.record_failure()

    assert breaker.state == "open"
    assert not breaker.can_proceed()

@pytest.mark.asyncio
async def test_retry_with_backoff():
    """Test retry eventually succeeds."""
    attempts = 0

    async def flaky_function():
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise Exception("Temporary failure")
        return "success"

    result = await retry_with_backoff(
        flaky_function,
        max_retries=3,
        base_delay=0.1
    )

    assert result == "success"
    assert attempts == 3
```

Integration Testing
Test components working together with mocked LLM responses:
Mocking LLM Responses
```python
import pytest
from dataclasses import dataclass

@dataclass
class MockLLMResponse:
    """Mock LLM response."""
    content: list
    stop_reason: str = "end_turn"

    class Usage:
        input_tokens = 100
        output_tokens = 50

    usage = Usage()

class MockAnthropicClient:
    """Mock Anthropic client for testing."""

    def __init__(self, responses: list[dict]):
        self.responses = responses
        self.call_count = 0
        self.calls: list[dict] = []

    def create(self, **kwargs) -> MockLLMResponse:
        """Return next mock response."""
        self.calls.append(kwargs)

        if self.call_count >= len(self.responses):
            # Default final response
            return MockLLMResponse(
                content=[type("Block", (), {"type": "text", "text": "Done"})()],
                stop_reason="end_turn"
            )

        response_data = self.responses[self.call_count]
        self.call_count += 1

        content = []
        if "tool_use" in response_data:
            content.append(type("Block", (), {
                "type": "tool_use",
                "id": response_data["tool_use"]["id"],
                "name": response_data["tool_use"]["name"],
                "input": response_data["tool_use"]["input"]
            })())
        if "text" in response_data:
            content.append(type("Block", (), {
                "type": "text",
                "text": response_data["text"]
            })())

        return MockLLMResponse(
            content=content,
            stop_reason=response_data.get("stop_reason", "end_turn")
        )


@pytest.fixture
def mock_client():
    """Fixture for mock LLM client."""
    def create_mock(responses: list[dict]):
        return MockAnthropicClient(responses)
    return create_mock
```

Testing Agent Flows
```python
@pytest.mark.asyncio
async def test_agent_uses_tool_correctly(mock_client):
    """Test agent calls tool and uses result."""

    # Mock responses: first use tool, then give final answer
    responses = [
        {
            "tool_use": {
                "id": "call_1",
                "name": "calculate",
                "input": {"expression": "2 + 2"}
            }
        },
        {
            "text": "The result is 4",
            "stop_reason": "end_turn"
        }
    ]

    client = mock_client(responses)

    # Create agent with mock
    agent = Agent(config=AgentConfig(max_steps=5))
    agent.client = type("Client", (), {"messages": client})()
    agent.register_tool("calculate", lambda expression: eval(expression), {
        "description": "Calculate",
        "input_schema": {"type": "object", "properties": {"expression": {"type": "string"}}}
    })

    result = await agent.run("What is 2 + 2?")

    assert result["success"]
    assert "4" in result["answer"]
    assert client.call_count == 2

@pytest.mark.asyncio
async def test_agent_respects_max_steps(mock_client):
    """Test agent stops at max steps."""

    # Mock infinite tool calls
    responses = [
        {"tool_use": {"id": f"call_{i}", "name": "search", "input": {"query": "test"}}}
        for i in range(100)
    ]

    client = mock_client(responses)
    agent = Agent(config=AgentConfig(max_steps=5))
    agent.client = type("Client", (), {"messages": client})()
    agent.register_tool("search", lambda query: "results", {"description": "Search", "input_schema": {}})

    result = await agent.run("Search forever")

    assert result["status"] == "max_steps"
    assert result["steps"] <= 5

@pytest.mark.asyncio
async def test_agent_recovers_from_tool_error(mock_client):
    """Test agent handles tool errors gracefully."""

    responses = [
        {"tool_use": {"id": "call_1", "name": "broken_tool", "input": {}}},
        {"text": "I encountered an error but here's what I can tell you...", "stop_reason": "end_turn"}
    ]

    client = mock_client(responses)
    agent = Agent(config=AgentConfig(max_steps=5))
    agent.client = type("Client", (), {"messages": client})()
    agent.register_tool("broken_tool", lambda: (_ for _ in ()).throw(Exception("Broken")), {"description": "Broken", "input_schema": {}})

    result = await agent.run("Use the broken tool")

    assert result["success"]  # Agent should still complete
```

Evaluation Metrics
Beyond pass/fail tests, measure agent quality with metrics:
Key Metrics
| Metric | Description | Target |
|---|---|---|
| Task Success Rate | % of tasks completed correctly | >90% |
| Average Steps | Steps to complete tasks | Minimize |
| Tool Accuracy | Correct tool selection rate | >95% |
| Token Efficiency | Tokens per successful task | Minimize |
| Error Recovery Rate | % of errors recovered from | >80% |
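These targets can be enforced directly in CI as threshold assertions over a batch of eval results. A minimal sketch, assuming each result is a dict with illustrative `success` and `tool_correct` fields:

```python
def assert_quality_bar(results: list[dict]) -> None:
    """Fail the build if aggregate metrics fall below target thresholds."""
    n = len(results)
    success_rate = sum(r["success"] for r in results) / n
    tool_accuracy = sum(r["tool_correct"] for r in results) / n
    assert success_rate > 0.90, f"success rate {success_rate:.0%} is below the 90% target"
    assert tool_accuracy > 0.95, f"tool accuracy {tool_accuracy:.0%} is below the 95% target"

# Example batch: 99 successes out of 100, every tool choice correct
batch = [{"success": True, "tool_correct": True}] * 99 + \
        [{"success": False, "tool_correct": True}]
assert_quality_bar(batch)  # passes: 99% > 90%, 100% > 95%
```

Run against a fixed eval set on every merge, this turns the table above from guidance into a regression gate.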
Evaluation Framework
```python
from dataclasses import dataclass
from typing import Callable
import statistics
import time

@dataclass
class EvalCase:
    """A single evaluation case."""
    name: str
    task: str
    expected_outcome: Callable[[str], bool]   # Validates answer
    expected_tools: list[str] | None = None   # Tools that should be used
    max_steps: int | None = None
    timeout_seconds: float = 60.0

@dataclass
class EvalResult:
    """Result of evaluating a case."""
    case_name: str
    success: bool
    steps: int
    tokens_used: int
    duration_seconds: float
    tools_used: list[str]
    error: str | None = None

class AgentEvaluator:
    """Evaluate agent performance."""

    def __init__(self, agent: Agent):
        self.agent = agent
        self.results: list[EvalResult] = []

    async def evaluate(self, cases: list[EvalCase]) -> dict:
        """Run all evaluation cases."""

        for case in cases:
            result = await self._evaluate_case(case)
            self.results.append(result)

        return self._compute_metrics()

    async def _evaluate_case(self, case: EvalCase) -> EvalResult:
        """Evaluate a single case."""
        start = time.time()

        try:
            result = await self.agent.run(case.task)

            # Check outcome
            success = case.expected_outcome(result.get("answer", ""))

            # Check tool usage if specified
            if case.expected_tools:
                tools_used = list(result.get("tool_results", {}).keys())
                tools_correct = all(t in tools_used for t in case.expected_tools)
                success = success and tools_correct

            # Check step limit
            if case.max_steps and result["steps"] > case.max_steps:
                success = False

            return EvalResult(
                case_name=case.name,
                success=success,
                steps=result["steps"],
                tokens_used=result.get("tokens_used", 0),
                duration_seconds=time.time() - start,
                tools_used=list(result.get("tool_results", {}).keys())
            )

        except Exception as e:
            return EvalResult(
                case_name=case.name,
                success=False,
                steps=0,
                tokens_used=0,
                duration_seconds=time.time() - start,
                tools_used=[],
                error=str(e)
            )

    def _compute_metrics(self) -> dict:
        """Compute aggregate metrics."""
        if not self.results:
            return {}

        successes = [r for r in self.results if r.success]
        failures = [r for r in self.results if not r.success]
        token_counts = [r.tokens_used for r in self.results if r.tokens_used > 0]

        return {
            "total_cases": len(self.results),
            "success_rate": len(successes) / len(self.results),
            "avg_steps": statistics.mean(r.steps for r in self.results),
            "avg_tokens": statistics.mean(token_counts) if token_counts else 0,
            "avg_duration": statistics.mean(r.duration_seconds for r in self.results),
            "failures": [{"case": r.case_name, "error": r.error} for r in failures]
        }
```

Example Evaluation Suite
```python
# Define evaluation cases
eval_cases = [
    EvalCase(
        name="simple_calculation",
        task="What is 15% of 200?",
        expected_outcome=lambda ans: "30" in ans,
        expected_tools=["calculate"],
        max_steps=3
    ),
    EvalCase(
        name="web_search",
        task="What is the capital of France?",
        expected_outcome=lambda ans: "Paris" in ans,
        expected_tools=["web_search"],
        max_steps=5
    ),
    EvalCase(
        name="multi_step",
        task="Search for Python tutorials and summarize the top 3",
        expected_outcome=lambda ans: len(ans) > 100,
        expected_tools=["web_search"],
        max_steps=10
    ),
]

# Run evaluation
async def run_evaluation():
    agent = Agent(config=AgentConfig())
    # ... register tools ...

    evaluator = AgentEvaluator(agent)
    metrics = await evaluator.evaluate(eval_cases)

    print(f"Success Rate: {metrics['success_rate']:.1%}")
    print(f"Avg Steps: {metrics['avg_steps']:.1f}")
    print(f"Avg Tokens: {metrics['avg_tokens']:.0f}")

    if metrics["failures"]:
        print("\nFailures:")
        for f in metrics["failures"]:
            print(f"  - {f['case']}: {f['error']}")
```

Complete Test Framework
```python
"""
Complete testing framework for agents.
"""

import os

import pytest

# ============== Test Fixtures ==============

@pytest.fixture
def mock_tools():
    """Fixture providing mock tools."""
    return {
        "calculate": lambda expression: str(eval(expression)),
        "search": lambda query: f"Results for: {query}",
        "read_file": lambda path: f"Contents of {path}",
    }

@pytest.fixture
def agent_with_mocks(mock_tools):
    """Fixture providing agent with mocked dependencies."""
    agent = Agent(config=AgentConfig(max_steps=10))

    for name, fn in mock_tools.items():
        agent.register_tool(name, fn, {
            "description": f"Mock {name}",
            "input_schema": {"type": "object", "properties": {}}
        })

    return agent

# ============== Test Markers ==============

# Mark slow tests
slow = pytest.mark.slow

# Mark tests requiring real API
requires_api = pytest.mark.skipif(
    not os.getenv("ANTHROPIC_API_KEY"),
    reason="Requires ANTHROPIC_API_KEY"
)

# ============== Test Classes ==============

class TestAgentCore:
    """Tests for core agent functionality."""

    @pytest.mark.asyncio
    async def test_agent_initializes(self, agent_with_mocks):
        assert agent_with_mocks is not None
        assert len(agent_with_mocks.tools) == 3

    @pytest.mark.asyncio
    async def test_agent_completes_simple_task(self, agent_with_mocks, mock_client):
        responses = [
            {"text": "The answer is 42", "stop_reason": "end_turn"}
        ]
        agent_with_mocks.client = type("C", (), {"messages": mock_client(responses)})()

        result = await agent_with_mocks.run("Simple question")

        assert result["success"]


class TestAgentTools:
    """Tests for tool functionality."""

    @pytest.mark.asyncio
    async def test_tool_execution(self, mock_tools):
        result = mock_tools["calculate"]("2 + 2")
        assert result == "4"

    @pytest.mark.asyncio
    async def test_tool_error_handling(self):
        def bad_tool():
            raise ValueError("Tool error")

        tool = Tool(
            name="bad",
            description="Bad tool",
            parameters=[],
            function=bad_tool
        )

        result = await tool.execute()
        assert "Error" in result


class TestAgentMemory:
    """Tests for memory functionality."""

    @pytest.mark.asyncio
    async def test_memory_persistence(self):
        memory = AgentMemorySystem()
        memory.working.set_task("Test")
        memory.working.add_result("key", "value")

        assert memory.working.get_result("key") == "value"


class TestAgentEvaluation:
    """Evaluation tests."""

    @slow
    @requires_api
    @pytest.mark.asyncio
    async def test_full_evaluation(self):
        """Full evaluation with real API (slow)."""
        agent = Agent()
        # ... setup ...

        evaluator = AgentEvaluator(agent)
        metrics = await evaluator.evaluate(eval_cases)

        assert metrics["success_rate"] >= 0.8


# ============== Run Tests ==============

if __name__ == "__main__":
    pytest.main([__file__, "-v", "--asyncio-mode=auto"])
```

Chapter Summary
Congratulations! You've built your first complete agent. In this chapter, we covered:
- Design decisions: When to build agents, architecture choices, and model selection
- Core loop: The think-act-observe-update cycle that powers agents
- Tools: Building, composing, and safely executing tools
- Memory: Working, conversation, and semantic memory systems
- Error handling: Retry, circuit breakers, recovery, and graceful degradation
- Testing: Unit tests, integration tests, and evaluation metrics
Chapter Complete: You now have all the pieces to build production-ready agents. In the next chapters, we'll build specialized agents (a coding agent and a research agent), applying these patterns to real-world use cases.
The next chapter dives into Building a Coding Agent: an agent that can write, execute, and debug code.