Chapter 11

Building Your First Agent

Testing Your Agent

Introduction

Traditional software testing assumes deterministic behavior: given input X, expect output Y. Agents are different: they make decisions, use tools dynamically, and their outputs vary. Testing agents requires new strategies that embrace this non-determinism while still ensuring quality and reliability.

This section covers testing strategies from unit tests for components to evaluation frameworks for end-to-end behavior.

Core Principle: Test behavior, not exact outputs. An agent that achieves the goal through different paths is still correct. Focus on outcomes, constraints, and safety invariants.
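For instance, an outcome-focused check asserts on what must be true of any correct run, not on one specific transcript. A minimal sketch (the result-dict keys follow the examples later in this chapter; the `tools_used` key and the forbidden-tool check are illustrative):

```python
# Sketch: assert on outcomes, constraints, and safety invariants rather
# than exact output strings.

def check_agent_result(result: dict) -> None:
    answer = result["answer"]

    # Outcome: the goal was achieved, regardless of wording or path taken
    assert "4" in answer  # e.g. the task was "What is 2 + 2?"

    # Constraint: resource limits were respected
    assert result["steps"] <= 10

    # Safety invariant: no forbidden tool was ever invoked
    assert "delete_file" not in result["tools_used"]
```

Any run that reaches the right answer within budget and without violating an invariant passes, no matter which path the agent took.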

Agent Testing Challenges

Agent testing differs from traditional testing in several ways:

| Challenge | Traditional Testing | Agent Testing |
| --- | --- | --- |
| Determinism | Same input → same output | Same input → varied outputs |
| Dependencies | Mock external services | Mock LLM responses + tools |
| Correctness | Exact output matching | Behavioral & goal checking |
| Coverage | Code paths | Decision paths + edge cases |
| Performance | Speed, memory | Token usage, API calls, steps |
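One practical consequence of the determinism row: don't assert on a single run of a non-deterministic check; run it several times and assert a pass-rate threshold. A minimal sketch (`run_once` is a placeholder for however you invoke your agent):

```python
import asyncio
from typing import Awaitable, Callable

async def success_rate(
    run_once: Callable[[], Awaitable[bool]],
    trials: int = 5,
) -> float:
    """Run a non-deterministic check several times; return the pass fraction."""
    outcomes = [await run_once() for _ in range(trials)]
    return sum(outcomes) / trials

# In a test, assert a threshold instead of exact equality, e.g.:
#     assert await success_rate(solves_math_task) >= 0.8
```

Pick thresholds and trial counts that balance flakiness against sensitivity for your workload, and tighten them as the agent stabilizes.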

Testing Pyramid for Agents

πŸ“text
1β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
2                    β”‚   E2E       β”‚  Few: Full scenarios
3                    β”‚   Tests     β”‚
4                  β”Œβ”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”
5                  β”‚   Integration   β”‚  Some: Component combos
6                  β”‚   Tests         β”‚
7                β”Œβ”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”
8                β”‚     Unit Tests      β”‚  Many: Individual parts
9                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Unit Testing Components

Test individual components in isolation with mocked dependencies:

Testing Tools

```python
import pytest
from unittest.mock import patch

# Test tool execution
@pytest.mark.asyncio
async def test_web_search_tool():
    """Test web search returns results."""
    tool = WebSearchTool(api_key="test_key")

    with patch("httpx.AsyncClient.get") as mock_get:
        mock_get.return_value.json.return_value = {
            "organic_results": [
                {"title": "Result 1", "link": "http://example.com", "snippet": "..."}
            ]
        }

        result = await tool.search("test query")

        assert "Result 1" in result
        assert "http://example.com" in result
        mock_get.assert_called_once()

@pytest.mark.asyncio
async def test_tool_handles_api_error():
    """Test tool handles API failures gracefully."""
    tool = WebSearchTool(api_key="test_key")

    with patch("httpx.AsyncClient.get") as mock_get:
        mock_get.side_effect = Exception("API Error")

        result = await tool.search("test query")

        assert "error" in result.lower()

@pytest.mark.asyncio
async def test_code_execution_timeout():
    """Test code execution respects timeout."""
    tool = CodeExecutionTool(timeout=1)

    result = await tool.execute(
        code="import time; time.sleep(10)",
        language="python"
    )

    assert "timeout" in result.lower()
```

Testing Memory

```python
@pytest.mark.asyncio
async def test_working_memory_stores_results():
    """Test working memory stores and retrieves results."""
    memory = WorkingMemory()
    memory.set_task("Test task")

    memory.add_result("step1", "result1")
    memory.add_result("step2", "result2")

    assert memory.get_result("step1") == "result1"
    assert memory.get_result("step2") == "result2"
    assert "step1" in memory.get_context_string()

@pytest.mark.asyncio
async def test_conversation_memory_summarizes():
    """Test conversation memory summarizes long history."""
    memory = ConversationMemory(summarize_threshold=5)

    for i in range(10):
        memory.add_message("user", f"Message {i}")

    messages = memory.get_messages_for_llm()

    # Should have summary + recent messages, not all 10
    assert len(messages) < 10
    assert memory.summary is not None

@pytest.mark.asyncio
async def test_semantic_memory_search():
    """Test semantic memory finds similar content."""
    # Mock embedding function: deterministic, hash-based vectors
    async def mock_embed(text: str) -> list[float]:
        import hashlib
        h = hashlib.md5(text.encode()).hexdigest()
        return [int(c, 16) / 15.0 for c in h[:10]]

    memory = SemanticMemory(embedding_fn=mock_embed)

    await memory.add("Python is a programming language")
    await memory.add("JavaScript runs in browsers")
    await memory.add("Python is great for data science")

    results = await memory.search("Python programming")

    assert len(results) > 0
    assert any("Python" in r.content for r in results)
```

Testing Error Handling

```python
def test_error_classification():
    """Test errors are classified correctly."""
    timeout_error = Exception("Connection timed out")
    auth_error = Exception("401 Unauthorized")
    generic_error = Exception("Something went wrong")

    assert classify_error(timeout_error).category == ErrorCategory.TRANSIENT
    assert classify_error(auth_error).category == ErrorCategory.FATAL
    assert classify_error(generic_error).recoverable

@pytest.mark.asyncio
async def test_circuit_breaker_opens():
    """Test circuit breaker opens after failures."""
    breaker = CircuitBreaker(failure_threshold=3)

    for _ in range(3):
        breaker.record_failure()

    assert breaker.state == "open"
    assert not breaker.can_proceed()

@pytest.mark.asyncio
async def test_retry_with_backoff():
    """Test retry eventually succeeds."""
    attempts = 0

    async def flaky_function():
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise Exception("Temporary failure")
        return "success"

    result = await retry_with_backoff(
        flaky_function,
        max_retries=3,
        base_delay=0.1
    )

    assert result == "success"
    assert attempts == 3
```

Integration Testing

Test components working together with mocked LLM responses:

Mocking LLM Responses

```python
from dataclasses import dataclass

@dataclass
class MockLLMResponse:
    """Mock LLM response."""
    content: list
    stop_reason: str = "end_turn"

    class Usage:
        input_tokens = 100
        output_tokens = 50

    usage = Usage()

class MockAnthropicClient:
    """Mock Anthropic client for testing."""

    def __init__(self, responses: list[dict]):
        self.responses = responses
        self.call_count = 0
        self.calls: list[dict] = []

    def create(self, **kwargs) -> MockLLMResponse:
        """Return next mock response."""
        self.calls.append(kwargs)

        if self.call_count >= len(self.responses):
            # Default final response
            return MockLLMResponse(
                content=[type("Block", (), {"type": "text", "text": "Done"})()],
                stop_reason="end_turn"
            )

        response_data = self.responses[self.call_count]
        self.call_count += 1

        content = []
        if "tool_use" in response_data:
            content.append(type("Block", (), {
                "type": "tool_use",
                "id": response_data["tool_use"]["id"],
                "name": response_data["tool_use"]["name"],
                "input": response_data["tool_use"]["input"]
            })())
        if "text" in response_data:
            content.append(type("Block", (), {
                "type": "text",
                "text": response_data["text"]
            })())

        return MockLLMResponse(
            content=content,
            stop_reason=response_data.get("stop_reason", "end_turn")
        )


@pytest.fixture
def mock_client():
    """Fixture for mock LLM client."""
    def create_mock(responses: list[dict]):
        return MockAnthropicClient(responses)
    return create_mock
```

Testing Agent Flows

```python
@pytest.mark.asyncio
async def test_agent_uses_tool_correctly(mock_client):
    """Test agent calls tool and uses result."""

    # Mock responses: first use tool, then give final answer
    responses = [
        {
            "tool_use": {
                "id": "call_1",
                "name": "calculate",
                "input": {"expression": "2 + 2"}
            }
        },
        {
            "text": "The result is 4",
            "stop_reason": "end_turn"
        }
    ]

    client = mock_client(responses)

    # Create agent with mock
    agent = Agent(config=AgentConfig(max_steps=5))
    agent.client = type("Client", (), {"messages": client})()
    agent.register_tool("calculate", lambda expression: eval(expression), {
        "description": "Calculate",
        "input_schema": {"type": "object", "properties": {"expression": {"type": "string"}}}
    })

    result = await agent.run("What is 2 + 2?")

    assert result["success"]
    assert "4" in result["answer"]
    assert client.call_count == 2

@pytest.mark.asyncio
async def test_agent_respects_max_steps(mock_client):
    """Test agent stops at max steps."""

    # Mock infinite tool calls
    responses = [
        {"tool_use": {"id": f"call_{i}", "name": "search", "input": {"query": "test"}}}
        for i in range(100)
    ]

    client = mock_client(responses)
    agent = Agent(config=AgentConfig(max_steps=5))
    agent.client = type("Client", (), {"messages": client})()
    agent.register_tool("search", lambda query: "results", {"description": "Search", "input_schema": {}})

    result = await agent.run("Search forever")

    assert result["status"] == "max_steps"
    assert result["steps"] <= 5

@pytest.mark.asyncio
async def test_agent_recovers_from_tool_error(mock_client):
    """Test agent handles tool errors gracefully."""

    responses = [
        {"tool_use": {"id": "call_1", "name": "broken_tool", "input": {}}},
        {"text": "I encountered an error but here's what I can tell you...", "stop_reason": "end_turn"}
    ]

    client = mock_client(responses)
    agent = Agent(config=AgentConfig(max_steps=5))
    agent.client = type("Client", (), {"messages": client})()
    # A tool that always raises when called
    agent.register_tool(
        "broken_tool",
        lambda: (_ for _ in ()).throw(Exception("Broken")),
        {"description": "Broken", "input_schema": {}}
    )

    result = await agent.run("Use the broken tool")

    assert result["success"]  # Agent should still complete
```

Evaluation Metrics

Beyond pass/fail tests, measure agent quality with metrics:

Key Metrics

| Metric | Description | Target |
| --- | --- | --- |
| Task Success Rate | % of tasks completed correctly | >90% |
| Average Steps | Steps to complete tasks | Minimize |
| Tool Accuracy | Correct tool selection rate | >95% |
| Token Efficiency | Tokens per successful task | Minimize |
| Error Recovery Rate | % of errors recovered from | >80% |
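The evaluation framework below reports success rate, steps, and tokens; tool accuracy and error recovery have to be derived from per-step logs instead. A sketch under assumed log shapes (both record formats are illustrative, not part of the framework in this chapter):

```python
def tool_accuracy(steps: list[dict], expected: dict[str, str]) -> float:
    """Fraction of steps where the chosen tool matches the expected one.

    Assumed shapes: each step is {"task": ..., "tool": ...}; `expected`
    maps task -> correct tool.
    """
    judged = [s for s in steps if s["task"] in expected]
    if not judged:
        return 0.0
    correct = sum(1 for s in judged if s["tool"] == expected[s["task"]])
    return correct / len(judged)

def error_recovery_rate(events: list[dict]) -> float:
    """Fraction of errors followed by a recovery, given events like
    {"error": True, "recovered": True}."""
    errors = [e for e in events if e.get("error")]
    if not errors:
        return 1.0  # no errors means nothing failed to recover
    return sum(1 for e in errors if e.get("recovered")) / len(errors)
```

Whatever shape your logs take, the point is the same: these two metrics come from inspecting individual decisions, not final answers.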

Evaluation Framework

```python
from dataclasses import dataclass
from typing import Callable
import statistics
import time

@dataclass
class EvalCase:
    """A single evaluation case."""
    name: str
    task: str
    expected_outcome: Callable[[str], bool]  # Validates answer
    expected_tools: list[str] | None = None  # Tools that should be used
    max_steps: int | None = None
    timeout_seconds: float = 60.0

@dataclass
class EvalResult:
    """Result of evaluating a case."""
    case_name: str
    success: bool
    steps: int
    tokens_used: int
    duration_seconds: float
    tools_used: list[str]
    error: str | None = None

class AgentEvaluator:
    """Evaluate agent performance."""

    def __init__(self, agent: Agent):
        self.agent = agent
        self.results: list[EvalResult] = []

    async def evaluate(self, cases: list[EvalCase]) -> dict:
        """Run all evaluation cases."""
        for case in cases:
            result = await self._evaluate_case(case)
            self.results.append(result)

        return self._compute_metrics()

    async def _evaluate_case(self, case: EvalCase) -> EvalResult:
        """Evaluate a single case."""
        start = time.time()

        try:
            result = await self.agent.run(case.task)

            # Check outcome
            success = case.expected_outcome(result.get("answer", ""))

            # Check tool usage if specified
            if case.expected_tools:
                tools_used = list(result.get("tool_results", {}).keys())
                tools_correct = all(t in tools_used for t in case.expected_tools)
                success = success and tools_correct

            # Check step limit
            if case.max_steps and result["steps"] > case.max_steps:
                success = False

            return EvalResult(
                case_name=case.name,
                success=success,
                steps=result["steps"],
                tokens_used=result.get("tokens_used", 0),
                duration_seconds=time.time() - start,
                tools_used=list(result.get("tool_results", {}).keys())
            )

        except Exception as e:
            return EvalResult(
                case_name=case.name,
                success=False,
                steps=0,
                tokens_used=0,
                duration_seconds=time.time() - start,
                tools_used=[],
                error=str(e)
            )

    def _compute_metrics(self) -> dict:
        """Compute aggregate metrics."""
        if not self.results:
            return {}

        successes = [r for r in self.results if r.success]
        failures = [r for r in self.results if not r.success]
        # Guard against an empty mean when no run reported token usage
        token_counts = [r.tokens_used for r in self.results if r.tokens_used > 0]

        return {
            "total_cases": len(self.results),
            "success_rate": len(successes) / len(self.results),
            "avg_steps": statistics.mean(r.steps for r in self.results),
            "avg_tokens": statistics.mean(token_counts) if token_counts else 0,
            "avg_duration": statistics.mean(r.duration_seconds for r in self.results),
            "failures": [{"case": r.case_name, "error": r.error} for r in failures]
        }
```

Example Evaluation Suite

```python
# Define evaluation cases
eval_cases = [
    EvalCase(
        name="simple_calculation",
        task="What is 15% of 200?",
        expected_outcome=lambda ans: "30" in ans,
        expected_tools=["calculate"],
        max_steps=3
    ),
    EvalCase(
        name="web_search",
        task="What is the capital of France?",
        expected_outcome=lambda ans: "Paris" in ans,
        expected_tools=["web_search"],
        max_steps=5
    ),
    EvalCase(
        name="multi_step",
        task="Search for Python tutorials and summarize the top 3",
        expected_outcome=lambda ans: len(ans) > 100,
        expected_tools=["web_search"],
        max_steps=10
    ),
]

# Run evaluation
async def run_evaluation():
    agent = Agent(config=AgentConfig())
    # ... register tools ...

    evaluator = AgentEvaluator(agent)
    metrics = await evaluator.evaluate(eval_cases)

    print(f"Success Rate: {metrics['success_rate']:.1%}")
    print(f"Avg Steps: {metrics['avg_steps']:.1f}")
    print(f"Avg Tokens: {metrics['avg_tokens']:.0f}")

    if metrics["failures"]:
        print("\nFailures:")
        for f in metrics["failures"]:
            print(f"  - {f['case']}: {f['error']}")
```
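Beyond a single run, it helps to track these metrics over time and fail CI when they regress. A hedged sketch of a baseline gate (the file-based approach and the `check_regression` name are illustrative):

```python
import json
from pathlib import Path

def check_regression(metrics: dict, baseline_path: str, tolerance: float = 0.05) -> bool:
    """Compare current metrics against a saved baseline.

    Returns True if success_rate has not dropped more than `tolerance`
    below the baseline; writes a new baseline file if none exists yet.
    """
    path = Path(baseline_path)
    if not path.exists():
        path.write_text(json.dumps(metrics))
        return True

    baseline = json.loads(path.read_text())
    return metrics["success_rate"] >= baseline["success_rate"] - tolerance
```

A tolerance band keeps normal run-to-run variance from failing the build while still catching genuine regressions.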

Complete Test Framework

```python
"""
Complete testing framework for agents.
"""

import os

import pytest

# ============== Test Fixtures ==============

@pytest.fixture
def mock_tools():
    """Fixture providing mock tools."""
    return {
        "calculate": lambda expression: str(eval(expression)),
        "search": lambda query: f"Results for: {query}",
        "read_file": lambda path: f"Contents of {path}",
    }

@pytest.fixture
def agent_with_mocks(mock_tools):
    """Fixture providing agent with mocked dependencies."""
    agent = Agent(config=AgentConfig(max_steps=10))

    for name, fn in mock_tools.items():
        agent.register_tool(name, fn, {
            "description": f"Mock {name}",
            "input_schema": {"type": "object", "properties": {}}
        })

    return agent

# ============== Test Markers ==============

# Mark slow tests
slow = pytest.mark.slow

# Mark tests requiring real API
requires_api = pytest.mark.skipif(
    not os.getenv("ANTHROPIC_API_KEY"),
    reason="Requires ANTHROPIC_API_KEY"
)

# ============== Test Classes ==============

class TestAgentCore:
    """Tests for core agent functionality."""

    @pytest.mark.asyncio
    async def test_agent_initializes(self, agent_with_mocks):
        assert agent_with_mocks is not None
        assert len(agent_with_mocks.tools) == 3

    @pytest.mark.asyncio
    async def test_agent_completes_simple_task(self, agent_with_mocks, mock_client):
        responses = [
            {"text": "The answer is 42", "stop_reason": "end_turn"}
        ]
        agent_with_mocks.client = type("C", (), {"messages": mock_client(responses)})()

        result = await agent_with_mocks.run("Simple question")

        assert result["success"]


class TestAgentTools:
    """Tests for tool functionality."""

    @pytest.mark.asyncio
    async def test_tool_execution(self, mock_tools):
        result = mock_tools["calculate"]("2 + 2")
        assert result == "4"

    @pytest.mark.asyncio
    async def test_tool_error_handling(self):
        def bad_tool():
            raise ValueError("Tool error")

        tool = Tool(
            name="bad",
            description="Bad tool",
            parameters=[],
            function=bad_tool
        )

        result = await tool.execute()
        assert "Error" in result


class TestAgentMemory:
    """Tests for memory functionality."""

    @pytest.mark.asyncio
    async def test_memory_persistence(self):
        memory = AgentMemorySystem()
        memory.working.set_task("Test")
        memory.working.add_result("key", "value")

        assert memory.working.get_result("key") == "value"


class TestAgentEvaluation:
    """Evaluation tests."""

    @slow
    @requires_api
    @pytest.mark.asyncio
    async def test_full_evaluation(self):
        """Full evaluation with real API (slow)."""
        agent = Agent()
        # ... setup ...

        evaluator = AgentEvaluator(agent)
        metrics = await evaluator.evaluate(eval_cases)

        assert metrics["success_rate"] >= 0.8


# ============== Run Tests ==============

if __name__ == "__main__":
    pytest.main([__file__, "-v", "--asyncio-mode=auto"])
```
Testing Cadence: Run fast unit tests on every commit, integration tests on PRs, and full evaluations nightly or before releases.
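If you adopt the `slow` and `requires_api` markers above, register them (e.g. in a `conftest.py`) so you can filter by tier with `pytest -m "not slow"` and avoid unknown-marker warnings. A minimal sketch:

```python
# conftest.py -- register the custom markers used by the test framework
# so pytest recognizes them and `-m` filtering works cleanly.

def pytest_configure(config):
    config.addinivalue_line(
        "markers", "slow: long-running tests (deselect with -m 'not slow')"
    )
    config.addinivalue_line(
        "markers", "requires_api: tests that need a real ANTHROPIC_API_KEY"
    )
```

With this in place, the commit-time tier is just `pytest -m "not slow"`, while the nightly run drops the filter.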

Chapter Summary

Congratulations! You've built your first complete agent. In this chapter, we covered:

  • Design decisions: When to build agents, architecture choices, and model selection
  • Core loop: The think-act-observe-update cycle that powers agents
  • Tools: Building, composing, and safely executing tools
  • Memory: Working, conversation, and semantic memory systems
  • Error handling: Retry, circuit breakers, recovery, and graceful degradation
  • Testing: Unit tests, integration tests, and evaluation metrics

Chapter Complete: You now have all the pieces to build production-ready agents. In the next chapters, we'll build specialized agents (a coding agent and a research agent), applying these patterns to real-world use cases.

The next chapter dives into Building a Coding Agent: an agent that can write, execute, and debug code.