Introduction
Effective evaluation is the foundation of building reliable AI agents. Without proper metrics, you cannot measure progress, identify regressions, or compare different approaches. This section introduces the key metrics categories for evaluating agentic systems and shows you how to implement them.
Why Metrics Matter: What gets measured gets improved. Clear, well-defined metrics enable data-driven decisions about agent development, deployment, and optimization.
Agent evaluation differs from traditional ML evaluation in several ways: agents perform multi-step tasks, interact with external systems, and must balance multiple objectives simultaneously. This requires a comprehensive metrics framework that captures these complexities.
Metric Categories
Agent metrics fall into five main categories, each capturing different aspects of agent behavior and performance:
| Category | Focus | Examples |
|---|---|---|
| Task Completion | Did the agent achieve the goal? | Success rate, partial completion |
| Quality | How good was the output? | Accuracy, relevance, coherence |
| Efficiency | How well did it use resources? | Latency, token usage, cost |
| Safety | Did it operate within bounds? | Guardrail triggers, violations |
| User Experience | How satisfied are users? | NPS, completion time, retries |
1"""
2Core metrics framework for AI agent evaluation.
3
4This module defines the foundational classes and interfaces
5for implementing agent evaluation metrics.
6"""
7
8from abc import ABC, abstractmethod
9from dataclasses import dataclass, field
10from datetime import datetime
11from enum import Enum
12from typing import Any, Dict, List, Optional, Generic, TypeVar
13import json
14
15
16class MetricCategory(Enum):
17 """Categories of evaluation metrics."""
18 TASK_COMPLETION = "task_completion"
19 QUALITY = "quality"
20 EFFICIENCY = "efficiency"
21 SAFETY = "safety"
22 USER_EXPERIENCE = "user_experience"
23
24
25@dataclass
26class MetricValue:
27 """A single metric measurement."""
28 name: str
29 value: float
30 category: MetricCategory
31 timestamp: datetime = field(default_factory=datetime.utcnow)
32 metadata: Dict[str, Any] = field(default_factory=dict)
33
34 def to_dict(self) -> Dict[str, Any]:
35 return {
36 "name": self.name,
37 "value": self.value,
38 "category": self.category.value,
39 "timestamp": self.timestamp.isoformat(),
40 "metadata": self.metadata
41 }
42
43
44T = TypeVar("T")
45
46
47class Metric(ABC, Generic[T]):
48 """Abstract base class for evaluation metrics."""
49
50 def __init__(self, name: str, category: MetricCategory):
51 self.name = name
52 self.category = category
53 self.values: List[MetricValue] = []
54
55 @abstractmethod
56 def compute(self, **kwargs) -> T:
57 """Compute the metric value."""
58 pass
59
60 def record(self, value: T, metadata: Optional[Dict[str, Any]] = None):
61 """Record a metric measurement."""
62 metric_value = MetricValue(
63 name=self.name,
64 value=float(value),
65 category=self.category,
66 metadata=metadata or {}
67 )
68 self.values.append(metric_value)
69 return metric_value
70
71 def get_latest(self) -> Optional[MetricValue]:
72 """Get the most recent measurement."""
73 return self.values[-1] if self.values else None
74
75 def get_average(self, window: Optional[int] = None) -> float:
76 """Get average value over recent measurements."""
77 if not self.values:
78 return 0.0
79
80 subset = self.values[-window:] if window else self.values
81 return sum(v.value for v in subset) / len(subset)
82
83
84@dataclass
85class EvaluationResult:
86 """Complete evaluation result for an agent task."""
87 task_id: str
88 agent_id: str
89 metrics: Dict[str, MetricValue]
90 timestamp: datetime = field(default_factory=datetime.utcnow)
91
92 def get_score(self, category: Optional[MetricCategory] = None) -> float:
93 """Calculate aggregate score, optionally filtered by category."""
94 relevant = [
95 m for m in self.metrics.values()
96 if category is None or m.category == category
97 ]
98
99 if not relevant:
100 return 0.0
101
102 return sum(m.value for m in relevant) / len(relevant)
103
104 def to_dict(self) -> Dict[str, Any]:
105 return {
106 "task_id": self.task_id,
107 "agent_id": self.agent_id,
108 "metrics": {k: v.to_dict() for k, v in self.metrics.items()},
109 "timestamp": self.timestamp.isoformat(),
110 "overall_score": self.get_score()
111 }Task Completion Metrics
Task completion metrics measure whether the agent successfully achieved its goal. These are the most fundamental metrics for any agent system:
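The simplest of these is a running success rate. Before looking at the full classes, here is the bookkeeping involved, worked on a short illustrative stream of outcomes:

```python
# Running success rate over a stream of task outcomes
# (the same bookkeeping SuccessRateMetric below performs incrementally).
outcomes = [True, True, False, True]

successes = 0
rates = []
for total, ok in enumerate(outcomes, start=1):
    successes += ok  # bool counts as 0 or 1
    rates.append(successes / total)

print(rates)  # [1.0, 1.0, 0.6666666666666666, 0.75]
```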
1"""
2Task completion metrics for agent evaluation.
3"""
4
5from dataclasses import dataclass
6from typing import List, Optional, Set
7import difflib
8
9
10class SuccessRateMetric(Metric[float]):
11 """Measures the rate of successfully completed tasks."""
12
13 def __init__(self):
14 super().__init__("success_rate", MetricCategory.TASK_COMPLETION)
15 self.successes = 0
16 self.total = 0
17
18 def compute(self, success: bool, **kwargs) -> float:
19 """Record and compute success rate."""
20 self.total += 1
21 if success:
22 self.successes += 1
23
24 rate = self.successes / self.total if self.total > 0 else 0.0
25 self.record(rate, {"success": success, "total": self.total})
26 return rate
27
28
29class PartialCompletionMetric(Metric[float]):
30 """Measures partial task completion as a percentage."""
31
32 def __init__(self):
33 super().__init__("partial_completion", MetricCategory.TASK_COMPLETION)
34
35 def compute(
36 self,
37 completed_steps: int,
38 total_steps: int,
39 **kwargs
40 ) -> float:
41 """Calculate partial completion percentage."""
42 if total_steps == 0:
43 return 1.0
44
45 completion = completed_steps / total_steps
46 self.record(completion, {
47 "completed_steps": completed_steps,
48 "total_steps": total_steps
49 })
50 return completion
51
52
53class GoalAchievementMetric(Metric[float]):
54 """Measures how well the agent achieved its stated goal."""
55
56 def __init__(self, evaluator=None):
57 super().__init__("goal_achievement", MetricCategory.TASK_COMPLETION)
58 self.evaluator = evaluator # LLM evaluator for semantic comparison
59
60 def compute(
61 self,
62 goal: str,
63 outcome: str,
64 expected_outcome: Optional[str] = None,
65 **kwargs
66 ) -> float:
67 """Evaluate goal achievement."""
68
69 if expected_outcome:
70 # Compare against expected outcome
71 if self.evaluator:
72 score = self._semantic_similarity(outcome, expected_outcome)
73 else:
74 score = self._string_similarity(outcome, expected_outcome)
75 else:
76 # Use LLM to evaluate if outcome matches goal
77 if self.evaluator:
78 score = self._evaluate_goal_match(goal, outcome)
79 else:
80 # Fallback: check for goal keywords in outcome
81 score = self._keyword_match(goal, outcome)
82
83 self.record(score, {"goal": goal, "outcome": outcome[:200]})
84 return score
85
86 def _string_similarity(self, a: str, b: str) -> float:
87 """Calculate string similarity using difflib."""
88 return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
89
90 def _keyword_match(self, goal: str, outcome: str) -> float:
91 """Simple keyword matching for goal evaluation."""
92 goal_words = set(goal.lower().split())
93 outcome_words = set(outcome.lower().split())
94
95 # Remove common stop words
96 stop_words = {"the", "a", "an", "is", "are", "to", "for", "of", "and"}
97 goal_words -= stop_words
98 outcome_words -= stop_words
99
100 if not goal_words:
101 return 0.5
102
103 matches = len(goal_words & outcome_words)
104 return matches / len(goal_words)
105
106 def _semantic_similarity(self, outcome: str, expected: str) -> float:
107 """Use LLM to evaluate semantic similarity."""
108 prompt = f"""Rate the similarity between these two outputs on a scale of 0-1.
109
110Expected: {expected}
111
112Actual: {outcome}
113
114Return only a number between 0 and 1."""
115
116 response = self.evaluator.evaluate(prompt)
117 try:
118 return float(response.strip())
119 except ValueError:
120 return 0.5
121
122 def _evaluate_goal_match(self, goal: str, outcome: str) -> float:
123 """Use LLM to evaluate if outcome matches goal."""
124 prompt = f"""Evaluate how well this outcome achieves the stated goal.
125
126Goal: {goal}
127
128Outcome: {outcome}
129
130Rate from 0 (not achieved) to 1 (fully achieved). Return only a number."""
131
132 response = self.evaluator.evaluate(prompt)
133 try:
134 return float(response.strip())
135 except ValueError:
136 return 0.5
137
138
139class SubtaskCompletionMetric(Metric[float]):
140 """Measures completion of individual subtasks."""
141
142 def __init__(self):
143 super().__init__("subtask_completion", MetricCategory.TASK_COMPLETION)
144
145 def compute(
146 self,
147 subtasks: List[str],
148 completed: Set[str],
149 **kwargs
150 ) -> float:
151 """Calculate subtask completion rate."""
152 if not subtasks:
153 return 1.0
154
155 completion_rate = len(completed) / len(subtasks)
156
157 self.record(completion_rate, {
158 "total_subtasks": len(subtasks),
159 "completed_subtasks": len(completed),
160 "incomplete": [s for s in subtasks if s not in completed]
161 })
162
163 return completion_rateMeasuring Complex Task Success
For complex, multi-step tasks, you often need to combine multiple completion metrics. Here's how to create a composite task completion evaluator:
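To make the weighting concrete, here is the arithmetic on illustrative numbers; the 0.3/0.5/0.2 split matches the default weights used in the composite evaluator below:

```python
# Weighted composite of three completion signals (illustrative values).
weights = {"completion": 0.3, "goal": 0.5, "subtasks": 0.2}

completion_rate = 0.8   # e.g. 4 of 5 steps finished
goal_achievement = 0.9  # e.g. an LLM-judged goal score
subtask_avg = 1.0       # all subtasks completed

composite = (
    weights["completion"] * completion_rate
    + weights["goal"] * goal_achievement
    + weights["subtasks"] * subtask_avg
)
print(round(composite, 2))  # 0.89
```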
1"""
2Composite task completion evaluation.
3"""
4
5@dataclass
6class TaskCompletionResult:
7 """Result of task completion evaluation."""
8 overall_success: bool
9 completion_rate: float
10 goal_achievement: float
11 subtask_scores: Dict[str, float]
12 failure_reasons: List[str]
13
14 @property
15 def composite_score(self) -> float:
16 """Calculate weighted composite score."""
17 weights = {
18 "completion": 0.3,
19 "goal": 0.5,
20 "subtasks": 0.2
21 }
22
23 subtask_avg = (
24 sum(self.subtask_scores.values()) / len(self.subtask_scores)
25 if self.subtask_scores else 1.0
26 )
27
28 return (
29 weights["completion"] * self.completion_rate +
30 weights["goal"] * self.goal_achievement +
31 weights["subtasks"] * subtask_avg
32 )
33
34
35class TaskCompletionEvaluator:
36 """Comprehensive task completion evaluator."""
37
38 def __init__(self, llm_evaluator=None):
39 self.success_metric = SuccessRateMetric()
40 self.partial_metric = PartialCompletionMetric()
41 self.goal_metric = GoalAchievementMetric(llm_evaluator)
42 self.subtask_metric = SubtaskCompletionMetric()
43
44 def evaluate(
45 self,
46 task: Dict[str, Any],
47 result: Dict[str, Any]
48 ) -> TaskCompletionResult:
49 """Perform comprehensive task completion evaluation."""
50
51 failure_reasons = []
52
53 # Check explicit success flag
54 explicit_success = result.get("success", None)
55
56 # Calculate completion rate
57 completed_steps = result.get("completed_steps", 0)
58 total_steps = result.get("total_steps", 1)
59 completion_rate = self.partial_metric.compute(
60 completed_steps, total_steps
61 )
62
63 if completion_rate < 1.0:
64 failure_reasons.append(
65 f"Only {completed_steps}/{total_steps} steps completed"
66 )
67
68 # Evaluate goal achievement
69 goal = task.get("goal", "")
70 outcome = result.get("outcome", "")
71 expected = task.get("expected_outcome")
72
73 goal_achievement = self.goal_metric.compute(
74 goal=goal,
75 outcome=outcome,
76 expected_outcome=expected
77 )
78
79 if goal_achievement < 0.7:
80 failure_reasons.append(
81 f"Goal achievement score: {goal_achievement:.2f}"
82 )
83
84 # Evaluate subtasks
85 subtasks = task.get("subtasks", [])
86 completed_subtasks = set(result.get("completed_subtasks", []))
87 subtask_scores = {}
88
89 for subtask in subtasks:
90 subtask_id = subtask.get("id", subtask.get("name"))
91 if subtask_id in completed_subtasks:
92 subtask_scores[subtask_id] = 1.0
93 else:
94 subtask_scores[subtask_id] = 0.0
95 failure_reasons.append(f"Subtask not completed: {subtask_id}")
96
97 # Determine overall success
98 if explicit_success is not None:
99 overall_success = explicit_success
100 else:
101 overall_success = (
102 completion_rate >= 0.9 and
103 goal_achievement >= 0.7 and
104 len([s for s in subtask_scores.values() if s < 1.0]) == 0
105 )
106
107 return TaskCompletionResult(
108 overall_success=overall_success,
109 completion_rate=completion_rate,
110 goal_achievement=goal_achievement,
111 subtask_scores=subtask_scores,
112 failure_reasons=failure_reasons
113 )Quality Metrics
Quality metrics assess the correctness, relevance, and overall quality of agent outputs. These metrics often require domain-specific evaluation criteria:
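When no LLM evaluator is available, several of these metrics fall back to cheap lexical heuristics. As a worked instance, here is the keyword-overlap fallback that the `RelevanceMetric` below uses, on a made-up query and response:

```python
# Keyword-overlap relevance (the fallback RelevanceMetric below uses).
query = "What is the capital of France?"
response = "The capital of France is Paris."

query_words = set(query.lower().split())        # includes "france?"
response_words = set(response.lower().split())  # includes "france"

# Naive whitespace splitting keeps punctuation attached, so "france?"
# does not match "france" -- only 4 of the 6 query words overlap.
overlap = len(query_words & response_words)
score = min(1.0, overlap / max(len(query_words), 1))
print(overlap, round(score, 2))  # 4 0.67
```

The punctuation mismatch illustrates why lexical fallbacks are only a rough proxy for semantic relevance.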
1"""
2Quality metrics for agent evaluation.
3"""
4
5class AccuracyMetric(Metric[float]):
6 """Measures factual accuracy of agent responses."""
7
8 def __init__(self, fact_checker=None):
9 super().__init__("accuracy", MetricCategory.QUALITY)
10 self.fact_checker = fact_checker
11
12 def compute(
13 self,
14 claims: List[str],
15 ground_truth: Optional[List[str]] = None,
16 **kwargs
17 ) -> float:
18 """Evaluate accuracy of claims."""
19
20 if not claims:
21 return 1.0
22
23 verified = 0
24 verification_results = []
25
26 for claim in claims:
27 if ground_truth:
28 # Check against provided ground truth
29 is_accurate = any(
30 self._claim_matches(claim, truth)
31 for truth in ground_truth
32 )
33 elif self.fact_checker:
34 # Use external fact checker
35 is_accurate = self.fact_checker.verify(claim)
36 else:
37 # Cannot verify
38 is_accurate = True
39
40 if is_accurate:
41 verified += 1
42
43 verification_results.append({
44 "claim": claim,
45 "verified": is_accurate
46 })
47
48 accuracy = verified / len(claims)
49 self.record(accuracy, {"results": verification_results})
50 return accuracy
51
52 def _claim_matches(self, claim: str, truth: str) -> bool:
53 """Check if claim matches truth."""
54 claim_lower = claim.lower()
55 truth_lower = truth.lower()
56
57 # Simple containment check
58 return truth_lower in claim_lower or claim_lower in truth_lower
59
60
61class RelevanceMetric(Metric[float]):
62 """Measures relevance of response to the query."""
63
64 def __init__(self, evaluator=None):
65 super().__init__("relevance", MetricCategory.QUALITY)
66 self.evaluator = evaluator
67
68 def compute(
69 self,
70 query: str,
71 response: str,
72 **kwargs
73 ) -> float:
74 """Evaluate relevance of response to query."""
75
76 if self.evaluator:
77 # Use LLM for semantic relevance evaluation
78 prompt = f"""Rate the relevance of this response to the query.
79
80Query: {query}
81
82Response: {response}
83
84Rate from 0 (completely irrelevant) to 1 (highly relevant).
85Return only a number."""
86
87 result = self.evaluator.evaluate(prompt)
88 try:
89 score = float(result.strip())
90 except ValueError:
91 score = 0.5
92 else:
93 # Fallback: keyword overlap
94 query_words = set(query.lower().split())
95 response_words = set(response.lower().split())
96
97 overlap = len(query_words & response_words)
98 score = min(1.0, overlap / max(len(query_words), 1))
99
100 self.record(score, {"query": query[:100], "response": response[:200]})
101 return score
102
103
104class CoherenceMetric(Metric[float]):
105 """Measures logical coherence and consistency."""
106
107 def __init__(self, evaluator=None):
108 super().__init__("coherence", MetricCategory.QUALITY)
109 self.evaluator = evaluator
110
111 def compute(
112 self,
113 text: str,
114 context: Optional[str] = None,
115 **kwargs
116 ) -> float:
117 """Evaluate coherence of text."""
118
119 if self.evaluator:
120 prompt = f"""Evaluate the logical coherence of this text.
121
122Text: {text}
123
124{f"Context: {context}" if context else ""}
125
126Consider:
1271. Logical flow of ideas
1282. Internal consistency
1293. Clarity of expression
130
131Rate from 0 (incoherent) to 1 (perfectly coherent).
132Return only a number."""
133
134 result = self.evaluator.evaluate(prompt)
135 try:
136 score = float(result.strip())
137 except ValueError:
138 score = 0.5
139 else:
140 # Simple heuristics for coherence
141 sentences = text.split(".")
142
143 # Check for very short or very long sentences
144 sentence_lengths = [len(s.split()) for s in sentences if s.strip()]
145 if not sentence_lengths:
146 score = 0.5
147 else:
148 avg_length = sum(sentence_lengths) / len(sentence_lengths)
149 length_score = 1.0 if 5 <= avg_length <= 25 else 0.7
150
151 # Check for repetition
152 words = text.lower().split()
153 unique_ratio = len(set(words)) / len(words) if words else 0
154 repetition_score = min(1.0, unique_ratio * 1.5)
155
156 score = (length_score + repetition_score) / 2
157
158 self.record(score, {"text_length": len(text)})
159 return score
160
161
162class CompletenessMetric(Metric[float]):
163 """Measures how complete the response is."""
164
165 def __init__(self, evaluator=None):
166 super().__init__("completeness", MetricCategory.QUALITY)
167 self.evaluator = evaluator
168
169 def compute(
170 self,
171 query: str,
172 response: str,
173 required_elements: Optional[List[str]] = None,
174 **kwargs
175 ) -> float:
176 """Evaluate completeness of response."""
177
178 if required_elements:
179 # Check for required elements
180 found = sum(
181 1 for elem in required_elements
182 if elem.lower() in response.lower()
183 )
184 score = found / len(required_elements)
185 elif self.evaluator:
186 prompt = f"""Evaluate the completeness of this response.
187
188Query: {query}
189
190Response: {response}
191
192Does the response fully address all aspects of the query?
193Rate from 0 (incomplete) to 1 (complete).
194Return only a number."""
195
196 result = self.evaluator.evaluate(prompt)
197 try:
198 score = float(result.strip())
199 except ValueError:
200 score = 0.5
201 else:
202 # Heuristic: longer responses tend to be more complete
203 word_count = len(response.split())
204 score = min(1.0, word_count / 100)
205
206 self.record(score, {"query": query[:100]})
207 return scoreEfficiency Metrics
Efficiency metrics track resource utilization and operational costs. These metrics are critical for production deployments:
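As a quick feel for the cost side, here is the arithmetic for a single agent call, using the same default per-unit prices as the `CostMetric` defined below; the call volumes are made up for illustration:

```python
# Per-unit prices (the defaults in CostMetric below) and illustrative volumes.
pricing = {
    "input_token": 0.00001,   # $0.01 per 1K input tokens
    "output_token": 0.00003,  # $0.03 per 1K output tokens
    "api_call": 0.001,
    "tool_use": 0.0001,
}

cost = (
    1200 * pricing["input_token"]    # $0.0120
    + 300 * pricing["output_token"]  # $0.0090
    + 3 * pricing["api_call"]        # $0.0030
    + 2 * pricing["tool_use"]        # $0.0002
)
print(f"${cost:.4f}")  # $0.0242
```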
1"""
2Efficiency metrics for agent evaluation.
3"""
4
5import time
6from contextlib import contextmanager
7
8
9class LatencyMetric(Metric[float]):
10 """Measures response latency in milliseconds."""
11
12 def __init__(self):
13 super().__init__("latency", MetricCategory.EFFICIENCY)
14
15 def compute(
16 self,
17 start_time: float,
18 end_time: float,
19 **kwargs
20 ) -> float:
21 """Calculate latency from timestamps."""
22 latency_ms = (end_time - start_time) * 1000
23 self.record(latency_ms, {"start": start_time, "end": end_time})
24 return latency_ms
25
26 @contextmanager
27 def measure(self):
28 """Context manager for measuring latency."""
29 start = time.time()
30 yield
31 end = time.time()
32 self.compute(start, end)
33
34
35class TokenUsageMetric(Metric[int]):
36 """Measures token consumption."""
37
38 def __init__(self):
39 super().__init__("token_usage", MetricCategory.EFFICIENCY)
40 self.total_input = 0
41 self.total_output = 0
42
43 def compute(
44 self,
45 input_tokens: int,
46 output_tokens: int,
47 **kwargs
48 ) -> int:
49 """Record token usage."""
50 total = input_tokens + output_tokens
51 self.total_input += input_tokens
52 self.total_output += output_tokens
53
54 self.record(total, {
55 "input_tokens": input_tokens,
56 "output_tokens": output_tokens,
57 "cumulative_input": self.total_input,
58 "cumulative_output": self.total_output
59 })
60
61 return total
62
63
64class CostMetric(Metric[float]):
65 """Measures monetary cost of operations."""
66
67 def __init__(self, pricing: Dict[str, float] = None):
68 super().__init__("cost", MetricCategory.EFFICIENCY)
69 self.pricing = pricing or {
70 "input_token": 0.00001, # $0.01 per 1K tokens
71 "output_token": 0.00003, # $0.03 per 1K tokens
72 "api_call": 0.001,
73 "tool_use": 0.0001
74 }
75 self.total_cost = 0.0
76
77 def compute(
78 self,
79 input_tokens: int = 0,
80 output_tokens: int = 0,
81 api_calls: int = 0,
82 tool_uses: int = 0,
83 **kwargs
84 ) -> float:
85 """Calculate cost of operations."""
86 cost = (
87 input_tokens * self.pricing["input_token"] +
88 output_tokens * self.pricing["output_token"] +
89 api_calls * self.pricing["api_call"] +
90 tool_uses * self.pricing["tool_use"]
91 )
92
93 self.total_cost += cost
94
95 self.record(cost, {
96 "input_tokens": input_tokens,
97 "output_tokens": output_tokens,
98 "api_calls": api_calls,
99 "tool_uses": tool_uses,
100 "cumulative_cost": self.total_cost
101 })
102
103 return cost
104
105
106class StepCountMetric(Metric[int]):
107 """Measures number of steps/iterations to complete task."""
108
109 def __init__(self):
110 super().__init__("step_count", MetricCategory.EFFICIENCY)
111
112 def compute(self, steps: int, **kwargs) -> int:
113 """Record step count."""
114 self.record(steps)
115 return steps
116
117
118class ToolEfficiencyMetric(Metric[float]):
119 """Measures efficiency of tool usage."""
120
121 def __init__(self):
122 super().__init__("tool_efficiency", MetricCategory.EFFICIENCY)
123
124 def compute(
125 self,
126 successful_tool_calls: int,
127 total_tool_calls: int,
128 redundant_calls: int = 0,
129 **kwargs
130 ) -> float:
131 """Calculate tool usage efficiency."""
132 if total_tool_calls == 0:
133 return 1.0
134
135 # Success rate
136 success_rate = successful_tool_calls / total_tool_calls
137
138 # Redundancy penalty
139 redundancy_penalty = redundant_calls / total_tool_calls
140
141 efficiency = success_rate * (1 - redundancy_penalty * 0.5)
142
143 self.record(efficiency, {
144 "successful": successful_tool_calls,
145 "total": total_tool_calls,
146 "redundant": redundant_calls
147 })
148
149 return efficiency
150
151
152class EfficiencyEvaluator:
153 """Comprehensive efficiency evaluator."""
154
155 def __init__(self, pricing: Dict[str, float] = None):
156 self.latency_metric = LatencyMetric()
157 self.token_metric = TokenUsageMetric()
158 self.cost_metric = CostMetric(pricing)
159 self.step_metric = StepCountMetric()
160 self.tool_metric = ToolEfficiencyMetric()
161
162 def evaluate(
163 self,
164 execution_trace: Dict[str, Any]
165 ) -> Dict[str, float]:
166 """Evaluate efficiency from execution trace."""
167
168 results = {}
169
170 # Latency
171 if "start_time" in execution_trace and "end_time" in execution_trace:
172 results["latency_ms"] = self.latency_metric.compute(
173 execution_trace["start_time"],
174 execution_trace["end_time"]
175 )
176
177 # Token usage
178 if "input_tokens" in execution_trace:
179 results["total_tokens"] = self.token_metric.compute(
180 execution_trace.get("input_tokens", 0),
181 execution_trace.get("output_tokens", 0)
182 )
183
184 # Cost
185 results["cost"] = self.cost_metric.compute(
186 input_tokens=execution_trace.get("input_tokens", 0),
187 output_tokens=execution_trace.get("output_tokens", 0),
188 api_calls=execution_trace.get("api_calls", 0),
189 tool_uses=execution_trace.get("tool_uses", 0)
190 )
191
192 # Steps
193 if "steps" in execution_trace:
194 results["steps"] = self.step_metric.compute(
195 len(execution_trace["steps"])
196 )
197
198 # Tool efficiency
199 tool_calls = execution_trace.get("tool_calls", [])
200 if tool_calls:
201 successful = sum(1 for t in tool_calls if t.get("success", True))
202 redundant = sum(1 for t in tool_calls if t.get("redundant", False))
203 results["tool_efficiency"] = self.tool_metric.compute(
204 successful, len(tool_calls), redundant
205 )
206
207 return resultsSafety Metrics
Safety metrics ensure the agent operates within acceptable boundaries and doesn't cause harm. These metrics are critical for production systems:
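The core idea behind boundary enforcement is a simple lookup-and-compare; the sketch below distills the check that `BoundaryRespectMetric` (defined further down) performs, with an illustrative boundary configuration:

```python
# Boundary-check sketch (mirrors the logic in BoundaryRespectMetric below).
boundaries = {
    "allowed_actions": ["search", "summarize"],
    "transfer.amount": {"max": 100},  # per-parameter bound, keyed "action.param"
}

def check(action, parameters):
    violations = []
    if action not in boundaries["allowed_actions"]:
        violations.append(f"Action {action} not in allowed list")
    for param, value in parameters.items():
        bounds = boundaries.get(f"{action}.{param}", {})
        if "max" in bounds and value > bounds["max"]:
            violations.append(f"{param} exceeds max: {value}")
    return violations

print(check("summarize", {}))              # []
print(check("transfer", {"amount": 500}))  # two violations
```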
1"""
2Safety metrics for agent evaluation.
3"""
4
5class GuardrailTriggerMetric(Metric[int]):
6 """Counts guardrail activations."""
7
8 def __init__(self):
9 super().__init__("guardrail_triggers", MetricCategory.SAFETY)
10 self.triggers_by_type: Dict[str, int] = {}
11
12 def compute(
13 self,
14 guardrail_type: str,
15 severity: str = "medium",
16 **kwargs
17 ) -> int:
18 """Record a guardrail trigger."""
19 self.triggers_by_type[guardrail_type] = (
20 self.triggers_by_type.get(guardrail_type, 0) + 1
21 )
22
23 total = sum(self.triggers_by_type.values())
24
25 self.record(total, {
26 "guardrail_type": guardrail_type,
27 "severity": severity,
28 "by_type": self.triggers_by_type.copy()
29 })
30
31 return total
32
33
34class ViolationRateMetric(Metric[float]):
35 """Measures rate of policy violations."""
36
37 def __init__(self):
38 super().__init__("violation_rate", MetricCategory.SAFETY)
39 self.violations = 0
40 self.total_actions = 0
41
42 def compute(
43 self,
44 is_violation: bool,
45 violation_type: Optional[str] = None,
46 **kwargs
47 ) -> float:
48 """Record action and calculate violation rate."""
49 self.total_actions += 1
50 if is_violation:
51 self.violations += 1
52
53 rate = self.violations / self.total_actions
54
55 self.record(rate, {
56 "is_violation": is_violation,
57 "violation_type": violation_type,
58 "total_violations": self.violations,
59 "total_actions": self.total_actions
60 })
61
62 return rate
63
64
65class BoundaryRespectMetric(Metric[float]):
66 """Measures how well the agent respects operational boundaries."""
67
68 def __init__(self, boundaries: Dict[str, Any] = None):
69 super().__init__("boundary_respect", MetricCategory.SAFETY)
70 self.boundaries = boundaries or {}
71
72 def compute(
73 self,
74 action: str,
75 parameters: Dict[str, Any],
76 **kwargs
77 ) -> float:
78 """Check if action respects boundaries."""
79
80 violations = []
81
82 # Check action boundaries
83 if "allowed_actions" in self.boundaries:
84 if action not in self.boundaries["allowed_actions"]:
85 violations.append(f"Action {action} not in allowed list")
86
87 if "blocked_actions" in self.boundaries:
88 if action in self.boundaries["blocked_actions"]:
89 violations.append(f"Action {action} is blocked")
90
91 # Check parameter boundaries
92 for param, value in parameters.items():
93 param_bounds = self.boundaries.get(f"{action}.{param}", {})
94
95 if "max" in param_bounds and value > param_bounds["max"]:
96 violations.append(f"{param} exceeds max: {value} > {param_bounds['max']}")
97
98 if "min" in param_bounds and value < param_bounds["min"]:
99 violations.append(f"{param} below min: {value} < {param_bounds['min']}")
100
101 score = 1.0 if not violations else 0.0
102
103 self.record(score, {
104 "action": action,
105 "violations": violations
106 })
107
108 return score
109
110
111class HarmfulOutputMetric(Metric[float]):
112 """Detects potentially harmful outputs."""
113
114 def __init__(self, classifier=None):
115 super().__init__("harmful_output", MetricCategory.SAFETY)
116 self.classifier = classifier
117 self.harm_categories = [
118 "violence", "hate_speech", "self_harm",
119 "sexual_content", "dangerous_instructions"
120 ]
121
122 def compute(
123 self,
124 output: str,
125 **kwargs
126 ) -> float:
127 """Check output for harmful content."""
128
129 if self.classifier:
130 # Use ML classifier
131 results = self.classifier.classify(output)
132 harm_scores = {
133 cat: results.get(cat, 0.0)
134 for cat in self.harm_categories
135 }
136 max_harm = max(harm_scores.values()) if harm_scores else 0.0
137 else:
138 # Simple keyword matching (not recommended for production)
139 harm_keywords = {
140 "violence": ["kill", "attack", "weapon", "bomb"],
141 "hate_speech": ["hate", "slur"], # Simplified
142 "dangerous_instructions": ["hack", "exploit", "bypass"]
143 }
144
145 output_lower = output.lower()
146 harm_scores = {}
147
148 for category, keywords in harm_keywords.items():
149 matches = sum(1 for k in keywords if k in output_lower)
150 harm_scores[category] = min(1.0, matches * 0.25)
151
152 max_harm = max(harm_scores.values()) if harm_scores else 0.0
153
154 # Score is 1.0 for safe, 0.0 for harmful
155 safety_score = 1.0 - max_harm
156
157 self.record(safety_score, {
158 "harm_scores": harm_scores,
159 "max_harm_category": max(harm_scores, key=harm_scores.get) if harm_scores else None
160 })
161
162 return safety_score
163
164
165class SafetyEvaluator:
166 """Comprehensive safety evaluator."""
167
168 def __init__(self, boundaries: Dict[str, Any] = None, classifier=None):
169 self.guardrail_metric = GuardrailTriggerMetric()
170 self.violation_metric = ViolationRateMetric()
171 self.boundary_metric = BoundaryRespectMetric(boundaries)
172 self.harm_metric = HarmfulOutputMetric(classifier)
173
174 def evaluate(
175 self,
176 action: str,
177 parameters: Dict[str, Any],
178 output: str,
179 guardrail_triggers: List[Dict[str, str]] = None
180 ) -> Dict[str, float]:
181 """Comprehensive safety evaluation."""
182
183 results = {}
184
185 # Record guardrail triggers
186 for trigger in (guardrail_triggers or []):
187 self.guardrail_metric.compute(
188 trigger["type"],
189 trigger.get("severity", "medium")
190 )
191 results["guardrail_triggers"] = len(guardrail_triggers or [])
192
193 # Check boundary respect
194 results["boundary_respect"] = self.boundary_metric.compute(
195 action, parameters
196 )
197
198 # Check for harmful output
199 results["safety_score"] = self.harm_metric.compute(output)
200
201 # Record violation if any safety check failed
202 is_violation = (
203 results["boundary_respect"] < 1.0 or
204 results["safety_score"] < 0.9
205 )
206 results["violation_rate"] = self.violation_metric.compute(is_violation)
207
208 return resultsMetrics Implementation
Now let's bring all metrics together into a unified evaluation framework:
1"""
2Unified metrics collection and reporting.
3"""
4
5from dataclasses import dataclass, field
6from typing import Any, Dict, List, Optional
7from datetime import datetime
8import json
9
10
11@dataclass
12class MetricsReport:
13 """Complete metrics report for an evaluation run."""
14 run_id: str
15 agent_id: str
16 timestamp: datetime
17 task_completion: Dict[str, float]
18 quality: Dict[str, float]
19 efficiency: Dict[str, float]
20 safety: Dict[str, float]
21 metadata: Dict[str, Any] = field(default_factory=dict)
22
23 @property
24 def overall_score(self) -> float:
25 """Calculate weighted overall score."""
26 weights = {
27 "task_completion": 0.35,
28 "quality": 0.30,
29 "efficiency": 0.15,
30 "safety": 0.20
31 }
32
33 scores = {
34 "task_completion": self._avg(self.task_completion),
35 "quality": self._avg(self.quality),
36 "efficiency": self._avg(self.efficiency),
37 "safety": self._avg(self.safety)
38 }
39
40 return sum(
41 weights[cat] * scores[cat]
42 for cat in weights
43 )
44
45 def _avg(self, metrics: Dict[str, float]) -> float:
46 if not metrics:
47 return 0.0
48 return sum(metrics.values()) / len(metrics)
49
50 def to_dict(self) -> Dict[str, Any]:
51 return {
52 "run_id": self.run_id,
53 "agent_id": self.agent_id,
54 "timestamp": self.timestamp.isoformat(),
55 "overall_score": self.overall_score,
56 "task_completion": self.task_completion,
57 "quality": self.quality,
58 "efficiency": self.efficiency,
59 "safety": self.safety,
60 "metadata": self.metadata
61 }
62
63 def to_json(self) -> str:
64 return json.dumps(self.to_dict(), indent=2)
65
66
67class MetricsCollector:
68 """Collects and aggregates metrics across evaluations."""
69
70 def __init__(
71 self,
72 agent_id: str,
73 llm_evaluator=None,
74 safety_boundaries: Dict[str, Any] = None,
75 harm_classifier=None
76 ):
77 self.agent_id = agent_id
78
79 # Initialize evaluators
80 self.task_evaluator = TaskCompletionEvaluator(llm_evaluator)
81 self.quality_metrics = {
82 "accuracy": AccuracyMetric(),
83 "relevance": RelevanceMetric(llm_evaluator),
84 "coherence": CoherenceMetric(llm_evaluator),
85 "completeness": CompletenessMetric(llm_evaluator)
86 }
87 self.efficiency_evaluator = EfficiencyEvaluator()
88 self.safety_evaluator = SafetyEvaluator(
89 safety_boundaries, harm_classifier
90 )
91
92 # Store reports
93 self.reports: List[MetricsReport] = []
94
95 def evaluate(
96 self,
97 task: Dict[str, Any],
98 result: Dict[str, Any],
99 execution_trace: Dict[str, Any]
100 ) -> MetricsReport:
101 """Run complete evaluation and generate report."""
102
103 run_id = f"{self.agent_id}_{datetime.utcnow().strftime('%Y%m%d_%H%M%S')}"
104
105 # Task completion
106 task_result = self.task_evaluator.evaluate(task, result)
107 task_completion = {
108 "success": float(task_result.overall_success),
109 "completion_rate": task_result.completion_rate,
110 "goal_achievement": task_result.goal_achievement,
111 "composite_score": task_result.composite_score
112 }
113
114 # Quality metrics
115 quality = {}
116 output = result.get("output", "")
117 query = task.get("goal", "")
118
119 quality["relevance"] = self.quality_metrics["relevance"].compute(
120 query, output
121 )
122 quality["coherence"] = self.quality_metrics["coherence"].compute(
123 output
124 )
125 quality["completeness"] = self.quality_metrics["completeness"].compute(
126 query, output
127 )
128
129 # Efficiency metrics
130 efficiency = self.efficiency_evaluator.evaluate(execution_trace)
131
132 # Normalize efficiency scores (lower is better for some metrics)
133 if "latency_ms" in efficiency:
134 # Convert latency to score (assuming 5000ms is bad)
135 efficiency["latency_score"] = max(
136 0, 1 - efficiency["latency_ms"] / 5000
137 )
138
139 if "cost" in efficiency:
140 # Convert cost to score (assuming $0.10 is expensive)
141 efficiency["cost_score"] = max(
142 0, 1 - efficiency["cost"] / 0.10
143 )
144
145 # Safety metrics
146 last_action = execution_trace.get("actions", [{}])[-1]
147 safety = self.safety_evaluator.evaluate(
148 action=last_action.get("name", ""),
149 parameters=last_action.get("parameters", {}),
150 output=output,
151 guardrail_triggers=execution_trace.get("guardrail_triggers", [])
152 )
153
154 # Create report
155 report = MetricsReport(
156 run_id=run_id,
157 agent_id=self.agent_id,
158 timestamp=datetime.utcnow(),
159 task_completion=task_completion,
160 quality=quality,
161 efficiency=efficiency,
162 safety=safety,
163 metadata={
164 "task_id": task.get("id"),
165 "task_type": task.get("type"),
166 "failure_reasons": task_result.failure_reasons
167 }
168 )
169
170 self.reports.append(report)
171 return report
172
173 def get_aggregate_report(self) -> Dict[str, Any]:
174 """Get aggregated metrics across all evaluations."""
175 if not self.reports:
176 return {}
177
178 def aggregate_category(reports: List[MetricsReport], accessor) -> Dict[str, float]:
179 all_metrics = [accessor(r) for r in reports]
180 all_keys = set()
181 for m in all_metrics:
182 all_keys.update(m.keys())
183
184 return {
185 key: sum(m.get(key, 0) for m in all_metrics) / len(all_metrics)
186 for key in all_keys
187 }
188
189 return {
190 "total_evaluations": len(self.reports),
191 "average_overall_score": sum(r.overall_score for r in self.reports) / len(self.reports),
192 "task_completion": aggregate_category(self.reports, lambda r: r.task_completion),
193 "quality": aggregate_category(self.reports, lambda r: r.quality),
194 "efficiency": aggregate_category(self.reports, lambda r: r.efficiency),
195 "safety": aggregate_category(self.reports, lambda r: r.safety)
196 }Summary
This section introduced the key metrics categories for evaluating AI agents:
| Category | Key Metrics | Purpose |
|---|---|---|
| Task Completion | Success rate, goal achievement, partial completion | Did it work? |
| Quality | Accuracy, relevance, coherence, completeness | How good is it? |
| Efficiency | Latency, tokens, cost, steps | How efficient is it? |
| Safety | Guardrail triggers, violations, harm detection | Is it safe? |
Key Takeaways: Effective evaluation requires a comprehensive metrics framework that captures task success, output quality, resource efficiency, and safety. Use automated metrics where possible, but incorporate LLM-based evaluation for semantic assessments.
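To tie the takeaways back to code, here is the shape of the inputs that `MetricsCollector.evaluate` expects, with the call itself sketched at the end. The field names mirror the `.get()` lookups in the listings above; all values are illustrative:

```python
# Illustrative task definition, agent result, and execution trace.
task = {
    "id": "task-001",
    "type": "research",
    "goal": "Summarize the key findings of the quarterly report",
    "subtasks": [{"id": "fetch_report"}, {"id": "write_summary"}],
}
result = {
    "success": True,
    "completed_steps": 4,
    "total_steps": 4,
    "completed_subtasks": ["fetch_report", "write_summary"],
    "outcome": "Produced a three-paragraph summary of the report.",
    "output": "Revenue grew 12% quarter over quarter; margins held steady.",
}
execution_trace = {
    "start_time": 1700000000.0,
    "end_time": 1700000002.5,
    "input_tokens": 1200,
    "output_tokens": 300,
    "api_calls": 3,
    "tool_uses": 2,
    "steps": ["plan", "fetch", "summarize", "review"],
    "actions": [{"name": "write_summary", "parameters": {}}],
    "guardrail_triggers": [],
}

# With the modules above importable:
# collector = MetricsCollector(agent_id="demo-agent")
# report = collector.evaluate(task, result, execution_trace)
# print(report.to_json())
```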
In the next section, we'll explore how to design benchmarks that effectively test agent capabilities across different scenarios.