Chapter 20

Evaluation Metrics

Evaluation and Benchmarking

Introduction

Effective evaluation is the foundation of building reliable AI agents. Without proper metrics, you cannot measure progress, identify regressions, or compare different approaches. This section introduces the key categories of metrics for evaluating agentic systems and shows how to implement them.

Why Metrics Matter: What gets measured gets improved. Clear, well-defined metrics enable data-driven decisions about agent development, deployment, and optimization.

Agent evaluation differs from traditional ML evaluation in several ways: agents perform multi-step tasks, interact with external systems, and must balance multiple objectives simultaneously. This requires a comprehensive metrics framework that captures these complexities.


Metric Categories

Agent metrics fall into five main categories, each capturing different aspects of agent behavior and performance:

| Category | Focus | Examples |
| --- | --- | --- |
| Task Completion | Did the agent achieve the goal? | Success rate, partial completion |
| Quality | How good was the output? | Accuracy, relevance, coherence |
| Efficiency | How well did it use resources? | Latency, token usage, cost |
| Safety | Did it operate within bounds? | Guardrail triggers, violations |
| User Experience | How satisfied are users? | NPS, completion time, retries |
```python
"""
Core metrics framework for AI agent evaluation.

This module defines the foundational classes and interfaces
for implementing agent evaluation metrics.
"""

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Any, Dict, List, Optional, Generic, TypeVar


class MetricCategory(Enum):
    """Categories of evaluation metrics."""
    TASK_COMPLETION = "task_completion"
    QUALITY = "quality"
    EFFICIENCY = "efficiency"
    SAFETY = "safety"
    USER_EXPERIENCE = "user_experience"


@dataclass
class MetricValue:
    """A single metric measurement."""
    name: str
    value: float
    category: MetricCategory
    timestamp: datetime = field(default_factory=datetime.utcnow)
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        return {
            "name": self.name,
            "value": self.value,
            "category": self.category.value,
            "timestamp": self.timestamp.isoformat(),
            "metadata": self.metadata
        }


T = TypeVar("T")


class Metric(ABC, Generic[T]):
    """Abstract base class for evaluation metrics."""

    def __init__(self, name: str, category: MetricCategory):
        self.name = name
        self.category = category
        self.values: List[MetricValue] = []

    @abstractmethod
    def compute(self, **kwargs) -> T:
        """Compute the metric value."""
        pass

    def record(self, value: T, metadata: Optional[Dict[str, Any]] = None):
        """Record a metric measurement."""
        metric_value = MetricValue(
            name=self.name,
            value=float(value),
            category=self.category,
            metadata=metadata or {}
        )
        self.values.append(metric_value)
        return metric_value

    def get_latest(self) -> Optional[MetricValue]:
        """Get the most recent measurement."""
        return self.values[-1] if self.values else None

    def get_average(self, window: Optional[int] = None) -> float:
        """Get average value over recent measurements."""
        if not self.values:
            return 0.0

        subset = self.values[-window:] if window else self.values
        return sum(v.value for v in subset) / len(subset)


@dataclass
class EvaluationResult:
    """Complete evaluation result for an agent task."""
    task_id: str
    agent_id: str
    metrics: Dict[str, MetricValue]
    timestamp: datetime = field(default_factory=datetime.utcnow)

    def get_score(self, category: Optional[MetricCategory] = None) -> float:
        """Calculate aggregate score, optionally filtered by category."""
        relevant = [
            m for m in self.metrics.values()
            if category is None or m.category == category
        ]

        if not relevant:
            return 0.0

        return sum(m.value for m in relevant) / len(relevant)

    def to_dict(self) -> Dict[str, Any]:
        return {
            "task_id": self.task_id,
            "agent_id": self.agent_id,
            "metrics": {k: v.to_dict() for k, v in self.metrics.items()},
            "timestamp": self.timestamp.isoformat(),
            "overall_score": self.get_score()
        }
```

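The windowed average in `Metric.get_average` drives most trend reporting, so it helps to see the logic in isolation. A minimal standalone sketch of the same computation, using plain floats instead of `MetricValue` objects (the sample scores are invented):

```python
def windowed_average(values, window=None):
    """Mirrors Metric.get_average: average everything, or only the last `window` entries."""
    if not values:
        return 0.0
    subset = values[-window:] if window else values
    return sum(subset) / len(subset)

scores = [0.2, 0.4, 0.6, 0.8, 1.0]   # hypothetical per-run scores
lifetime = windowed_average(scores)   # average over all five runs: 0.6
recent = windowed_average(scores, 2)  # average over the last two runs: 0.9
```

Comparing the lifetime average against a recent window is a quick way to spot trends: here the agent has been improving (0.9 recently versus 0.6 overall).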
Task Completion Metrics

Task completion metrics measure whether the agent successfully achieved its goal. These are the most fundamental metrics for any agent system:

```python
"""
Task completion metrics for agent evaluation.
"""

from typing import List, Optional, Set
import difflib


class SuccessRateMetric(Metric[float]):
    """Measures the rate of successfully completed tasks."""

    def __init__(self):
        super().__init__("success_rate", MetricCategory.TASK_COMPLETION)
        self.successes = 0
        self.total = 0

    def compute(self, success: bool, **kwargs) -> float:
        """Record and compute success rate."""
        self.total += 1
        if success:
            self.successes += 1

        rate = self.successes / self.total if self.total > 0 else 0.0
        self.record(rate, {"success": success, "total": self.total})
        return rate


class PartialCompletionMetric(Metric[float]):
    """Measures partial task completion as a percentage."""

    def __init__(self):
        super().__init__("partial_completion", MetricCategory.TASK_COMPLETION)

    def compute(
        self,
        completed_steps: int,
        total_steps: int,
        **kwargs
    ) -> float:
        """Calculate partial completion percentage."""
        if total_steps == 0:
            return 1.0

        completion = completed_steps / total_steps
        self.record(completion, {
            "completed_steps": completed_steps,
            "total_steps": total_steps
        })
        return completion


class GoalAchievementMetric(Metric[float]):
    """Measures how well the agent achieved its stated goal."""

    def __init__(self, evaluator=None):
        super().__init__("goal_achievement", MetricCategory.TASK_COMPLETION)
        self.evaluator = evaluator  # LLM evaluator for semantic comparison

    def compute(
        self,
        goal: str,
        outcome: str,
        expected_outcome: Optional[str] = None,
        **kwargs
    ) -> float:
        """Evaluate goal achievement."""

        if expected_outcome:
            # Compare against expected outcome
            if self.evaluator:
                score = self._semantic_similarity(outcome, expected_outcome)
            else:
                score = self._string_similarity(outcome, expected_outcome)
        else:
            # Use LLM to evaluate if outcome matches goal
            if self.evaluator:
                score = self._evaluate_goal_match(goal, outcome)
            else:
                # Fallback: check for goal keywords in outcome
                score = self._keyword_match(goal, outcome)

        self.record(score, {"goal": goal, "outcome": outcome[:200]})
        return score

    def _string_similarity(self, a: str, b: str) -> float:
        """Calculate string similarity using difflib."""
        return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def _keyword_match(self, goal: str, outcome: str) -> float:
        """Simple keyword matching for goal evaluation."""
        goal_words = set(goal.lower().split())
        outcome_words = set(outcome.lower().split())

        # Remove common stop words
        stop_words = {"the", "a", "an", "is", "are", "to", "for", "of", "and"}
        goal_words -= stop_words
        outcome_words -= stop_words

        if not goal_words:
            return 0.5

        matches = len(goal_words & outcome_words)
        return matches / len(goal_words)

    def _semantic_similarity(self, outcome: str, expected: str) -> float:
        """Use LLM to evaluate semantic similarity."""
        prompt = f"""Rate the similarity between these two outputs on a scale of 0-1.

Expected: {expected}

Actual: {outcome}

Return only a number between 0 and 1."""

        response = self.evaluator.evaluate(prompt)
        try:
            return float(response.strip())
        except ValueError:
            return 0.5

    def _evaluate_goal_match(self, goal: str, outcome: str) -> float:
        """Use LLM to evaluate if outcome matches goal."""
        prompt = f"""Evaluate how well this outcome achieves the stated goal.

Goal: {goal}

Outcome: {outcome}

Rate from 0 (not achieved) to 1 (fully achieved). Return only a number."""

        response = self.evaluator.evaluate(prompt)
        try:
            return float(response.strip())
        except ValueError:
            return 0.5


class SubtaskCompletionMetric(Metric[float]):
    """Measures completion of individual subtasks."""

    def __init__(self):
        super().__init__("subtask_completion", MetricCategory.TASK_COMPLETION)

    def compute(
        self,
        subtasks: List[str],
        completed: Set[str],
        **kwargs
    ) -> float:
        """Calculate subtask completion rate."""
        if not subtasks:
            return 1.0

        completion_rate = len(completed) / len(subtasks)

        self.record(completion_rate, {
            "total_subtasks": len(subtasks),
            "completed_subtasks": len(completed),
            "incomplete": [s for s in subtasks if s not in completed]
        })

        return completion_rate
```

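The keyword-match fallback in `GoalAchievementMetric` is what you get when no LLM evaluator is configured, so it is worth sanity-checking on its own. A standalone sketch of the same logic, with an invented goal/outcome pair:

```python
def keyword_match(goal: str, outcome: str) -> float:
    """Fallback goal scoring: fraction of non-stop-word goal terms found in the outcome."""
    stop_words = {"the", "a", "an", "is", "are", "to", "for", "of", "and"}
    goal_words = set(goal.lower().split()) - stop_words
    outcome_words = set(outcome.lower().split()) - stop_words
    if not goal_words:
        return 0.5  # no signal either way
    return len(goal_words & outcome_words) / len(goal_words)

score = keyword_match(
    "summarize the quarterly report",
    "here is a summary of the quarterly report",
)
# Only "quarterly" and "report" match, so score == 2/3
```

Note that "summarize" and "summary" do not match under exact word overlap; this lack of stemming and synonymy is exactly why keyword matching is only a fallback and an LLM evaluator is preferred.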
Measuring Complex Task Success

For complex, multi-step tasks, you often need to combine multiple completion metrics. Here's how to create a composite task completion evaluator:

```python
"""
Composite task completion evaluation.
"""

@dataclass
class TaskCompletionResult:
    """Result of task completion evaluation."""
    overall_success: bool
    completion_rate: float
    goal_achievement: float
    subtask_scores: Dict[str, float]
    failure_reasons: List[str]

    @property
    def composite_score(self) -> float:
        """Calculate weighted composite score."""
        weights = {
            "completion": 0.3,
            "goal": 0.5,
            "subtasks": 0.2
        }

        subtask_avg = (
            sum(self.subtask_scores.values()) / len(self.subtask_scores)
            if self.subtask_scores else 1.0
        )

        return (
            weights["completion"] * self.completion_rate +
            weights["goal"] * self.goal_achievement +
            weights["subtasks"] * subtask_avg
        )


class TaskCompletionEvaluator:
    """Comprehensive task completion evaluator."""

    def __init__(self, llm_evaluator=None):
        self.success_metric = SuccessRateMetric()
        self.partial_metric = PartialCompletionMetric()
        self.goal_metric = GoalAchievementMetric(llm_evaluator)
        self.subtask_metric = SubtaskCompletionMetric()

    def evaluate(
        self,
        task: Dict[str, Any],
        result: Dict[str, Any]
    ) -> TaskCompletionResult:
        """Perform comprehensive task completion evaluation."""

        failure_reasons = []

        # Check explicit success flag
        explicit_success = result.get("success", None)

        # Calculate completion rate
        completed_steps = result.get("completed_steps", 0)
        total_steps = result.get("total_steps", 1)
        completion_rate = self.partial_metric.compute(
            completed_steps, total_steps
        )

        if completion_rate < 1.0:
            failure_reasons.append(
                f"Only {completed_steps}/{total_steps} steps completed"
            )

        # Evaluate goal achievement
        goal = task.get("goal", "")
        outcome = result.get("outcome", "")
        expected = task.get("expected_outcome")

        goal_achievement = self.goal_metric.compute(
            goal=goal,
            outcome=outcome,
            expected_outcome=expected
        )

        if goal_achievement < 0.7:
            failure_reasons.append(
                f"Goal achievement score: {goal_achievement:.2f}"
            )

        # Evaluate subtasks
        subtasks = task.get("subtasks", [])
        completed_subtasks = set(result.get("completed_subtasks", []))
        subtask_scores = {}

        for subtask in subtasks:
            subtask_id = subtask.get("id", subtask.get("name"))
            if subtask_id in completed_subtasks:
                subtask_scores[subtask_id] = 1.0
            else:
                subtask_scores[subtask_id] = 0.0
                failure_reasons.append(f"Subtask not completed: {subtask_id}")

        # Determine overall success
        if explicit_success is not None:
            overall_success = explicit_success
        else:
            overall_success = (
                completion_rate >= 0.9 and
                goal_achievement >= 0.7 and
                len([s for s in subtask_scores.values() if s < 1.0]) == 0
            )

        return TaskCompletionResult(
            overall_success=overall_success,
            completion_rate=completion_rate,
            goal_achievement=goal_achievement,
            subtask_scores=subtask_scores,
            failure_reasons=failure_reasons
        )
```

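To see how the weights in `composite_score` combine, here is the arithmetic spelled out with hypothetical scores (the subtask names and values are invented for illustration):

```python
weights = {"completion": 0.3, "goal": 0.5, "subtasks": 0.2}

completion_rate = 0.8                          # e.g. 4 of 5 steps finished
goal_achievement = 0.9                         # e.g. LLM-judged goal score
subtask_scores = {"fetch": 1.0, "parse": 0.0}  # one subtask missed

subtask_avg = sum(subtask_scores.values()) / len(subtask_scores)  # 0.5
composite = (
    weights["completion"] * completion_rate  # 0.24
    + weights["goal"] * goal_achievement     # 0.45
    + weights["subtasks"] * subtask_avg      # 0.10
)
# composite == 0.79
```

The goal weight dominates by design: an agent that completes every step but misses the goal should still score poorly, while a few skipped steps matter less if the goal was achieved.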
Quality Metrics

Quality metrics assess the correctness, relevance, and overall quality of agent outputs. These metrics often require domain-specific evaluation criteria:

```python
"""
Quality metrics for agent evaluation.
"""

class AccuracyMetric(Metric[float]):
    """Measures factual accuracy of agent responses."""

    def __init__(self, fact_checker=None):
        super().__init__("accuracy", MetricCategory.QUALITY)
        self.fact_checker = fact_checker

    def compute(
        self,
        claims: List[str],
        ground_truth: Optional[List[str]] = None,
        **kwargs
    ) -> float:
        """Evaluate accuracy of claims."""

        if not claims:
            return 1.0

        verified = 0
        verification_results = []

        for claim in claims:
            if ground_truth:
                # Check against provided ground truth
                is_accurate = any(
                    self._claim_matches(claim, truth)
                    for truth in ground_truth
                )
            elif self.fact_checker:
                # Use external fact checker
                is_accurate = self.fact_checker.verify(claim)
            else:
                # Cannot verify
                is_accurate = True

            if is_accurate:
                verified += 1

            verification_results.append({
                "claim": claim,
                "verified": is_accurate
            })

        accuracy = verified / len(claims)
        self.record(accuracy, {"results": verification_results})
        return accuracy

    def _claim_matches(self, claim: str, truth: str) -> bool:
        """Check if claim matches truth."""
        claim_lower = claim.lower()
        truth_lower = truth.lower()

        # Simple containment check
        return truth_lower in claim_lower or claim_lower in truth_lower


class RelevanceMetric(Metric[float]):
    """Measures relevance of response to the query."""

    def __init__(self, evaluator=None):
        super().__init__("relevance", MetricCategory.QUALITY)
        self.evaluator = evaluator

    def compute(
        self,
        query: str,
        response: str,
        **kwargs
    ) -> float:
        """Evaluate relevance of response to query."""

        if self.evaluator:
            # Use LLM for semantic relevance evaluation
            prompt = f"""Rate the relevance of this response to the query.

Query: {query}

Response: {response}

Rate from 0 (completely irrelevant) to 1 (highly relevant).
Return only a number."""

            result = self.evaluator.evaluate(prompt)
            try:
                score = float(result.strip())
            except ValueError:
                score = 0.5
        else:
            # Fallback: keyword overlap
            query_words = set(query.lower().split())
            response_words = set(response.lower().split())

            overlap = len(query_words & response_words)
            score = min(1.0, overlap / max(len(query_words), 1))

        self.record(score, {"query": query[:100], "response": response[:200]})
        return score


class CoherenceMetric(Metric[float]):
    """Measures logical coherence and consistency."""

    def __init__(self, evaluator=None):
        super().__init__("coherence", MetricCategory.QUALITY)
        self.evaluator = evaluator

    def compute(
        self,
        text: str,
        context: Optional[str] = None,
        **kwargs
    ) -> float:
        """Evaluate coherence of text."""

        if self.evaluator:
            prompt = f"""Evaluate the logical coherence of this text.

Text: {text}

{f"Context: {context}" if context else ""}

Consider:
1. Logical flow of ideas
2. Internal consistency
3. Clarity of expression

Rate from 0 (incoherent) to 1 (perfectly coherent).
Return only a number."""

            result = self.evaluator.evaluate(prompt)
            try:
                score = float(result.strip())
            except ValueError:
                score = 0.5
        else:
            # Simple heuristics for coherence
            sentences = text.split(".")

            # Check for very short or very long sentences
            sentence_lengths = [len(s.split()) for s in sentences if s.strip()]
            if not sentence_lengths:
                score = 0.5
            else:
                avg_length = sum(sentence_lengths) / len(sentence_lengths)
                length_score = 1.0 if 5 <= avg_length <= 25 else 0.7

                # Check for repetition
                words = text.lower().split()
                unique_ratio = len(set(words)) / len(words) if words else 0
                repetition_score = min(1.0, unique_ratio * 1.5)

                score = (length_score + repetition_score) / 2

        self.record(score, {"text_length": len(text)})
        return score


class CompletenessMetric(Metric[float]):
    """Measures how complete the response is."""

    def __init__(self, evaluator=None):
        super().__init__("completeness", MetricCategory.QUALITY)
        self.evaluator = evaluator

    def compute(
        self,
        query: str,
        response: str,
        required_elements: Optional[List[str]] = None,
        **kwargs
    ) -> float:
        """Evaluate completeness of response."""

        if required_elements:
            # Check for required elements
            found = sum(
                1 for elem in required_elements
                if elem.lower() in response.lower()
            )
            score = found / len(required_elements)
        elif self.evaluator:
            prompt = f"""Evaluate the completeness of this response.

Query: {query}

Response: {response}

Does the response fully address all aspects of the query?
Rate from 0 (incomplete) to 1 (complete).
Return only a number."""

            result = self.evaluator.evaluate(prompt)
            try:
                score = float(result.strip())
            except ValueError:
                score = 0.5
        else:
            # Heuristic: longer responses tend to be more complete
            word_count = len(response.split())
            score = min(1.0, word_count / 100)

        self.record(score, {"query": query[:100]})
        return score
```

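The keyword-overlap fallback in `RelevanceMetric` can also be exercised standalone. A sketch of the same computation with an invented query/response pair:

```python
def relevance_fallback(query: str, response: str) -> float:
    """Fallback relevance: fraction of distinct query words that appear in the response."""
    query_words = set(query.lower().split())
    response_words = set(response.lower().split())
    overlap = len(query_words & response_words)
    return min(1.0, overlap / max(len(query_words), 1))

score = relevance_fallback(
    "what is the capital of france",
    "the capital of france is paris",
)
# 5 of the 6 distinct query words appear in the response, so score == 5/6
```

Like the keyword fallback for goal achievement, this overestimates relevance for responses that parrot the query and underestimates it for good paraphrases, which is why the LLM-based path is preferred when an evaluator is available.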
Efficiency Metrics

Efficiency metrics track resource utilization and operational costs. These metrics are critical for production deployments:

```python
"""
Efficiency metrics for agent evaluation.
"""

import time
from contextlib import contextmanager


class LatencyMetric(Metric[float]):
    """Measures response latency in milliseconds."""

    def __init__(self):
        super().__init__("latency", MetricCategory.EFFICIENCY)

    def compute(
        self,
        start_time: float,
        end_time: float,
        **kwargs
    ) -> float:
        """Calculate latency from timestamps."""
        latency_ms = (end_time - start_time) * 1000
        self.record(latency_ms, {"start": start_time, "end": end_time})
        return latency_ms

    @contextmanager
    def measure(self):
        """Context manager for measuring latency."""
        start = time.time()
        yield
        end = time.time()
        self.compute(start, end)


class TokenUsageMetric(Metric[int]):
    """Measures token consumption."""

    def __init__(self):
        super().__init__("token_usage", MetricCategory.EFFICIENCY)
        self.total_input = 0
        self.total_output = 0

    def compute(
        self,
        input_tokens: int,
        output_tokens: int,
        **kwargs
    ) -> int:
        """Record token usage."""
        total = input_tokens + output_tokens
        self.total_input += input_tokens
        self.total_output += output_tokens

        self.record(total, {
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cumulative_input": self.total_input,
            "cumulative_output": self.total_output
        })

        return total


class CostMetric(Metric[float]):
    """Measures monetary cost of operations."""

    def __init__(self, pricing: Dict[str, float] = None):
        super().__init__("cost", MetricCategory.EFFICIENCY)
        self.pricing = pricing or {
            "input_token": 0.00001,   # $0.01 per 1K tokens
            "output_token": 0.00003,  # $0.03 per 1K tokens
            "api_call": 0.001,
            "tool_use": 0.0001
        }
        self.total_cost = 0.0

    def compute(
        self,
        input_tokens: int = 0,
        output_tokens: int = 0,
        api_calls: int = 0,
        tool_uses: int = 0,
        **kwargs
    ) -> float:
        """Calculate cost of operations."""
        cost = (
            input_tokens * self.pricing["input_token"] +
            output_tokens * self.pricing["output_token"] +
            api_calls * self.pricing["api_call"] +
            tool_uses * self.pricing["tool_use"]
        )

        self.total_cost += cost

        self.record(cost, {
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "api_calls": api_calls,
            "tool_uses": tool_uses,
            "cumulative_cost": self.total_cost
        })

        return cost


class StepCountMetric(Metric[int]):
    """Measures number of steps/iterations to complete task."""

    def __init__(self):
        super().__init__("step_count", MetricCategory.EFFICIENCY)

    def compute(self, steps: int, **kwargs) -> int:
        """Record step count."""
        self.record(steps)
        return steps


class ToolEfficiencyMetric(Metric[float]):
    """Measures efficiency of tool usage."""

    def __init__(self):
        super().__init__("tool_efficiency", MetricCategory.EFFICIENCY)

    def compute(
        self,
        successful_tool_calls: int,
        total_tool_calls: int,
        redundant_calls: int = 0,
        **kwargs
    ) -> float:
        """Calculate tool usage efficiency."""
        if total_tool_calls == 0:
            return 1.0

        # Success rate
        success_rate = successful_tool_calls / total_tool_calls

        # Redundancy penalty
        redundancy_penalty = redundant_calls / total_tool_calls

        efficiency = success_rate * (1 - redundancy_penalty * 0.5)

        self.record(efficiency, {
            "successful": successful_tool_calls,
            "total": total_tool_calls,
            "redundant": redundant_calls
        })

        return efficiency


class EfficiencyEvaluator:
    """Comprehensive efficiency evaluator."""

    def __init__(self, pricing: Dict[str, float] = None):
        self.latency_metric = LatencyMetric()
        self.token_metric = TokenUsageMetric()
        self.cost_metric = CostMetric(pricing)
        self.step_metric = StepCountMetric()
        self.tool_metric = ToolEfficiencyMetric()

    def evaluate(
        self,
        execution_trace: Dict[str, Any]
    ) -> Dict[str, float]:
        """Evaluate efficiency from execution trace."""

        results = {}

        # Latency
        if "start_time" in execution_trace and "end_time" in execution_trace:
            results["latency_ms"] = self.latency_metric.compute(
                execution_trace["start_time"],
                execution_trace["end_time"]
            )

        # Token usage
        if "input_tokens" in execution_trace:
            results["total_tokens"] = self.token_metric.compute(
                execution_trace.get("input_tokens", 0),
                execution_trace.get("output_tokens", 0)
            )

        # Cost
        results["cost"] = self.cost_metric.compute(
            input_tokens=execution_trace.get("input_tokens", 0),
            output_tokens=execution_trace.get("output_tokens", 0),
            api_calls=execution_trace.get("api_calls", 0),
            tool_uses=execution_trace.get("tool_uses", 0)
        )

        # Steps
        if "steps" in execution_trace:
            results["steps"] = self.step_metric.compute(
                len(execution_trace["steps"])
            )

        # Tool efficiency
        tool_calls = execution_trace.get("tool_calls", [])
        if tool_calls:
            successful = sum(1 for t in tool_calls if t.get("success", True))
            redundant = sum(1 for t in tool_calls if t.get("redundant", False))
            results["tool_efficiency"] = self.tool_metric.compute(
                successful, len(tool_calls), redundant
            )

        return results
```

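With the default pricing table in `CostMetric`, the per-request cost works out as follows. The token and call counts below are invented for illustration, and real per-token prices vary by provider and model:

```python
pricing = {
    "input_token": 0.00001,   # $0.01 per 1K input tokens
    "output_token": 0.00003,  # $0.03 per 1K output tokens
    "api_call": 0.001,
    "tool_use": 0.0001,
}

cost = (
    1500 * pricing["input_token"]    # $0.0150
    + 500 * pricing["output_token"]  # $0.0150
    + 3 * pricing["api_call"]        # $0.0030
    + 2 * pricing["tool_use"]        # $0.0002
)
# cost == 0.0332, about 3.3 cents for this hypothetical request
```

Note that at these rates the fixed per-call charges are small next to the token charges, so for most agents cost optimization starts with trimming prompt and output length rather than reducing call counts.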
Safety Metrics

Safety metrics ensure the agent operates within acceptable boundaries and doesn't cause harm. These metrics are critical for production systems:

```python
"""
Safety metrics for agent evaluation.
"""

class GuardrailTriggerMetric(Metric[int]):
    """Counts guardrail activations."""

    def __init__(self):
        super().__init__("guardrail_triggers", MetricCategory.SAFETY)
        self.triggers_by_type: Dict[str, int] = {}

    def compute(
        self,
        guardrail_type: str,
        severity: str = "medium",
        **kwargs
    ) -> int:
        """Record a guardrail trigger."""
        self.triggers_by_type[guardrail_type] = (
            self.triggers_by_type.get(guardrail_type, 0) + 1
        )

        total = sum(self.triggers_by_type.values())

        self.record(total, {
            "guardrail_type": guardrail_type,
            "severity": severity,
            "by_type": self.triggers_by_type.copy()
        })

        return total


class ViolationRateMetric(Metric[float]):
    """Measures rate of policy violations."""

    def __init__(self):
        super().__init__("violation_rate", MetricCategory.SAFETY)
        self.violations = 0
        self.total_actions = 0

    def compute(
        self,
        is_violation: bool,
        violation_type: Optional[str] = None,
        **kwargs
    ) -> float:
        """Record action and calculate violation rate."""
        self.total_actions += 1
        if is_violation:
            self.violations += 1

        rate = self.violations / self.total_actions

        self.record(rate, {
            "is_violation": is_violation,
            "violation_type": violation_type,
            "total_violations": self.violations,
            "total_actions": self.total_actions
        })

        return rate


class BoundaryRespectMetric(Metric[float]):
    """Measures how well the agent respects operational boundaries."""

    def __init__(self, boundaries: Dict[str, Any] = None):
        super().__init__("boundary_respect", MetricCategory.SAFETY)
        self.boundaries = boundaries or {}

    def compute(
        self,
        action: str,
        parameters: Dict[str, Any],
        **kwargs
    ) -> float:
        """Check if action respects boundaries."""

        violations = []

        # Check action boundaries
        if "allowed_actions" in self.boundaries:
            if action not in self.boundaries["allowed_actions"]:
                violations.append(f"Action {action} not in allowed list")

        if "blocked_actions" in self.boundaries:
            if action in self.boundaries["blocked_actions"]:
                violations.append(f"Action {action} is blocked")

        # Check parameter boundaries
        for param, value in parameters.items():
            param_bounds = self.boundaries.get(f"{action}.{param}", {})

            if "max" in param_bounds and value > param_bounds["max"]:
                violations.append(f"{param} exceeds max: {value} > {param_bounds['max']}")

            if "min" in param_bounds and value < param_bounds["min"]:
                violations.append(f"{param} below min: {value} < {param_bounds['min']}")

        score = 1.0 if not violations else 0.0

        self.record(score, {
            "action": action,
            "violations": violations
        })

        return score


class HarmfulOutputMetric(Metric[float]):
    """Detects potentially harmful outputs."""

    def __init__(self, classifier=None):
        super().__init__("harmful_output", MetricCategory.SAFETY)
        self.classifier = classifier
        self.harm_categories = [
            "violence", "hate_speech", "self_harm",
            "sexual_content", "dangerous_instructions"
        ]

    def compute(
        self,
        output: str,
        **kwargs
    ) -> float:
        """Check output for harmful content."""

        if self.classifier:
            # Use ML classifier
            results = self.classifier.classify(output)
            harm_scores = {
                cat: results.get(cat, 0.0)
                for cat in self.harm_categories
            }
            max_harm = max(harm_scores.values()) if harm_scores else 0.0
        else:
            # Simple keyword matching (not recommended for production)
            harm_keywords = {
                "violence": ["kill", "attack", "weapon", "bomb"],
                "hate_speech": ["hate", "slur"],  # Simplified
                "dangerous_instructions": ["hack", "exploit", "bypass"]
            }

            output_lower = output.lower()
            harm_scores = {}

            for category, keywords in harm_keywords.items():
                matches = sum(1 for k in keywords if k in output_lower)
                harm_scores[category] = min(1.0, matches * 0.25)

            max_harm = max(harm_scores.values()) if harm_scores else 0.0

        # Score is 1.0 for safe, 0.0 for harmful
        safety_score = 1.0 - max_harm

        self.record(safety_score, {
            "harm_scores": harm_scores,
            "max_harm_category": max(harm_scores, key=harm_scores.get) if harm_scores else None
        })

        return safety_score


class SafetyEvaluator:
    """Comprehensive safety evaluator."""

    def __init__(self, boundaries: Dict[str, Any] = None, classifier=None):
        self.guardrail_metric = GuardrailTriggerMetric()
        self.violation_metric = ViolationRateMetric()
        self.boundary_metric = BoundaryRespectMetric(boundaries)
        self.harm_metric = HarmfulOutputMetric(classifier)

    def evaluate(
        self,
        action: str,
        parameters: Dict[str, Any],
        output: str,
        guardrail_triggers: List[Dict[str, str]] = None
    ) -> Dict[str, float]:
        """Comprehensive safety evaluation."""

        results = {}

        # Record guardrail triggers
186        for trigger in (guardrail_triggers or []):
187            self.guardrail_metric.compute(
188                trigger["type"],
189                trigger.get("severity", "medium")
190            )
191        results["guardrail_triggers"] = len(guardrail_triggers or [])
192
193        # Check boundary respect
194        results["boundary_respect"] = self.boundary_metric.compute(
195            action, parameters
196        )
197
198        # Check for harmful output
199        results["safety_score"] = self.harm_metric.compute(output)
200
201        # Record violation if any safety check failed
202        is_violation = (
203            results["boundary_respect"] < 1.0 or
204            results["safety_score"] < 0.9
205        )
206        results["violation_rate"] = self.violation_metric.compute(is_violation)
207
208        return results
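
The SafetyEvaluator above composes the four safety metrics defined earlier in this section. As a self-contained illustration of the same pattern, the sketch below folds a boundary check and keyword-based harm scoring into a single safety score. The function names, keyword list, and thresholds here are illustrative, not part of the framework above:

```python
from typing import Any, Dict, List


def check_boundaries(
    action: str,
    parameters: Dict[str, Any],
    boundaries: Dict[str, Any],
) -> List[str]:
    """Return a list of boundary violations (empty means safe)."""
    violations = []
    allowed = boundaries.get("allowed_actions")
    if allowed is not None and action not in allowed:
        violations.append(f"Action {action} not in allowed list")
    for param, value in parameters.items():
        bounds = boundaries.get(f"{action}.{param}", {})
        if "max" in bounds and value > bounds["max"]:
            violations.append(f"{param} exceeds max: {value}")
        if "min" in bounds and value < bounds["min"]:
            violations.append(f"{param} below min: {value}")
    return violations


def keyword_harm_score(output: str) -> float:
    """Crude keyword-based harm score in [0, 1]; 0 means no matches."""
    keywords = ["attack", "exploit", "bypass"]
    text = output.lower()
    matches = sum(1 for k in keywords if k in text)
    return min(1.0, matches * 0.25)


def safety_score(
    action: str,
    parameters: Dict[str, Any],
    output: str,
    boundaries: Dict[str, Any],
) -> float:
    """1.0 when no violations and no harm signal; either failure lowers it."""
    boundary_ok = 1.0 if not check_boundaries(action, parameters, boundaries) else 0.0
    harm = keyword_harm_score(output)
    # Either failure mode drives the combined score down
    return min(boundary_ok, 1.0 - harm)


boundaries = {
    "allowed_actions": ["transfer"],
    "transfer.amount": {"max": 1000},
}
print(safety_score("transfer", {"amount": 500}, "Transfer complete.", boundaries))   # 1.0
print(safety_score("transfer", {"amount": 5000}, "Transfer complete.", boundaries))  # 0.0
```

Taking the minimum of the component scores mirrors the evaluator's behavior: a single failed check is enough to flag the run, regardless of how well the other checks scored.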

Metrics Implementation

Now let's bring all metrics together into a unified evaluation framework:

🐍python
1"""
2Unified metrics collection and reporting.
3"""
4
5from dataclasses import dataclass, field
6from typing import Any, Dict, List, Optional
7from datetime import datetime
8import json
9
10
11@dataclass
12class MetricsReport:
13    """Complete metrics report for an evaluation run."""
14    run_id: str
15    agent_id: str
16    timestamp: datetime
17    task_completion: Dict[str, float]
18    quality: Dict[str, float]
19    efficiency: Dict[str, float]
20    safety: Dict[str, float]
21    metadata: Dict[str, Any] = field(default_factory=dict)
22
23    @property
24    def overall_score(self) -> float:
25        """Calculate weighted overall score."""
26        weights = {
27            "task_completion": 0.35,
28            "quality": 0.30,
29            "efficiency": 0.15,
30            "safety": 0.20
31        }
32
33        scores = {
34            "task_completion": self._avg(self.task_completion),
35            "quality": self._avg(self.quality),
36            "efficiency": self._avg(self.efficiency),
37            "safety": self._avg(self.safety)
38        }
39
40        return sum(
41            weights[cat] * scores[cat]
42            for cat in weights
43        )
44
45    def _avg(self, metrics: Dict[str, float]) -> float:
46        if not metrics:
47            return 0.0
48        return sum(metrics.values()) / len(metrics)
49
50    def to_dict(self) -> Dict[str, Any]:
51        return {
52            "run_id": self.run_id,
53            "agent_id": self.agent_id,
54            "timestamp": self.timestamp.isoformat(),
55            "overall_score": self.overall_score,
56            "task_completion": self.task_completion,
57            "quality": self.quality,
58            "efficiency": self.efficiency,
59            "safety": self.safety,
60            "metadata": self.metadata
61        }
62
63    def to_json(self) -> str:
64        return json.dumps(self.to_dict(), indent=2)
65
66
67class MetricsCollector:
68    """Collects and aggregates metrics across evaluations."""
69
70    def __init__(
71        self,
72        agent_id: str,
73        llm_evaluator=None,
74        safety_boundaries: Dict[str, Any] = None,
75        harm_classifier=None
76    ):
77        self.agent_id = agent_id
78
79        # Initialize evaluators
80        self.task_evaluator = TaskCompletionEvaluator(llm_evaluator)
81        self.quality_metrics = {
82            "accuracy": AccuracyMetric(),
83            "relevance": RelevanceMetric(llm_evaluator),
84            "coherence": CoherenceMetric(llm_evaluator),
85            "completeness": CompletenessMetric(llm_evaluator)
86        }
87        self.efficiency_evaluator = EfficiencyEvaluator()
88        self.safety_evaluator = SafetyEvaluator(
89            safety_boundaries, harm_classifier
90        )
91
92        # Store reports
93        self.reports: List[MetricsReport] = []
94
95    def evaluate(
96        self,
97        task: Dict[str, Any],
98        result: Dict[str, Any],
99        execution_trace: Dict[str, Any]
100    ) -> MetricsReport:
101        """Run complete evaluation and generate report."""
102
103        run_id = f"{self.agent_id}_{datetime.utcnow().strftime('%Y%m%d_%H%M%S')}"
104
105        # Task completion
106        task_result = self.task_evaluator.evaluate(task, result)
107        task_completion = {
108            "success": float(task_result.overall_success),
109            "completion_rate": task_result.completion_rate,
110            "goal_achievement": task_result.goal_achievement,
111            "composite_score": task_result.composite_score
112        }
113
114        # Quality metrics
115        quality = {}
116        output = result.get("output", "")
117        query = task.get("goal", "")
118
119        quality["relevance"] = self.quality_metrics["relevance"].compute(
120            query, output
121        )
122        quality["coherence"] = self.quality_metrics["coherence"].compute(
123            output
124        )
125        quality["completeness"] = self.quality_metrics["completeness"].compute(
126            query, output
127        )
128
129        # Efficiency metrics
130        efficiency = self.efficiency_evaluator.evaluate(execution_trace)
131
132        # Normalize efficiency scores (lower is better for some metrics)
133        if "latency_ms" in efficiency:
134            # Convert latency to score (5000 ms or slower scores 0)
135            efficiency["latency_score"] = max(
136                0, 1 - efficiency["latency_ms"] / 5000
137            )
138
139        if "cost" in efficiency:
140            # Convert cost to score ($0.10 or more scores 0)
141            efficiency["cost_score"] = max(
142                0, 1 - efficiency["cost"] / 0.10
143            )
144
145        # Safety metrics
146        last_action = (execution_trace.get("actions") or [{}])[-1]
147        safety = self.safety_evaluator.evaluate(
148            action=last_action.get("name", ""),
149            parameters=last_action.get("parameters", {}),
150            output=output,
151            guardrail_triggers=execution_trace.get("guardrail_triggers", [])
152        )
153
154        # Create report
155        report = MetricsReport(
156            run_id=run_id,
157            agent_id=self.agent_id,
158            timestamp=datetime.utcnow(),
159            task_completion=task_completion,
160            quality=quality,
161            efficiency=efficiency,
162            safety=safety,
163            metadata={
164                "task_id": task.get("id"),
165                "task_type": task.get("type"),
166                "failure_reasons": task_result.failure_reasons
167            }
168        )
169
170        self.reports.append(report)
171        return report
172
173    def get_aggregate_report(self) -> Dict[str, Any]:
174        """Get aggregated metrics across all evaluations."""
175        if not self.reports:
176            return {}
177
178        def aggregate_category(reports: List[MetricsReport], accessor) -> Dict[str, float]:
179            all_metrics = [accessor(r) for r in reports]
180            all_keys = set()
181            for m in all_metrics:
182                all_keys.update(m.keys())
183
184            return {
185                key: sum(m.get(key, 0) for m in all_metrics) / len(all_metrics)
186                for key in all_keys
187            }
188
189        return {
190            "total_evaluations": len(self.reports),
191            "average_overall_score": sum(r.overall_score for r in self.reports) / len(self.reports),
192            "task_completion": aggregate_category(self.reports, lambda r: r.task_completion),
193            "quality": aggregate_category(self.reports, lambda r: r.quality),
194            "efficiency": aggregate_category(self.reports, lambda r: r.efficiency),
195            "safety": aggregate_category(self.reports, lambda r: r.safety)
196        }
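
To see how the overall score weighting plays out, here is a standalone walk-through with illustrative category averages (the weights match MetricsReport.overall_score above; the score values are invented for the example):

```python
# Illustrative per-category averages for one evaluation run
scores = {
    "task_completion": 0.9,
    "quality": 0.8,
    "efficiency": 0.7,
    "safety": 1.0,
}

# Same weights as MetricsReport.overall_score
weights = {
    "task_completion": 0.35,
    "quality": 0.30,
    "efficiency": 0.15,
    "safety": 0.20,
}

# Weighted sum: 0.35*0.9 + 0.30*0.8 + 0.15*0.7 + 0.20*1.0
overall = sum(weights[cat] * scores[cat] for cat in weights)
print(round(overall, 3))  # 0.86
```

Because the weights sum to 1.0, the overall score stays in [0, 1] whenever each category average does, which keeps scores comparable across runs and agents.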

Summary

This section introduced the key metrics categories for evaluating AI agents:

| Category | Key Metrics | Purpose |
|---|---|---|
| Task Completion | Success rate, goal achievement, partial completion | Did it work? |
| Quality | Accuracy, relevance, coherence, completeness | How good is it? |
| Efficiency | Latency, tokens, cost, steps | How efficient is it? |
| Safety | Guardrail triggers, violations, harm detection | Is it safe? |
Key Takeaways: Effective evaluation requires a comprehensive metrics framework that captures task success, output quality, resource efficiency, and safety. Use automated metrics where possible, but incorporate LLM-based evaluation for semantic assessments.

In the next section, we'll explore how to design benchmarks that effectively test agent capabilities across different scenarios.