What is agent evaluation?

Measuring LLM agent performance through trajectory analysis, benchmark suites, LLM-as-judge, failure taxonomies, and production monitoring strategies.

Agent Evaluation

A Production Scenario

Your agent has been running in production for three weeks. Users seem happy - support tickets are down, positive feedback is up. Then a product manager asks: "Is the agent getting better or worse?" You have no idea. You have output logs but no systematic way to measure quality. You have anecdotes but no metrics.

Then a model update gets deployed automatically upstream. The agent starts giving subtly different answers. Some are better. Some are worse. You cannot tell which because you have no evaluation baseline. A week later, a major customer complains about a factual error. You trace it to the model update but cannot prove causality because you have no pre-update quality measurements.

You build a simple evaluation: 50 test questions with expected answers, and a checker that compares the agent's answers to them by exact string matching. You run it and get a 74% pass rate. The model update ships and the pass rate drops to 67%. You roll back and it returns to 74%. You now have a working regression test, but you quickly discover its limits. Exact string matching fails for questions with multiple valid answers. It says nothing about whether the agent's reasoning was sound, only whether the final string matched. And it does not measure efficiency: an agent that calls 15 tools to answer a question that needs 2 is technically "correct" but practically terrible.

Agent evaluation is hard because agents are non-deterministic, multi-step, and produce open-ended outputs. This lesson covers the right way to measure whether an agent works - not just whether it produces the right final answer, but whether it reasons correctly, uses tools efficiently, avoids harm, and degrades gracefully when it fails.

Why Agent Evaluation Is Hard

Non-Determinism

Temperature and sampling mean the same agent run twice on the same input may produce different outputs. A single evaluation run tells you almost nothing. You need statistical aggregation across multiple runs, which multiplies evaluation cost.
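
To make "statistical aggregation" concrete, here is a minimal sketch that puts a confidence interval around a pass rate measured over repeated runs. It uses the Wilson score interval, one common choice among several; the function name and the example counts are illustrative and not part of the lesson's harness.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Same observed pass rate (two thirds), very different certainty.
low_3, high_3 = wilson_interval(passes=2, runs=3)
low_30, high_30 = wilson_interval(passes=20, runs=30)
print(f"2/3 passes:   [{low_3:.2f}, {high_3:.2f}]")
print(f"20/30 passes: [{low_30:.2f}, {high_30:.2f}]")
```

The interval for three runs is far wider than the one for thirty, which is the practical cost of non-determinism: a tighter estimate requires more runs.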

Multi-Step Processes

A single agent run might involve 10 tool calls and 10 LLM calls. The final answer might be correct for the wrong reasons - the agent hallucinated in step 3 but happened to recover by step 7. Or it might be wrong because of a single bad tool selection in step 5 of an otherwise perfect trajectory. Evaluating just the final answer misses the structure of the reasoning.

Open-Ended Outputs

There are often many correct ways to answer a question. "Summarize the key risks in this document" has no single ground truth answer. Exact-match evaluation fails completely. Even semantic similarity metrics struggle when the space of valid answers is large.
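
For short factual answers that have only a handful of valid phrasings, one partial workaround is to normalize the text and accept a match against any of several references. The sketch below assumes that setup; `normalize_answer`, `any_of_match`, and the sample answers are hypothetical names and data, and the approach does nothing for genuinely open-ended tasks such as summarization.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def any_of_match(actual: str, accepted: list[str]) -> float:
    """1.0 if the normalized answer equals any accepted reference, else 0.0."""
    normalized = normalize_answer(actual)
    return 1.0 if any(normalized == normalize_answer(ref) for ref in accepted) else 0.0


# Illustrative references for a question with several valid surface forms.
accepted_answers = ["New York City", "NYC", "New York"]
score_valid = any_of_match("The New York City.", accepted_answers)
score_invalid = any_of_match("Boston", accepted_answers)
print(score_valid, score_invalid)
```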

The Distribution Gap

Your evaluation set is never perfectly representative of production queries. Agents often fail on edge cases you did not think to include in your eval set. And the evaluation set degrades over time: if you use it for tuning, your prompts and tool descriptions end up fitted to the eval data rather than to the underlying task.

Evaluation Dimensions

A complete agent evaluation measures multiple dimensions independently:

1. Task Completion Rate

Did the agent produce a correct, useful final answer? This is the most basic metric but requires careful operationalization.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    task_id: str
    input: str
    expected_output: str
    evaluation_fn: str  # "exact", "contains", or "semantic"
    metadata: dict | None = None


def exact_match(actual: str, expected: str) -> float:
    """Binary: 1.0 if strings match, 0.0 otherwise."""
    return 1.0 if actual.strip().lower() == expected.strip().lower() else 0.0


def contains_match(actual: str, expected: str) -> float:
    """1.0 if actual contains expected (good for fact extraction)."""
    return 1.0 if expected.lower() in actual.lower() else 0.0


def semantic_similarity(actual: str, expected: str) -> float:
    """
    Compare embeddings of actual and expected outputs.
    Returns cosine similarity [0, 1].
    """
    try:
        from sentence_transformers import SentenceTransformer
        import numpy as np
        model = SentenceTransformer('all-MiniLM-L6-v2')
        embeddings = model.encode([actual, expected])
        cosine = np.dot(embeddings[0], embeddings[1]) / (
            np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
        )
        return float(max(0.0, cosine))
    except ImportError:
        # Fallback: simple word overlap
        actual_words = set(actual.lower().split())
        expected_words = set(expected.lower().split())
        if not expected_words:
            return 1.0
        return len(actual_words & expected_words) / len(expected_words)

2. Tool Efficiency

How many tool calls did the agent need? An efficient agent uses the minimum number of tool calls required. Track both the count and the type of calls (unnecessary duplicate calls, calls that returned errors, calls to irrelevant tools).

@dataclass
class ToolCallRecord:
    tool_name: str
    args: dict
    result: str
    success: bool
    latency_ms: float


@dataclass
class AgentTrajectory:
    task_id: str
    input: str
    final_answer: str
    tool_calls: list[ToolCallRecord]
    total_input_tokens: int
    total_output_tokens: int
    wall_time_seconds: float
    succeeded: bool


def compute_tool_efficiency_metrics(trajectory: AgentTrajectory) -> dict:
    """Compute efficiency metrics from an agent trajectory."""
    calls = trajectory.tool_calls
    return {
        "total_tool_calls": len(calls),
        "successful_calls": sum(1 for c in calls if c.success),
        "failed_calls": sum(1 for c in calls if not c.success),
        "unique_tools_used": len(set(c.tool_name for c in calls)),
        "duplicate_calls": _count_duplicate_calls(calls),
        "avg_tool_latency_ms": (
            sum(c.latency_ms for c in calls) / len(calls)
            if calls else 0
        ),
    }


def _count_duplicate_calls(calls: list[ToolCallRecord]) -> int:
    """Count calls that repeated the exact same tool+args combination."""
    seen = set()
    duplicates = 0
    for call in calls:
        key = (call.tool_name, str(sorted(call.args.items())))
        if key in seen:
            duplicates += 1
        seen.add(key)
    return duplicates

3. LLM-as-Judge for Reasoning Quality

For evaluating whether the agent's reasoning was correct - not just the final answer - use a more capable LLM as a judge.

import anthropic
import json

judge_client = anthropic.Anthropic()


def llm_judge_trajectory(
    task: str,
    trajectory: AgentTrajectory,
    rubric: list[str] | None = None
) -> dict:
    """
    Use an LLM to judge the quality of an agent's trajectory.
    """
    default_rubric = [
        "Did the agent correctly identify what information was needed?",
        "Were tool selections appropriate for the subtasks?",
        "Did the agent correctly interpret tool results?",
        "Was the final answer accurate and complete?",
        "Did the agent avoid unnecessary tool calls?",
        "Was the reasoning chain logically coherent?"
    ]
    rubric = rubric or default_rubric
    rubric_text = "\n".join(f"{i+1}. {item}" for i, item in enumerate(rubric))

    # Format trajectory for the judge
    trajectory_text = "\n".join([
        f"Tool call {i+1}: {c.tool_name}({c.args}) → {c.result[:200]}{'...' if len(c.result) > 200 else ''}"
        for i, c in enumerate(trajectory.tool_calls)
    ])

    prompt = f"""You are an expert evaluator for AI agent systems.

Task the agent was given:
{task}

Agent's execution (trajectory):
{trajectory_text}

Agent's final answer:
{trajectory.final_answer}

Evaluate the agent on these dimensions:
{rubric_text}

Output a JSON object:
{{
  "scores": {{
    "tool_selection": 1-10,
    "reasoning_quality": 1-10,
    "answer_accuracy": 1-10,
    "efficiency": 1-10
  }},
  "overall_score": 1-10,
  "strengths": ["what the agent did well"],
  "weaknesses": ["what the agent did poorly"],
  "failure_category": "none | planning_failure | tool_failure | reasoning_error | infinite_loop | hallucination",
  "explanation": "brief summary of your evaluation"
}}"""

    response = judge_client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )

    import re
    json_match = re.search(r'\{.*\}', response.content[0].text, re.DOTALL)
    if json_match:
        try:
            return json.loads(json_match.group())
        except json.JSONDecodeError:
            pass

    return {
        "overall_score": 5,
        "explanation": response.content[0].text,
        "failure_category": "unknown"
    }

Building an Evaluation Harness

A production evaluation harness orchestrates the full eval pipeline: running the agent on test cases, collecting trajectories, scoring them, and generating reports.

from datetime import datetime


class AgentEvaluationHarness:
    """
    Complete evaluation harness for LLM agents.
    """

    def __init__(
        self,
        agent_fn: Callable,  # Function: (task: str) -> AgentTrajectory
        eval_cases: list[EvalCase],
        n_runs_per_case: int = 3,  # Run each case N times for stability
    ):
        self.agent_fn = agent_fn
        self.eval_cases = eval_cases
        self.n_runs_per_case = n_runs_per_case
        self.results: list[dict] = []

    def run_single_case(self, case: EvalCase) -> dict:
        """Run a single eval case N times and aggregate results."""
        run_results = []

        for run_idx in range(self.n_runs_per_case):
            trajectory = self.agent_fn(case.input)

            # Score task completion
            if case.evaluation_fn == "exact":
                completion_score = exact_match(trajectory.final_answer, case.expected_output)
            elif case.evaluation_fn == "contains":
                completion_score = contains_match(trajectory.final_answer, case.expected_output)
            elif case.evaluation_fn == "semantic":
                completion_score = semantic_similarity(trajectory.final_answer, case.expected_output)
            else:
                # Fail loudly: a silent default score would skew the completion rate
                raise ValueError(f"Unknown evaluation_fn: {case.evaluation_fn!r}")

            # Score trajectory quality via LLM judge
            judge_result = llm_judge_trajectory(case.input, trajectory)

            # Compute efficiency metrics
            efficiency = compute_tool_efficiency_metrics(trajectory)

            run_results.append({
                "run_idx": run_idx,
                "completion_score": completion_score,
                "judge_score": judge_result.get("overall_score", 0) / 10,
                "efficiency": efficiency,
                "failure_category": judge_result.get("failure_category", "none"),
                "total_tokens": trajectory.total_input_tokens + trajectory.total_output_tokens,
                "wall_time": trajectory.wall_time_seconds,
                "succeeded": trajectory.succeeded,
            })

        # Aggregate across runs
        return {
            "task_id": case.task_id,
            "input": case.input,
            "n_runs": self.n_runs_per_case,
            "completion_rate": sum(r["completion_score"] for r in run_results) / self.n_runs_per_case,
            "avg_judge_score": sum(r["judge_score"] for r in run_results) / self.n_runs_per_case,
            "avg_tool_calls": sum(r["efficiency"]["total_tool_calls"] for r in run_results) / self.n_runs_per_case,
            "avg_tokens": sum(r["total_tokens"] for r in run_results) / self.n_runs_per_case,
            "avg_wall_time": sum(r["wall_time"] for r in run_results) / self.n_runs_per_case,
            "success_rate": sum(1 for r in run_results if r["succeeded"]) / self.n_runs_per_case,
            "failure_categories": [r["failure_category"] for r in run_results],
        }

    def run_full_evaluation(self) -> dict:
        """Run evaluation on all cases and generate a report."""
        print(f"Running evaluation: {len(self.eval_cases)} cases x {self.n_runs_per_case} runs")
        start_time = datetime.utcnow()

        for i, case in enumerate(self.eval_cases):
            print(f"  [{i+1}/{len(self.eval_cases)}] {case.task_id}")
            result = self.run_single_case(case)
            self.results.append(result)

        return self.generate_report(start_time)

    def generate_report(self, start_time: datetime) -> dict:
        """Generate a summary report from all evaluation results."""
        if not self.results:
            return {"error": "No results to report"}

        n = len(self.results)
        avg = lambda key: sum(r[key] for r in self.results) / n

        # Aggregate failure categories
        all_failures = []
        for r in self.results:
            all_failures.extend(r.get("failure_categories", []))
        failure_counts = {}
        for f in all_failures:
            failure_counts[f] = failure_counts.get(f, 0) + 1

        # Find worst-performing cases
        worst_cases = sorted(self.results, key=lambda r: r["completion_rate"])[:5]

        report = {
            "evaluation_time": (datetime.utcnow() - start_time).total_seconds(),
            "n_cases": n,
            "summary": {
                "completion_rate": avg("completion_rate"),
                "avg_judge_score": avg("avg_judge_score"),
                "success_rate": avg("success_rate"),
                "avg_tool_calls": avg("avg_tool_calls"),
                "avg_tokens_per_task": avg("avg_tokens"),
                "avg_latency_seconds": avg("avg_wall_time"),
            },
            "failure_distribution": failure_counts,
            "worst_performing_cases": [
                {"task_id": r["task_id"], "completion_rate": r["completion_rate"]}
                for r in worst_cases
            ],
            "timestamp": datetime.utcnow().isoformat()
        }

        # Print summary
        print("\n=== EVALUATION REPORT ===")
        print(f"Tasks evaluated: {report['n_cases']}")
        print(f"Completion rate: {report['summary']['completion_rate']:.1%}")
        print(f"Avg judge score: {report['summary']['avg_judge_score']:.2f}/1.0")
        print(f"Avg tool calls: {report['summary']['avg_tool_calls']:.1f}")
        print(f"Failure distribution: {failure_counts}")

        return report

Standard Benchmarks

WebArena

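WebArena (Zhou et al., 2023) evaluates agents on realistic web navigation tasks. Agents use a browser to complete tasks on self-hosted, fully functional replicas of common kinds of sites (an online store, a Reddit-style forum, GitLab, a content management system), with reference resources such as Wikipedia available alongside. Tasks include: "Find the cheapest product in the Tools category" or "Post a comment on the issue about memory leaks."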

What it measures: Web navigation ability, multi-step planning, information extraction from HTML.

Score format: Task completion rate (binary per task). Baseline GPT-4 agent: ~14%. State-of-the-art (2024): ~35-40%.

AgentBench

AgentBench (Liu et al., 2023) covers 8 distinct environments: operating system (shell), database, knowledge graph, digital card game, lateral thinking puzzles, house-holding, web shopping, and web browsing.

What it measures: Generalization across domains, tool use in different contexts.

GAIA

GAIA (Mialon et al., 2023) from Meta and HuggingFace tests agents on real-world questions that require multi-step reasoning, tool use, and web browsing. Questions are deliberately hard for current models - requiring 5-7 steps and multiple information sources.

Score format: Accuracy on 3 difficulty levels. Human baseline: 92%. GPT-4 with plugins baseline: ~15% overall.

SWE-Bench

SWE-Bench (Jimenez et al., 2023) evaluates code agents on real GitHub issues from popular Python repos. The agent must read the issue, explore the codebase, write a patch, and pass the test suite.

What it measures: Real-world software engineering ability.

Score: % of issues resolved. Best models (2024): ~20-40%.

Golden Trajectory Testing

For regression testing, save the trajectories from known-good agent runs and compare future runs against them.

import hashlib

class GoldenTrajectoryStore:
    """Store and compare agent trajectories for regression testing."""

    def __init__(self):
        self._golden_trajectories: dict[str, dict] = {}

    def save_golden(self, task_id: str, trajectory: AgentTrajectory) -> None:
        """Save a known-good trajectory as the golden reference."""
        self._golden_trajectories[task_id] = {
            "tool_sequence": [c.tool_name for c in trajectory.tool_calls],
            "tool_count": len(trajectory.tool_calls),
            "final_answer": trajectory.final_answer,
            "answer_hash": hashlib.md5(trajectory.final_answer.encode()).hexdigest()
        }

    def compare_against_golden(
        self,
        task_id: str,
        new_trajectory: AgentTrajectory
    ) -> dict:
        """Compare a new trajectory against the golden reference."""
        if task_id not in self._golden_trajectories:
            return {"status": "no_golden", "message": "No golden trajectory stored"}

        golden = self._golden_trajectories[task_id]
        new_seq = [c.tool_name for c in new_trajectory.tool_calls]

        regressions = []

        # Check if tool count changed significantly
        tool_count_delta = len(new_trajectory.tool_calls) - golden["tool_count"]
        if abs(tool_count_delta) > 2:
            regressions.append(
                f"Tool count changed: {golden['tool_count']} → {len(new_trajectory.tool_calls)}"
            )

        # Check if tool sequence diverges significantly
        if new_seq[:3] != golden["tool_sequence"][:3]:
            regressions.append(
                f"Tool sequence diverged: expected {golden['tool_sequence'][:3]}, "
                f"got {new_seq[:3]}"
            )

        # Check final answer (semantic, not exact)
        answer_sim = semantic_similarity(new_trajectory.final_answer, golden["final_answer"])
        if answer_sim < 0.7:
            regressions.append(f"Final answer significantly different (similarity: {answer_sim:.2f})")

        return {
            "status": "regression" if regressions else "pass",
            "regressions": regressions,
            "tool_count_delta": tool_count_delta,
            "answer_similarity": answer_sim
        }
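
The prefix check above flags any difference in the first three tool names, including a harmless retry. A softer alternative is to score the whole sequence with a similarity ratio and alert below a threshold. The sketch below uses the standard library's `difflib.SequenceMatcher`; the function name and the tool names are hypothetical, and any threshold would need tuning on your own trajectories.

```python
from difflib import SequenceMatcher


def tool_sequence_similarity(golden: list[str], new: list[str]) -> float:
    """Similarity in [0, 1] between two tool-name sequences (1.0 means identical)."""
    if not golden and not new:
        return 1.0
    return SequenceMatcher(None, golden, new).ratio()


# Hypothetical tool names, for illustration only.
golden_seq = ["search", "fetch_page", "summarize"]
with_retry = ["search", "search", "fetch_page", "summarize"]
different_plan = ["query_db", "send_email"]

sim_retry = tool_sequence_similarity(golden_seq, with_retry)
sim_different = tool_sequence_similarity(golden_seq, different_plan)
print(f"retry: {sim_retry:.2f}, different plan: {sim_different:.2f}")
```

A retry of the first step keeps the score high, while a completely different plan scores zero, so the ratio separates noise from real plan changes better than a strict prefix comparison.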

Failure Mode Taxonomy

Understanding how agents fail helps you prioritize fixes. The main failure categories:

Category	Description	Example	Fix
Planning failure	Agent misunderstands the task from the start	Researches wrong topic	Improve task parsing
Tool selection failure	Calls the wrong tool	Uses web search instead of database query	Improve tool descriptions
Tool execution failure	Tool call fails (error, timeout)	API returns 500	Better error handling
Reasoning error	Misinterprets tool result	Reads "price is $50/month" as $50/year	Clearer result formatting
Infinite loop	Repeats same action	Calls search 10x with same query	Loop detection
Hallucination	Invents information not in tool results	Cites a non-existent study	Grounding checks
Premature termination	Stops before completing the task	Answers with partial info	Completion checkers
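
As one example of turning a row of this table into an automated check, the sketch below flags the "Infinite loop" category by counting repeated tool+args combinations. To stay self-contained it takes plain `(tool_name, args)` tuples rather than the `ToolCallRecord` objects used elsewhere; the function name, the `max_repeats` default, and the sample calls are assumptions for illustration.

```python
import json
from collections import Counter


def detect_repeated_calls(
    calls: list[tuple[str, dict]], max_repeats: int = 3
) -> list[str]:
    """Describe every tool+args combination seen more than max_repeats times."""
    counts = Counter(
        # Serialize args with sorted keys so key order does not hide a repeat.
        (name, json.dumps(args, sort_keys=True, default=str))
        for name, args in calls
    )
    return [
        f"{name} called {count}x with args {args_json}"
        for (name, args_json), count in counts.items()
        if count > max_repeats
    ]


# Hypothetical trajectory: the same search issued five times, then one fetch.
calls = [("search", {"query": "agent evals"})] * 5 + [
    ("fetch", {"url": "https://example.com"})
]
loop_warnings = detect_repeated_calls(calls)
print(loop_warnings)
```

The same check can run inside the agent loop to stop a run early, or offline over logged trajectories to count how often loops occur.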

Production Monitoring

Evaluation is not just pre-launch testing. You need continuous evaluation in production.

import random

class ProductionEvaluationMonitor:
    """
    Continuously evaluate a sample of production agent runs.
    """

    def __init__(
        self,
        sample_rate: float = 0.05,  # Evaluate 5% of prod runs
        eval_backlog_size: int = 1000
    ):
        self.sample_rate = sample_rate
        self.eval_backlog_size = eval_backlog_size
        self._recent_scores: list[float] = []
        self._alert_threshold: float = 0.7

    def should_evaluate(self) -> bool:
        """Decide whether to evaluate this production run."""
        return random.random() < self.sample_rate

    def record_production_run(
        self,
        trajectory: AgentTrajectory,
        task: str
    ) -> None:
        """Optionally evaluate and record a production run."""
        if not self.should_evaluate():
            return

        judge_result = llm_judge_trajectory(task, trajectory)
        score = judge_result.get("overall_score", 5) / 10

        self._recent_scores.append(score)
        if len(self._recent_scores) > self.eval_backlog_size:
            self._recent_scores.pop(0)

        # Alert if quality drops
        if len(self._recent_scores) >= 20:
            recent_avg = sum(self._recent_scores[-20:]) / 20
            if recent_avg < self._alert_threshold:
                self._trigger_alert(recent_avg)

    def _trigger_alert(self, score: float) -> None:
        """Send alert when agent quality drops."""
        import logging
        logging.critical(
            f"AGENT QUALITY ALERT: Rolling avg score {score:.2f} "
            f"below threshold {self._alert_threshold:.2f}. "
            f"Check for model regression or broken tools."
        )

    def get_quality_dashboard(self) -> dict:
        """Get current quality metrics for a monitoring dashboard."""
        if not self._recent_scores:
            return {"status": "no_data"}

        recent = self._recent_scores[-100:]
        return {
            "avg_score_last_100": sum(recent) / len(recent),
            "min_score_last_100": min(recent),
            "p25_score": sorted(recent)[len(recent) // 4],
            "total_evaluated": len(self._recent_scores),
        }

Common Mistakes

:::danger Evaluating Only the Final Answer
An agent that gets the right answer by hallucinating in the middle is not a reliable agent - it happened to be right this time. Always evaluate the trajectory, not just the output. If you can only evaluate outputs, at least check that all claimed facts appear in tool results.
:::
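
A very rough version of the "claimed facts appear in tool results" check can be sketched for numeric claims only. The heuristic below is an assumption of this example, not an established method: it extracts numerals from the final answer and reports any that occur in none of the tool results. It cannot tell whether a number is used in the right context (for instance per month versus per year), so treat a clean result as weak evidence.

```python
import re

# Matches integers and simple decimals or thousands groups, e.g. 50, 4.5, 1,200.
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*")


def ungrounded_numbers(final_answer: str, tool_results: list[str]) -> list[str]:
    """Return numbers in the answer that appear in none of the tool results."""
    evidence: set[str] = set()
    for result in tool_results:
        evidence.update(NUMBER_PATTERN.findall(result))
    return [n for n in NUMBER_PATTERN.findall(final_answer) if n not in evidence]


# Hypothetical tool output and two candidate answers.
tool_results = ["Plan: Pro. Price: $50/month. Seats: 10."]
grounded = ungrounded_numbers(
    "The Pro plan costs $50 per month for 10 seats.", tool_results
)
suspicious = ungrounded_numbers("The Pro plan costs $45 per month.", tool_results)
print(grounded, suspicious)
```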

:::danger Evaluating With Only One Run
Due to sampling randomness, a single run evaluation has high variance. Always run at least 3-5 evaluations per case and report the mean and standard deviation. An agent with 80% average and 30% std dev is very different from one with 80% average and 5% std dev.
:::
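
Reporting the spread alongside the mean takes only a few lines with the standard library. The sketch below is illustrative: `summarize_runs` and the two score lists are made up to show two cases with the same mean and very different stability.

```python
from statistics import mean, stdev


def summarize_runs(scores: list[float]) -> dict:
    """Mean and sample standard deviation of per-run scores for one eval case."""
    if not scores:
        raise ValueError("need at least one score")
    return {
        "n_runs": len(scores),
        "mean": mean(scores),
        # stdev needs at least two data points; report 0.0 for a single run.
        "std_dev": stdev(scores) if len(scores) > 1 else 0.0,
    }


# Same mean, different reliability: one agent is steady, the other fails outright once.
stable = summarize_runs([0.8, 0.8, 0.8, 0.8, 0.8])
flaky = summarize_runs([1.0, 1.0, 1.0, 1.0, 0.0])
print(stable)
print(flaky)
```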

:::warning Using the Evaluation Set for Training/Tuning
As soon as you tune your agent based on eval results, the evaluation set is contaminated. Separate your data into development eval (for tuning) and holdout eval (for final reporting). Never tune on holdout.
:::
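
One way to keep the two sets from drifting into each other is to assign each case by hashing its ID, so the assignment never changes when cases are added or reordered. This is a sketch under that assumption; `assign_split`, the 20% holdout fraction, and the task IDs are hypothetical.

```python
import hashlib


def assign_split(task_id: str, holdout_fraction: float = 0.2) -> str:
    """Deterministically assign a task to 'dev' or 'holdout' from its ID."""
    digest = hashlib.sha256(task_id.encode("utf-8")).hexdigest()
    # First 8 hex digits give an integer below 2**32; scale it into [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "holdout" if bucket < holdout_fraction else "dev"


# Illustrative task IDs.
task_ids = [f"task-{i:03d}" for i in range(200)]
splits = {tid: assign_split(tid) for tid in task_ids}
n_holdout = sum(1 for s in splits.values() if s == "holdout")
print(f"dev={len(task_ids) - n_holdout}, holdout={n_holdout}")
```

Because the split depends only on the ID, nobody needs to maintain a separate list of holdout cases, and a case can never silently move from holdout to dev.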

:::warning Not Measuring Latency and Cost
A task completion rate of 90% is meaningless without knowing the cost to achieve it. 90% completion at $5 per task is very different from 90% at $0.10. Always report cost and latency alongside quality metrics.
:::
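
Cost can be derived from the token counts a trajectory already records. The sketch below shows the arithmetic; the per-million-token prices are placeholders chosen for round numbers, not any provider's actual pricing, and both function names are hypothetical.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Dollar cost of one run given token counts and per-million-token prices."""
    return (
        input_tokens * usd_per_million_input
        + output_tokens * usd_per_million_output
    ) / 1_000_000


def cost_per_success(total_cost_usd: float, n_successes: int) -> float:
    """Total spend divided by completed tasks; infinite if nothing succeeded."""
    if n_successes == 0:
        return float("inf")
    return total_cost_usd / n_successes


# Placeholder prices (assumed): $3 per million input tokens, $15 per million output.
run_cost = estimate_cost_usd(40_000, 2_000, 3.0, 15.0)
# 100 such runs with 90 successes.
per_success = cost_per_success(run_cost * 100, n_successes=90)
print(f"per run: ${run_cost:.2f}, per successful task: ${per_success:.3f}")
```

Cost per successful task is often the more honest number to report, since failed runs still consume tokens.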

Interview Q&A

Q: Why is agent evaluation harder than evaluating static LLM outputs?

Three main reasons: (1) Non-determinism - agents use sampling, so the same agent run twice may produce different outputs. You need multiple runs per case. (2) Multi-step structure - the final answer may be correct for wrong reasons. You need to evaluate the trajectory, not just the output. (3) Open-ended outputs - there are many valid ways to answer most agent tasks. Exact-match evaluation fails almost completely. You need semantic comparison or LLM-as-judge.

Q: What is trajectory evaluation and how does it differ from final answer evaluation?

Final answer evaluation checks only whether the last output is correct. Trajectory evaluation checks whether each step in the agent's reasoning was valid: did it correctly identify what information it needed? Did it pick the right tool? Did it correctly interpret the tool result? Trajectory evaluation catches errors in the middle of the reasoning chain that happen to cancel out by the end, and identifies which specific step caused a failure.

Q: How do you use LLM-as-judge for agent evaluation?

You pass the full agent trajectory (task, tool calls with inputs/outputs, and final answer) to a strong LLM and ask it to score the agent's performance against a rubric. The rubric should include: tool selection appropriateness, reasoning quality, answer accuracy, and efficiency. Use a different model as judge than the model being evaluated to avoid self-serving biases. Always get at least 2-3 independent judge scores and average them.

Q: What are the main agent benchmark suites and what do they measure?

WebArena measures web navigation on realistic, self-hosted websites. AgentBench covers 8 distinct environments (operating system, databases, web, etc.) to measure generalization. GAIA measures multi-step reasoning on real-world questions requiring tool use. SWE-Bench measures software engineering ability on real GitHub issues. Each tests a different capability profile. A strong general-purpose agent should score well on all of them; specialized agents may excel on one.

Q: How do you build a regression testing system for agents?

Record "golden trajectories" - the full tool call sequences and final answers from known-good agent runs on your test cases. When a model update ships or a prompt changes, re-run those test cases and compare the new trajectories against the golden ones. Look for: significant changes in tool call count, tool sequence divergence in the first few steps (a good proxy for plan quality), and answer semantic similarity below a threshold. Alert on regressions before they reach production.

