Challenges of Evaluating Agents

The Question Nobody Wants to Answer

You deploy an agent. It handles customer support queries, autonomously researches topics, writes and executes code, or manages complex workflows. Users are using it. The product is live.

Is it good?

How would you know?

There is no assert agent.output == expected for complex tasks. You cannot write a unit test that checks whether a customer support agent resolved a complaint well. You cannot write an assertion that confirms an autonomous researcher found the most relevant papers. The gap between "the agent produced output" and "the agent did its job well" is enormous - and most teams never close it.

This lesson is about understanding exactly why that gap exists, why it is harder to close than it seems, and how to build the engineering discipline to close it anyway.

Why This Exists

The Model Evaluation Illusion

Before agents, evaluation was almost easy. You had a model. You had a labeled test set. You computed accuracy, F1, BLEU, or ROUGE and declared victory. These metrics had real problems - BLEU scores for translation, for example, often correlate poorly with human judgment, especially at the sentence level - but the framework was clear: inputs go in, outputs come out, outputs get compared to labels, and a number emerges.

This framework breaks completely for agents. The assumptions it rests on - that there is one correct output, that inputs and outputs are independent, that evaluation is cheap - all fail simultaneously.

The field is still catching up. As of 2025, there is no agreed standard for agent evaluation the way ImageNet was a standard for image classification. Different labs use different benchmarks. Different companies use different internal metrics. This is not a solved problem. It is an active research and engineering challenge.

Understanding why it is hard is the first step toward solving it.

The Multiple Valid Paths Problem

Ask an agent: "Find the three most relevant papers on retrieval-augmented generation from 2023 and summarize their key contributions."

What is the correct answer?

There are hundreds of relevant papers. Any three of the top-tier ones would be defensible. The summaries could emphasize different aspects. The agent could use web search, or a paper API, or a vector database. It could find papers in any order. It could produce the summaries in any format.

Every one of these choices produces a different trajectory and a different output. Almost none of them is clearly wrong. Almost all of them are defensible. There is no single ground truth.

This is the multiple valid paths problem. For any non-trivial task, there exist dozens to thousands of correct trajectories and outputs. An evaluation metric that penalizes the agent for not matching one specific reference output will measure nothing useful. An evaluation metric that accepts any output accepts nonsense too.

The solution is not to find the "right" answer and compare against it. The solution is to define the properties a good answer must have - and evaluate those properties instead.
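As a minimal sketch of property-based evaluation for the paper-summary task above: instead of comparing against one reference answer, each check tests one property a good answer should have. The answer structure (a list of dicts with `title`, `year`, and `summary` keys) and all function names here are assumptions made for illustration.

```python
from typing import Callable

# Hypothetical answer structure: a list of {"title", "year", "summary"} dicts.
Answer = list[dict]


def has_three_papers(answer: Answer) -> bool:
    """The task asked for exactly three papers."""
    return len(answer) == 3


def all_from_2023(answer: Answer) -> bool:
    """Every paper must be from the requested year."""
    return all(paper.get("year") == 2023 for paper in answer)


def summaries_non_trivial(answer: Answer, min_words: int = 10) -> bool:
    """Each summary should contain at least `min_words` words."""
    return all(len(paper.get("summary", "").split()) >= min_words for paper in answer)


PROPERTY_CHECKS: dict[str, Callable[[Answer], bool]] = {
    "three_papers": has_three_papers,
    "from_2023": all_from_2023,
    "non_trivial_summaries": summaries_non_trivial,
}


def evaluate_properties(answer: Answer) -> dict[str, bool]:
    """Run every property check and report pass/fail per property."""
    return {name: check(answer) for name, check in PROPERTY_CHECKS.items()}


def property_score(answer: Answer) -> float:
    """Fraction of properties satisfied (0.0 to 1.0)."""
    results = evaluate_properties(answer)
    return sum(results.values()) / len(results)
```

These deterministic checks only cover properties that can be verified mechanically. Properties such as relevance or faithfulness of a summary usually need a rubric-based judge, human or LLM.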

Why Agent Evaluation Differs From Model Evaluation

Dimension	Model Evaluation	Agent Evaluation
Output space	Fixed (class, token, embedding)	Open-ended (text, actions, tool calls)
Ground truth	Single label per example	Multiple valid outputs
Dependencies	Each example is independent	Steps within a trajectory depend on previous steps
Cost	Cheap (forward pass)	Expensive (multiple API calls, real tools)
Determinism	Mostly deterministic	Non-deterministic (temperature, tool results)
Failure modes	Misclassification, hallucination	Compound errors, infinite loops, wrong tool use
Time	Milliseconds	Seconds to minutes

The most important row is dependencies. When you evaluate a sentiment classifier, each example is independent. When you evaluate an agent, step 3's input depends on the result of step 2, which depends on step 1. An error in step 2 changes everything that follows. You cannot evaluate each step independently and combine the scores - the interactions are what matter.

The Compound Error Problem

Imagine a research agent with 10 steps:

1. Parse the user's question
2. Decompose into sub-questions
3. Search for relevant papers
4. Filter by relevance
5. Extract key claims from each paper
6. Cross-reference claims
7. Identify conflicts or gaps
8. Synthesize an answer
9. Format the response
10. Generate citations

If step 3 returns slightly wrong papers - papers that are adjacent to the topic but not quite right - every downstream step operates on flawed inputs. By step 8, the synthesis is fundamentally wrong. The final output may look polished and confident. It will be wrong.

The agent "succeeded" at every step in isolation. The compound effect of a small error in step 3 produced a failure.

This is the compound error problem. Attribution is nearly impossible: which step failed? The step that returned slightly wrong papers? Or the step that did not detect this and flag it? Or the step that should have searched for more papers to cross-check?

Practical implication: evaluate at the trajectory level, not the step level. The question is not "was step 3 good?" but "did the trajectory as a whole lead to a good outcome?"
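A rough back-of-the-envelope model shows why small per-step error rates matter so much. It assumes, as a simplification that the text above does not make, that each step succeeds independently with the same probability and that any single step failure ruins the trajectory. The function name is hypothetical.

```python
def trajectory_success_probability(per_step_success: float, num_steps: int) -> float:
    """P(all steps succeed), assuming independent steps with equal success rates."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_success ** num_steps


if __name__ == "__main__":
    for p in (0.99, 0.95, 0.90):
        prob = trajectory_success_probability(p, 10)
        print(f"per-step success {p:.2f}: 10-step success {prob:.3f}")
```

Under these assumptions, a step that is 95% reliable yields a 10-step trajectory that succeeds only about 60% of the time. Real agents violate the independence assumption in both directions (errors can be caught and repaired, or can cascade), so treat this as intuition rather than a prediction.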

Latent Failures: When Metrics Lie

The most dangerous failure mode in agent evaluation is latent failure: the agent scores well on your metrics but fails in practice.

A customer support agent has a "task completion rate" metric. A task is marked complete when the agent sends a final response. The metric shows 94% completion. Excellent.

What the metric does not capture: 31% of completed tasks required the user to follow up with a correction. 15% of completions involved the agent apologizing and escalating to a human. 8% involved the agent providing incorrect information confidently.

The metric was optimized, not the behavior.

Latent failures arise from a mismatch between your proxy metric and the true goal. Common examples:

Completion rate measured as "agent produced a final output" - but outputs can be wrong
Accuracy measured on a curated test set - but the test set does not represent production queries
User satisfaction measured via thumbs up/down - but users often do not rate, and those who do are not representative
Latency measured as time to first token - but the agent may still be thinking for 30 more seconds

The lesson: always question what your metric actually measures. Work backward from the true goal. If your true goal is "users successfully accomplish their task," every proxy metric is one step removed from that truth.
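One way to surface a latent failure is to compute the proxy metric next to a stricter metric that sits closer to the true goal, then look at the gap. The sketch below assumes a hypothetical log record with three boolean fields; real logs would need their own way of detecting corrections and escalations.

```python
from dataclasses import dataclass


@dataclass
class SupportInteraction:
    """Hypothetical log record for one support conversation."""
    completed: bool        # agent sent a final response
    user_corrected: bool   # user had to follow up with a correction
    escalated: bool        # conversation was handed to a human


def _require_logs(logs: list[SupportInteraction]) -> None:
    if not logs:
        raise ValueError("logs must not be empty")


def completion_rate(logs: list[SupportInteraction]) -> float:
    """The proxy metric: fraction of interactions marked complete."""
    _require_logs(logs)
    return sum(i.completed for i in logs) / len(logs)


def clean_resolution_rate(logs: list[SupportInteraction]) -> float:
    """Closer to the true goal: completed with no correction and no escalation."""
    _require_logs(logs)
    clean = sum(i.completed and not i.user_corrected and not i.escalated for i in logs)
    return clean / len(logs)


def proxy_gap(logs: list[SupportInteraction]) -> float:
    """How much the proxy overstates quality. A large gap signals latent failure."""
    return completion_rate(logs) - clean_resolution_rate(logs)
```

Tracking the gap over time is often more informative than either number alone: a rising completion rate with a widening gap means the metric is being optimized, not the behavior.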

Distribution Shift: Eval Set vs. Production

You build your eval set by collecting examples in February. You ship in April. By July, users are asking questions you never anticipated. Your eval score is still 87%. Your production quality has dropped to 71%.

Distribution shift is when the data your agent sees in production differs from the data you evaluated on. For agents, this is particularly vicious because:

User behavior adapts to the agent - they learn what it is good at and route other tasks elsewhere
The world changes - new events, new information, new tools
The agent's deployment context changes - new integrations, new user segments, new use cases
Adversarial users appear - people who probe the agent's weaknesses

Production distribution will always differ from your eval distribution. The correct response is not to build a bigger eval set. It is to continuously collect production traces, sample them, and use them to update your eval set. The eval set must evolve with the agent's environment.
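The refresh loop described above can be sketched as a small function that samples unseen production queries into the eval set. The trace schema (dicts with a `query` key), the `source` tag, and the function name are assumptions for illustration; a real pipeline would also need labeling or rubric annotation for the new tasks.

```python
import random


def refresh_eval_set(
    eval_set: list[dict],
    production_traces: list[dict],
    sample_size: int,
    seed: int = 0,
) -> list[dict]:
    """Return a new eval set extended with a random sample of unseen production queries.

    Both inputs are lists of dicts with a "query" key (a hypothetical schema).
    The input lists are not modified.
    """
    seen = {task["query"] for task in eval_set}

    # Keep only traces whose query is not already covered, de-duplicating as we go.
    candidates = []
    for trace in production_traces:
        if trace["query"] not in seen:
            seen.add(trace["query"])
            candidates.append(trace)

    rng = random.Random(seed)  # seeded so that a refresh is reproducible
    sampled = rng.sample(candidates, k=min(sample_size, len(candidates)))
    new_tasks = [{"query": t["query"], "source": "production"} for t in sampled]
    return eval_set + new_tasks
```

Uniform sampling is the simplest choice. Oversampling traces that ended in user corrections or escalations would bias the set toward known failures, which may be preferable for regression testing.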

The Cost of Evaluation

Running a static image classifier on 10,000 test examples costs fractions of a cent. Running an agent on 10,000 test examples can cost thousands of dollars and take days of wall-clock time.

A single agent run might involve:

5–30 LLM API calls ($0.01–$0.50 each for capable models)
5–20 tool calls (web search, code execution, database queries)
30 seconds to 5 minutes of wall-clock time

At scale:

100 eval examples: $10–$100, 1–2 hours
1,000 eval examples: $100–$1,000, 10–20 hours
10,000 eval examples: unaffordable for most teams

This forces hard tradeoffs. You cannot exhaustively evaluate every agent change. You must be strategic: small regression test suites for fast iteration, larger comprehensive suites for release decisions, and production monitoring as a continuous signal.
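The arithmetic behind these estimates can be sketched as a small calculator. The function and parameter names are hypothetical, and the model deliberately ignores tool-call costs and retries, so it gives a lower bound rather than a budget.

```python
def estimate_suite_cost(
    num_examples: int,
    llm_calls_per_run: int,
    cost_per_llm_call_usd: float,
    seconds_per_run: float,
    parallelism: int = 1,
) -> dict[str, float]:
    """Estimate LLM spend and wall-clock time for one eval suite run.

    Assumes every run makes the same number of calls at the same price
    and that runs parallelize perfectly across `parallelism` workers.
    """
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    total_cost = num_examples * llm_calls_per_run * cost_per_llm_call_usd
    total_hours = num_examples * seconds_per_run / parallelism / 3600
    return {"cost_usd": total_cost, "hours": total_hours}


if __name__ == "__main__":
    # Mid-range assumptions: 10 calls per run at $0.05 each, 60 seconds per run.
    estimate = estimate_suite_cost(100, 10, 0.05, 60)
    print(f"${estimate['cost_usd']:.2f}, {estimate['hours']:.2f} hours")
```

With these mid-range inputs, 100 examples come to about $50 and a little under two hours when run sequentially, which falls inside the ranges listed above.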

Human Evaluation at Scale: Slow, Expensive, Necessary

Human evaluation is the gold standard - ultimately, agents serve humans, so human judgment is the closest proxy to real quality. But human evaluation has severe practical limitations:

Speed: A human annotator evaluates 10–50 agent outputs per hour. An LLM evaluates thousands per minute.
Cost: Human annotation costs $0.10–$10 per example, depending on complexity and expertise required.
Consistency: Different annotators disagree. Even the same annotator disagrees with themselves 10–20% of the time on subjective tasks.
Scalability: You cannot run human eval on every commit, every experiment, every model version.

Human evaluation is not optional - but it cannot be your primary evaluation signal. The practical approach:

Use human evaluation to calibrate automated metrics
Use human evaluation to validate before major releases
Use human evaluation to investigate anomalies caught by automated metrics
Use LLM-as-judge for continuous automated evaluation
Use production monitoring for real-time signal
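Calibrating an automated judge against human labels usually starts with measuring how often the two agree. The sketch below computes raw agreement and Cohen's kappa for binary pass/fail labels; the function name and the choice of binary labels are assumptions, and graded rubrics would need a weighted variant.

```python
def agreement_and_kappa(human: list[int], judge: list[int]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two lists of binary labels (0 or 1)."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Agreement expected by chance, from each rater's marginal pass rate.
    human_pos = sum(human) / n
    judge_pos = sum(judge) / n
    expected = human_pos * judge_pos + (1 - human_pos) * (1 - judge_pos)

    if expected == 1.0:  # both raters gave the same constant label throughout
        return observed, 1.0
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa
```

Kappa corrects raw agreement for the agreement two raters would reach by chance, which matters when most examples pass. The same function can be applied to two human annotators to measure the disagreement mentioned under Consistency.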

The Evaluation Pyramid

Think of agent evaluation as a pyramid - five layers, each with different characteristics:

Unit tests (base of pyramid): fast, cheap, many. Test individual functions - tool parsers, prompt formatters, response extractors. Run on every commit. Catches implementation bugs.

Integration tests: end-to-end agent runs on a small curated test set (20–50 examples). Real tools, real API calls, real outputs. Run per PR or daily. Catches regressions in agent behavior.

LLM-as-judge: automated quality scoring of agent trajectories and outputs. Scalable - can evaluate hundreds of examples overnight. Run weekly or before releases. Catches quality regressions that unit/integration tests miss.

Human evaluation: periodic structured evaluation by human annotators. Small sample (50–200 examples). Run quarterly or before major releases. The quality ground truth.

Production monitoring: continuous measurement of agent behavior in production. The ultimate real-world signal. Catches failure modes that evals never anticipated.

Each layer informs the layers below. A production anomaly becomes a new integration test. A pattern of human eval failures becomes a new LLM-judge rubric criterion.

Dimensions to Evaluate

No single metric captures agent quality. You need a multi-dimensional evaluation:

Dimension	Question	Metric Type
Task completion	Did the agent accomplish the goal?	Binary or graded (0–1)
Output quality	Is the final output correct and useful?	Rubric-based score
Trajectory efficiency	Was the path to the answer reasonable?	Steps taken / minimum steps
Tool precision	Did the agent use the right tools correctly?	Precision/recall
Error recovery	When something went wrong, did the agent recover?	Steps to recovery
Safety	Did the agent avoid harmful outputs or actions?	Binary policy checks
Latency	How long did it take?	Wall-clock time, p50/p95/p99
Cost	How much did it cost?	Total tokens and API costs

Not all dimensions matter equally for every use case. A coding assistant prioritizes task completion and code correctness. A customer support agent prioritizes user satisfaction and safety. A research agent prioritizes output quality and information coverage.

Define your evaluation dimensions before building your agent, not after.
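One way to make the prioritization explicit is a per-use-case weight profile over the dimensions. The profiles, weights, and names below are invented for illustration, not recommended values.

```python
# Hypothetical per-use-case weights; each profile sums to 1.0.
WEIGHT_PROFILES: dict[str, dict[str, float]] = {
    "coding_assistant": {
        "task_completion": 0.5, "output_quality": 0.3, "safety": 0.1, "latency": 0.1,
    },
    "support_agent": {
        "task_completion": 0.3, "output_quality": 0.2, "safety": 0.4, "latency": 0.1,
    },
}


def weighted_score(dimension_scores: dict[str, float], profile: str) -> float:
    """Combine per-dimension scores (each in [0, 1]) using a use-case weight profile."""
    weights = WEIGHT_PROFILES[profile]
    missing = set(weights) - set(dimension_scores)
    if missing:
        raise KeyError(f"missing dimension scores: {sorted(missing)}")
    return sum(weights[d] * dimension_scores[d] for d in weights)
```

A single aggregate can hide a failure on one dimension, so it is worth reporting the per-dimension scores alongside it. Safety in particular is often better treated as a hard gate than as one weighted term.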

Building an Evaluation Mindset

The hardest part of agent evaluation is not technical - it is the discipline to define success before you build, and to measure it honestly.

Ask yourself these questions before shipping any agent:

What does success look like? Write a specific, measurable definition. "The agent helps users" is not a definition. "The agent produces a correct, complete answer to 80% of queries, as judged by a domain expert on a 1–5 scale, with 4 or above counted as success" is a definition.
What evidence would convince you the agent works? If you ran 100 random production queries and a domain expert rated them, what score would be acceptable? If you cannot answer this, you do not have a success criterion.
What failure modes are you afraid of? Write them down. Then build tests that specifically probe for them.
How will you detect regressions? If the agent gets worse after a model update or prompt change, how will you know within 24 hours?
How will your evaluation signal improve over time? A static eval set degrades. What is the process for keeping it fresh?
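A success definition like the one in the first question can be written down as an executable check before the agent is built. This sketch encodes "80% of queries rated 4 or above on a 1–5 scale"; the function name and return shape are assumptions.

```python
def meets_success_criterion(
    expert_ratings: list[int],
    success_threshold: int = 4,
    target_rate: float = 0.80,
) -> tuple[bool, float]:
    """Check expert ratings on a 1-5 scale against a pre-declared success criterion.

    Returns (criterion_met, observed_success_rate).
    """
    if not expert_ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in expert_ratings):
        raise ValueError("ratings must be on a 1-5 scale")
    successes = sum(r >= success_threshold for r in expert_ratings)
    rate = successes / len(expert_ratings)
    return rate >= target_rate, rate
```

Committing the threshold and target rate to version control before looking at results makes it harder to move the goalposts afterwards.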

Full Python: Evaluation Harness Skeleton

"""
Agent evaluation harness with configurable metrics.
Run any agent against a task set, collect trajectories, compute scores.
"""

import json
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional

import anthropic

client = anthropic.Anthropic()


# ── Data models ────────────────────────────────────────────────────────────────

class TaskStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    TIMEOUT = "timeout"


@dataclass
class EvalTask:
    """A single evaluation task."""
    task_id: str
    query: str
    expected_output: Optional[str] = None      # may be None for open-ended tasks
    expected_tool_calls: Optional[list] = None  # optional: tools that should be used
    difficulty: str = "medium"                  # easy / medium / hard
    category: str = "general"
    metadata: dict = field(default_factory=dict)


@dataclass
class TrajectoryStep:
    """One step in an agent trajectory."""
    step_number: int
    step_type: str          # "llm_call", "tool_call", "observation"
    input_tokens: int
    output_tokens: int
    tool_name: Optional[str] = None
    tool_input: Optional[dict] = None
    tool_output: Optional[str] = None
    llm_response: Optional[str] = None
    duration_ms: float = 0.0
    error: Optional[str] = None


@dataclass
class EvalResult:
    """Result of evaluating one task."""
    task_id: str
    run_id: str
    status: TaskStatus
    final_output: Optional[str]
    trajectory: list[TrajectoryStep]
    total_input_tokens: int
    total_output_tokens: int
    total_duration_ms: float
    total_tool_calls: int
    error_count: int
    metrics: dict[str, float] = field(default_factory=dict)
    judge_scores: dict[str, float] = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)

    @property
    def total_steps(self) -> int:
        return len(self.trajectory)

    @property
    def estimated_cost_usd(self) -> float:
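        # Illustrative per-million-token prices (an assumption, not tied to the
        # model used in TracingAgent); replace with current rates for your model.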
        input_cost = (self.total_input_tokens / 1_000_000) * 3.0
        output_cost = (self.total_output_tokens / 1_000_000) * 15.0
        return input_cost + output_cost


# ── Agent wrapper ──────────────────────────────────────────────────────────────

class TracingAgent:
    """
    Wraps any agent function with trajectory recording.
    The agent function receives a query and returns (final_output, trajectory).
    """

    def __init__(self, tools: list[dict], system_prompt: str, max_steps: int = 20):
        self.tools = tools
        self.system_prompt = system_prompt
        self.max_steps = max_steps

    def run(self, query: str) -> tuple[Optional[str], list[TrajectoryStep]]:
        """Run the agent, returning (final_output, trajectory)."""
        messages = [{"role": "user", "content": query}]
        trajectory: list[TrajectoryStep] = []
        step_number = 0

        while step_number < self.max_steps:
            step_number += 1
            t0 = time.time()

            # LLM call
            response = client.messages.create(
                model="claude-opus-4-6",
                max_tokens=4096,
                system=self.system_prompt,
                tools=self.tools,
                messages=messages,
            )

            duration_ms = (time.time() - t0) * 1000
            step = TrajectoryStep(
                step_number=step_number,
                step_type="llm_call",
                input_tokens=response.usage.input_tokens,
                output_tokens=response.usage.output_tokens,
                llm_response=self._extract_text(response),
                duration_ms=duration_ms,
            )
            trajectory.append(step)

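            # Stop on any reason other than a tool request (end_turn, but also
            # max_tokens or stop_sequence). Otherwise the loop would resend the
            # same messages unchanged until max_steps is reached.
            if response.stop_reason != "tool_use":
                final_text = self._extract_text(response)
                return final_text, trajectory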

            # Process tool calls
            if response.stop_reason == "tool_use":
                tool_calls = [b for b in response.content if b.type == "tool_use"]

                if not tool_calls:
                    return self._extract_text(response), trajectory

                # Add assistant turn
                messages.append({"role": "assistant", "content": response.content})

                # Execute tools
                tool_results = []
                for tc in tool_calls:
                    step_number += 1
                    t0 = time.time()
                    result, error = self._execute_tool(tc.name, tc.input)
                    tool_duration = (time.time() - t0) * 1000

                    tool_step = TrajectoryStep(
                        step_number=step_number,
                        step_type="tool_call",
                        input_tokens=0,
                        output_tokens=0,
                        tool_name=tc.name,
                        tool_input=tc.input,
                        tool_output=result,
                        duration_ms=tool_duration,
                        error=error,
                    )
                    trajectory.append(tool_step)

                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": tc.id,
                        "content": result if result else f"Error: {error}",
                    })

                messages.append({"role": "user", "content": tool_results})

        return None, trajectory  # Hit max_steps

    def _extract_text(self, response) -> Optional[str]:
        for block in response.content:
            if hasattr(block, "text"):
                return block.text
        return None

    def _execute_tool(self, name: str, tool_input: dict) -> tuple[Optional[str], Optional[str]]:
        """Execute a tool. Override in subclasses with real implementations."""
        return f"[Mock result for tool={name} input={tool_input}]", None


# ── Evaluation harness ─────────────────────────────────────────────────────────

class EvaluationHarness:
    """
    Runs an agent against a task set and computes configurable metrics.
    """

    def __init__(
        self,
        agent: TracingAgent,
        metrics: list[Callable[[EvalTask, EvalResult], float]],
        timeout_seconds: float = 120.0,
    ):
        self.agent = agent
        self.metrics = metrics
        self.timeout_seconds = timeout_seconds

    def run_single(self, task: EvalTask) -> EvalResult:
        """Run one task and return the result."""
        run_id = str(uuid.uuid4())[:8]
        t0 = time.time()

        try:
            final_output, trajectory = self._run_with_timeout(task.query)
            status = TaskStatus.COMPLETED if final_output else TaskStatus.FAILED
        except TimeoutError:
            final_output = None
            trajectory = []
            status = TaskStatus.TIMEOUT
        except Exception as exc:
            final_output = None
            trajectory = []
            status = TaskStatus.FAILED
            print(f"Task {task.task_id} failed: {exc}")

        total_duration = (time.time() - t0) * 1000

        result = EvalResult(
            task_id=task.task_id,
            run_id=run_id,
            status=status,
            final_output=final_output,
            trajectory=trajectory,
            total_input_tokens=sum(s.input_tokens for s in trajectory),
            total_output_tokens=sum(s.output_tokens for s in trajectory),
            total_duration_ms=total_duration,
            total_tool_calls=sum(1 for s in trajectory if s.step_type == "tool_call"),
            error_count=sum(1 for s in trajectory if s.error is not None),
        )

        # Compute metrics
        for metric_fn in self.metrics:
            try:
                score = metric_fn(task, result)
                result.metrics[metric_fn.__name__] = score
            except Exception as e:
                print(f"Metric {metric_fn.__name__} failed: {e}")
                result.metrics[metric_fn.__name__] = -1.0

        return result

    def run_suite(self, tasks: list[EvalTask], max_workers: int = 4) -> list[EvalResult]:
        """Run all tasks sequentially. max_workers is accepted but not yet used."""
        results = []
        for i, task in enumerate(tasks):
            print(f"Running task {i+1}/{len(tasks)}: {task.task_id}")
            result = self.run_single(task)
            results.append(result)
            print(f"  Status: {result.status.value} | "
                  f"Steps: {result.total_steps} | "
                  f"Cost: ${result.estimated_cost_usd:.4f}")
        return results

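    def _run_with_timeout(
        self, query: str
    ) -> tuple[Optional[str], list[TrajectoryStep]]:
        """Run agent with a wall-clock timeout.

        Uses SIGALRM, so this works only on Unix and only in the main thread.
        """
        import math
        import signal

        def handler(signum, frame):
            raise TimeoutError()

        signal.signal(signal.SIGALRM, handler)
        # alarm() takes whole seconds and alarm(0) cancels the timer,
        # so round up and never pass 0.
        signal.alarm(max(1, math.ceil(self.timeout_seconds)))
        try:
            return self.agent.run(query)
        finally:
            # Cancel the pending alarm on every exit path, including
            # exceptions other than TimeoutError.
            signal.alarm(0)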

    def summarize(self, results: list[EvalResult]) -> dict:
        """Compute aggregate statistics across all results."""
        if not results:
            return {}

        completed = [r for r in results if r.status == TaskStatus.COMPLETED]
        completion_rate = len(completed) / len(results)

        all_metrics = {}
        for metric_name in (results[0].metrics or {}).keys():
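            # Skip the -1.0 sentinel recorded when a metric function raised,
            # so failed metric computations do not distort the aggregates.
            values = [
                r.metrics[metric_name]
                for r in results
                if r.metrics.get(metric_name, -1.0) >= 0.0
            ]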
            if values:
                all_metrics[metric_name] = {
                    "mean": sum(values) / len(values),
                    "min": min(values),
                    "max": max(values),
                }

        return {
            "total_tasks": len(results),
            "completed": len(completed),
            "failed": sum(1 for r in results if r.status == TaskStatus.FAILED),
            "timeout": sum(1 for r in results if r.status == TaskStatus.TIMEOUT),
            "completion_rate": completion_rate,
            "avg_steps": sum(r.total_steps for r in results) / len(results),
            "avg_duration_ms": sum(r.total_duration_ms for r in results) / len(results),
            "avg_cost_usd": sum(r.estimated_cost_usd for r in results) / len(results),
            "total_cost_usd": sum(r.estimated_cost_usd for r in results),
            "metrics": all_metrics,
        }


# ── Built-in metric functions ──────────────────────────────────────────────────

def completion_rate_metric(task: EvalTask, result: EvalResult) -> float:
    """1.0 if task completed, 0.0 if failed/timeout."""
    return 1.0 if result.status == TaskStatus.COMPLETED else 0.0


def tool_error_rate_metric(task: EvalTask, result: EvalResult) -> float:
    """Fraction of tool calls that resulted in errors. Lower is better."""
    if result.total_tool_calls == 0:
        return 0.0
    return result.error_count / result.total_tool_calls


def step_count_metric(task: EvalTask, result: EvalResult) -> float:
    """Normalized step count. 1.0 is perfect (1 step), 0.0 is many steps."""
    if not result.trajectory:
        return 0.0
    # Normalize linearly: 1 step = 1.0, 21 or more steps = 0.0
    return max(0.0, 1.0 - (result.total_steps - 1) / 20.0)


def cost_efficiency_metric(task: EvalTask, result: EvalResult) -> float:
    """Cost efficiency. $0 = 1.0, $1.00+ = 0.0."""
    cost = result.estimated_cost_usd
    return max(0.0, 1.0 - cost)


# ── Demo ───────────────────────────────────────────────────────────────────────

def demo():
    # Minimal tools for demo
    tools = [
        {
            "name": "web_search",
            "description": "Search the web for information.",
            "input_schema": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        }
    ]

    agent = TracingAgent(
        tools=tools,
        system_prompt="You are a helpful research assistant. Use tools when needed.",
        max_steps=10,
    )

    tasks = [
        EvalTask(
            task_id="task_001",
            query="What is the capital of France?",
            expected_output="Paris",
            difficulty="easy",
            category="factual",
        ),
        EvalTask(
            task_id="task_002",
            query="Explain the transformer architecture in 3 sentences.",
            difficulty="medium",
            category="explanation",
        ),
    ]

    harness = EvaluationHarness(
        agent=agent,
        metrics=[
            completion_rate_metric,
            tool_error_rate_metric,
            step_count_metric,
            cost_efficiency_metric,
        ],
        timeout_seconds=60,
    )

    results = harness.run_suite(tasks)
    summary = harness.summarize(results)

    print("\n── Evaluation Summary ─────────────────────────")
    print(json.dumps(summary, indent=2))

    for result in results:
        print(f"\nTask {result.task_id}:")
        print(f"  Status: {result.status.value}")
        print(f"  Steps: {result.total_steps}")
        print(f"  Cost: ${result.estimated_cost_usd:.4f}")
        print(f"  Metrics: {result.metrics}")


if __name__ == "__main__":
    demo()
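The demo sets `expected_output="Paris"` on its first task, but none of the built-in metrics read that field. Below is a sketch of one more metric with the same `(task, result) -> float` shape that does. It relies only on the `expected_output` and `final_output` attributes, so it is shown here with stand-in objects; the normalization rule and whole-word matching are assumptions that suit short factual answers, not open-ended ones.

```python
import re
from types import SimpleNamespace


def _normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def expected_output_match_metric(task, result) -> float:
    """1.0 if the expected output appears as whole words in the final output, else 0.0.

    Returns 1.0 when the task has no expected_output, so open-ended tasks
    are not penalized by this metric.
    """
    if task.expected_output is None:
        return 1.0
    if not result.final_output:
        return 0.0
    expected = _normalize(task.expected_output)
    actual = _normalize(result.final_output)
    # Pad with spaces so "paris" does not match inside "comparison".
    return 1.0 if f" {expected} " in f" {actual} " else 0.0


if __name__ == "__main__":
    # Stand-ins with the same attribute names as EvalTask / EvalResult.
    task = SimpleNamespace(expected_output="Paris")
    result = SimpleNamespace(final_output="The capital of France is Paris.")
    print(expected_output_match_metric(task, result))
```

Scoring tasks without an expected output as 1.0 keeps the metric defined for every task but inflates its mean on mixed task sets. Filtering to tasks that have an expected output before averaging is a reasonable alternative.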

Production Engineering Notes

Isolate Your Eval Environment

Never run eval against your production API endpoints or databases. Mistakes in eval can corrupt production data, exhaust rate limits, or trigger real-world actions (emails sent, payments processed). Use sandbox environments with mock tool implementations for eval.
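A minimal sketch of what isolation can look like at the tool layer: a dispatcher that serves registered mocks and refuses anything with real-world side effects. The class, the tool names, and the blocklist are hypothetical; the `(result, error)` return shape mirrors `_execute_tool` in the harness above.

```python
from typing import Callable, Optional

# Hypothetical names of tools with real-world side effects.
SIDE_EFFECT_TOOLS = {"send_email", "process_payment", "update_record"}


class SandboxToolExecutor:
    """Dispatches tool calls to mocks and records blocked side-effect calls."""

    def __init__(self, mocks: dict[str, Callable[[dict], str]]):
        self.mocks = mocks
        self.blocked_calls: list[tuple[str, dict]] = []

    def execute(self, name: str, tool_input: dict) -> tuple[Optional[str], Optional[str]]:
        """Return (result, error) for one tool call."""
        if name in self.mocks:
            return self.mocks[name](tool_input), None
        if name in SIDE_EFFECT_TOOLS:
            # Never fall through to a real implementation during eval.
            self.blocked_calls.append((name, tool_input))
            return None, f"tool '{name}' is blocked in the eval sandbox"
        return None, f"unknown tool '{name}'"
```

A `TracingAgent` subclass could delegate `_execute_tool` to an instance of this class. The `blocked_calls` list doubles as an eval signal: an agent that attempts a payment during a read-only task has revealed a safety problem.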

Version Your Eval Sets

Store eval tasks in version control alongside the code they test. When you change the agent, update the eval set. When you find a production failure, add a corresponding eval task. The eval set is a living document.

Track Baselines

Every time you run eval, store the results with the model version, prompt hash, and timestamp. Without baselines, you cannot detect regressions. A score of 87% is meaningless unless you know the previous score was 91%.
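A baseline record and a regression check can be very small. The sketch below uses a truncated SHA-256 digest as the prompt hash; the record layout, the function names, and the default tolerance of 0.02 are assumptions for illustration.

```python
import hashlib
import time


def prompt_hash(system_prompt: str) -> str:
    """Short, stable fingerprint of the prompt text."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12]


def make_baseline_record(model_version: str, system_prompt: str, scores: dict[str, float]) -> dict:
    """Bundle eval scores with the metadata needed to compare runs later."""
    return {
        "model_version": model_version,
        "prompt_hash": prompt_hash(system_prompt),
        "timestamp": time.time(),
        "scores": dict(scores),
    }


def find_regressions(baseline: dict, current: dict, tolerance: float = 0.02) -> dict[str, float]:
    """Return {metric: drop} for metrics that fell by more than `tolerance`.

    Assumes higher is better for every metric in `scores`.
    """
    regressions = {}
    for name, old in baseline["scores"].items():
        new = current["scores"].get(name)
        if new is not None and old - new > tolerance:
            regressions[name] = old - new
    return regressions
```

Because agent runs are non-deterministic, the tolerance should be set from the observed run-to-run variance of each metric; otherwise noise will trigger false alarms.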

Make Eval Fast Enough to Run Often

If eval takes 8 hours, it will not be run before every release. Design a "fast eval" suite (50 examples, 10 minutes) for daily use and a "comprehensive eval" suite (500 examples, 2 hours) for weekly or pre-release use.

:::danger Common Mistake: Evaluating Output, Not Impact
The most common agent evaluation mistake is measuring output quality instead of user impact. An agent that produces a technically correct answer in a format the user cannot use has failed. Always tie evaluation to the user's actual goal, not the agent's intermediate output.
:::

:::warning Distribution Shift is Silent
Your eval score will not drop when your production distribution shifts. The eval set just becomes less representative. Build a pipeline that continuously samples production traces and adds them to your eval set. Without this, your eval score will steadily diverge from your real quality.
:::

:::tip Start With the Failure Cases
The most valuable eval examples are the ones your agent currently fails. When you find a production failure, add it to your eval set immediately. A failure-focused eval set catches regressions more reliably than a balanced one.
:::

Interview Q&A

Q: Why is evaluating agents fundamentally harder than evaluating static models?

A: Three core reasons. First, agents have multi-step trajectories with dependencies between steps - you cannot evaluate each step independently because a small error in step 2 changes everything downstream. Second, most agent tasks have multiple valid outputs and paths, so there is no single ground truth to compare against. Third, agent evaluation is expensive - each run costs real tokens and time - which constrains how many examples you can evaluate and how often. Static models have single-step independent predictions with clear ground truth labels and near-zero evaluation cost.

Q: What is the compound error problem in agent evaluation?

A: When an agent makes a small mistake in an early step, that mistake propagates through all subsequent steps. The final output may be confidently wrong, even though each individual step, evaluated in isolation, looks reasonable. This makes attribution very difficult - you know the final output is wrong, but you cannot easily identify which step caused the failure. It also means step-level evaluation metrics can be misleading: good scores at each step do not guarantee a good final output.

Q: Explain the evaluation pyramid for agents.

A: The evaluation pyramid has five layers, each with different characteristics. At the base, unit tests check individual components (tool parsers, prompt formatters) - fast, cheap, run on every commit. Integration tests run end-to-end agent trajectories on a small curated set - run per PR or daily. LLM-as-judge provides automated quality scoring at scale - run weekly or pre-release. Human evaluation is the highest-quality signal - run quarterly or before major releases. At the top, production monitoring provides continuous real-world signal. Each layer informs the ones below: a production anomaly becomes an integration test, a pattern of human eval failures becomes a judge rubric.

Q: What is a latent failure in agent evaluation, and how do you detect it?

A: A latent failure is when an agent scores well on your evaluation metrics but fails in practice. For example, a support agent might have a 94% "task completion rate" measured by whether it produces a final response - but 30% of those completions might be wrong answers delivered confidently. Latent failures arise from a mismatch between your proxy metric and the true goal. Detection requires multi-dimensional evaluation: do not rely on a single metric. Measure task completion, output quality, user correction rate, and escalation rate together. When metrics diverge from each other, investigate.

Q: How would you design an eval strategy for a new agent with no existing eval set?

A: I would start by defining success criteria with domain experts - what does a good output look like? What does a bad one look like? Then collect 50–100 representative production queries (or synthetic ones if production is not available yet), covering typical cases, edge cases, and known difficult scenarios. For each task, I would note the expected properties of a good answer (not a specific answer). I would run the agent on all tasks, review the outputs manually to establish a baseline quality score, and identify the top failure modes. Those failure modes become the first regression tests. Then I would add an LLM-as-judge evaluation for scalable automated scoring, calibrated against my manual scores. Finally, I would set up production monitoring to detect drift and continuously add production failures to the eval set.
