Challenges of Evaluating Agents
The Question Nobody Wants to Answer
You deploy an agent. It handles customer support queries, autonomously researches topics, writes and executes code, or manages complex workflows. Users are using it. The product is live.
Is it good?
How would you know?
There is no assert agent.output == expected for complex tasks. You cannot write a unit test that checks whether a customer support agent resolved a complaint well. You cannot write an assertion that confirms an autonomous researcher found the most relevant papers. The gap between "the agent produced output" and "the agent did its job well" is enormous - and most teams never close it.
This lesson is about understanding exactly why that gap exists, why it is harder to close than it seems, and how to build the engineering discipline to close it anyway.
Why This Exists
The Model Evaluation Illusion
Before agents, evaluation was almost easy. You had a model. You had a labeled test set. You computed accuracy, F1, BLEU, or ROUGE and declared victory. These metrics had real problems - BLEU scores for translation famously correlate poorly with human judgment - but the framework was clear: inputs go in, outputs come out, outputs get compared to labels, number emerges.
This framework breaks completely for agents. The assumptions it rests on - that there is one correct output, that inputs and outputs are independent, that evaluation is cheap - all fail simultaneously.
The field is still catching up. As of 2025, there is no agreed standard for agent evaluation the way ImageNet was a standard for image classification. Different labs use different benchmarks. Different companies use different internal metrics. This is not a solved problem. It is an active research and engineering challenge.
Understanding why it is hard is the first step toward solving it.
The Multiple Valid Paths Problem
Ask an agent: "Find the three most relevant papers on retrieval-augmented generation from 2023 and summarize their key contributions."
What is the correct answer?
There are hundreds of relevant papers. Any three of the top-tier ones would be defensible. The summaries could emphasize different aspects. The agent could use web search, or a paper API, or a vector database. It could find papers in any order. It could produce the summaries in any format.
Every one of these choices produces a different trajectory and a different output. Almost none of them is clearly wrong. Almost all of them are defensible. There is no single ground truth.
This is the multiple valid paths problem. For any non-trivial task, there exist dozens to thousands of correct trajectories and outputs. An evaluation metric that penalizes the agent for not matching one specific reference output will measure nothing useful. An evaluation metric that accepts any output accepts nonsense too.
The solution is not to find the "right" answer and compare against it. The solution is to define the properties a good answer must have - and evaluate those properties instead.
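One way to make "evaluate properties, not a reference answer" concrete is a set of named predicates over the final output. The sketch below is illustrative only: the specific checks (`cites_n_papers`, `mentions_year`, `max_words`) and the bracketed-citation format are assumptions for the RAG-papers example, not a prescribed rubric.

```python
import re
from typing import Callable

# A property is a named predicate over the agent's final output text.
PropertyCheck = Callable[[str], bool]


def cites_n_papers(n: int) -> PropertyCheck:
    """Passes if the output has at least n distinct citations like [1], [2]."""
    return lambda text: len(set(re.findall(r"\[(\d+)\]", text))) >= n


def mentions_year(year: int) -> PropertyCheck:
    """Passes if the given year appears anywhere in the output."""
    return lambda text: str(year) in text


def max_words(limit: int) -> PropertyCheck:
    """Passes if the output is at most `limit` whitespace-separated words."""
    return lambda text: len(text.split()) <= limit


def check_properties(output: str, checks: dict[str, PropertyCheck]) -> dict[str, bool]:
    """Evaluate every named property and return a pass/fail report."""
    return {name: check(output) for name, check in checks.items()}


# Hypothetical property set for the RAG-papers task described above.
rag_task_checks = {
    "three_papers_cited": cites_n_papers(3),
    "from_2023": mentions_year(2023),
    "concise": max_words(300),
}

sample = "In 2023, three papers stood out: [1] ..., [2] ..., [3] ..."
report = check_properties(sample, rag_task_checks)
```

Any output that satisfies the properties passes, regardless of which three papers were chosen or how the agent found them. Surface checks like these only cover what can be verified mechanically; correctness of the summaries still needs a rubric-based judge or a human.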
Why Agent Evaluation Differs From Model Evaluation
| Dimension | Model Evaluation | Agent Evaluation |
|---|---|---|
| Output space | Fixed (class, token, embedding) | Open-ended (text, actions, tool calls) |
| Ground truth | Single label per example | Multiple valid outputs |
| Dependencies | Each example is independent | Steps within a trajectory depend on previous steps |
| Cost | Cheap (forward pass) | Expensive (multiple API calls, real tools) |
| Determinism | Mostly deterministic | Non-deterministic (temperature, tool results) |
| Failure modes | Misclassification, hallucination | Compound errors, infinite loops, wrong tool use |
| Time | Milliseconds | Seconds to minutes |
The most important row is dependencies. When you evaluate a sentiment classifier, each example is independent. When you evaluate an agent, step 3's input depends on the result of step 2, which depends on step 1. An error in step 2 changes everything that follows. You cannot evaluate each step independently and combine the scores - the interactions are what matter.
The Compound Error Problem
Imagine a research agent with 10 steps:
1. Parse the user's question
2. Decompose into sub-questions
3. Search for relevant papers
4. Filter by relevance
5. Extract key claims from each paper
6. Cross-reference claims
7. Identify conflicts or gaps
8. Synthesize an answer
9. Format the response
10. Generate citations
If step 3 returns slightly wrong papers - papers that are adjacent to the topic but not quite right - every downstream step operates on flawed inputs. By step 8, the synthesis is fundamentally wrong. The final output may look polished and confident. It will be wrong.
The agent "succeeded" at every step in isolation. The compound effect of a small error in step 3 produced a failure.
This is the compound error problem. Attribution is nearly impossible: which step failed? The step that returned slightly wrong papers? Or the step that did not detect this and flag it? Or the step that should have searched for more papers to cross-check?
Practical implication: evaluate at the trajectory level, not the step level. The question is not "was step 3 good?" but "did the trajectory as a whole lead to a good outcome?"
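A rough back-of-the-envelope calculation shows why small per-step error rates matter so much. The sketch below assumes each step succeeds independently with the same probability, which is a simplification (real steps are correlated, and some errors are recoverable), so treat the numbers as intuition rather than a model of any particular agent.

```python
def trajectory_success_probability(step_success: float, num_steps: int) -> float:
    """P(every step succeeds), assuming independent steps with equal success rates."""
    return step_success ** num_steps


for p in (0.99, 0.95, 0.90):
    overall = trajectory_success_probability(p, 10)
    print(f"per-step {p:.2f} -> 10-step trajectory {overall:.3f}")
# per-step 0.99 -> 10-step trajectory 0.904
# per-step 0.95 -> 10-step trajectory 0.599
# per-step 0.90 -> 10-step trajectory 0.349
```

Under these assumptions, a step that looks 95% reliable in isolation yields a 10-step trajectory that succeeds only about 60% of the time, which is why step-level scores say little about end-to-end quality.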
Latent Failures: When Metrics Lie
The most dangerous failure mode in agent evaluation is latent failure: the agent scores well on your metrics but fails in practice.
A customer support agent has a "task completion rate" metric. A task is marked complete when the agent sends a final response. The metric shows 94% completion. Excellent.
What the metric does not capture: 31% of completed tasks required the user to follow up with a correction. 15% of completions involved the agent apologizing and escalating to a human. 8% involved the agent providing incorrect information confidently.
The metric was optimized, not the behavior.
Latent failures arise from a mismatch between your proxy metric and the true goal. Common examples:
- Completion rate measured as "agent produced a final output" - but outputs can be wrong
- Accuracy measured on a curated test set - but the test set does not represent production queries
- User satisfaction measured via thumbs up/down - but users often do not rate, and those who do are not representative
- Latency measured as time to first token - but the agent may still be thinking for 30 more seconds
The lesson: always question what your metric actually measures. Work backward from the true goal. If your true goal is "users successfully accomplish their task," every proxy metric is one step removed from that truth.
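To see how a proxy metric and the true goal can diverge, it can help to compute both from the same logs. The following is a minimal sketch with a hypothetical `Interaction` record and made-up data; the field names are assumptions about what a support system might log.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    completed: bool       # agent sent a final response
    user_corrected: bool  # user followed up with a correction
    escalated: bool       # conversation was handed to a human


def summarize_interactions(logs: list[Interaction]) -> dict[str, float]:
    """Report the proxy (completion) next to rates that are closer to the goal."""
    if not logs:
        raise ValueError("need at least one interaction")
    n = len(logs)
    completed = [x for x in logs if x.completed]
    clean = [x for x in completed if not (x.user_corrected or x.escalated)]
    return {
        "completion_rate": len(completed) / n,
        "correction_rate": sum(x.user_corrected for x in completed) / n,
        "escalation_rate": sum(x.escalated for x in completed) / n,
        "clean_resolution_rate": len(clean) / n,
    }


# Synthetic example: 9 of 10 conversations "complete", but only 5 resolve cleanly.
logs = (
    [Interaction(True, False, False)] * 5
    + [Interaction(True, True, False)] * 3
    + [Interaction(True, False, True)] * 1
    + [Interaction(False, False, False)] * 1
)
report = summarize_interactions(logs)
```

Here the completion rate is 0.9 while the clean resolution rate is 0.5. Reporting the two side by side makes the gap visible instead of hiding it behind a single number.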
Distribution Shift: Eval Set vs. Production
You build your eval set by collecting examples in February. You ship in April. By July, users are asking questions you never anticipated. Your eval score is still 87%. Your production quality has dropped to 71%.
Distribution shift is when the data your agent sees in production differs from the data you evaluated on. For agents, this is particularly vicious because:
- User behavior adapts to the agent - they learn what it is good at and route other tasks elsewhere
- The world changes - new events, new information, new tools
- The agent's deployment context changes - new integrations, new user segments, new use cases
- Adversarial users appear - people who probe the agent's weaknesses
Production distribution will always differ from your eval distribution. The correct response is not to build a bigger eval set. It is to continuously collect production traces, sample them, and use them to update your eval set. The eval set must evolve with the agent's environment.
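The refresh loop described above could look roughly like the sketch below. It assumes queries are plain strings and that case-insensitive exact matching is good enough for deduplication; a real pipeline would likely also scrub personal data and attach expected properties to each new task before use.

```python
import random


def refresh_eval_set(
    eval_queries: list[str],
    production_queries: list[str],
    sample_size: int,
    seed: int = 0,
) -> list[str]:
    """Append a random sample of not-yet-covered production queries to the eval set."""
    seen = {q.strip().lower() for q in eval_queries}
    unseen: list[str] = []
    for query in production_queries:
        key = query.strip().lower()
        if key not in seen:  # skip queries already covered (or already queued)
            seen.add(key)
            unseen.append(query)
    rng = random.Random(seed)  # seeded so a refresh is reproducible
    picked = rng.sample(unseen, min(sample_size, len(unseen)))
    return eval_queries + picked


current = ["How do I reset my password?"]
production = [
    "how do i reset my password? ",
    "Cancel my order",
    "Cancel my order",
    "Refund status",
]
updated = refresh_eval_set(current, production, sample_size=5)
```

Random sampling keeps the eval set representative of what users actually ask. Failure cases found in production can be added on top of this sample rather than replacing it.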
The Cost of Evaluation
Running a static image classifier on 10,000 test examples costs fractions of a cent. Running an agent on 10,000 test examples can cost hundreds to thousands of dollars and take many hours.
A single agent run might involve:
- 5-30 LLM API calls (up to roughly $0.50 each for capable models)
- 5-20 tool calls (web search, code execution, database queries)
- 30 seconds to 5 minutes of wall-clock time
At scale:
- 100 eval examples: up to roughly $100, 1-2 hours
- 1000 eval examples: up to roughly $1,000, 10-20 hours
- 10,000 eval examples: unaffordable for most teams
This forces hard tradeoffs. You cannot exhaustively evaluate every agent change. You must be strategic: small regression test suites for fast iteration, larger comprehensive suites for release decisions, and production monitoring as a continuous signal.
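Before committing to a suite size, it can be worth estimating cost and wall-clock time explicitly. The sketch below is simple arithmetic; every input in the example call (calls per run, price per call, seconds per run) is an assumed placeholder to be replaced with numbers measured from your own agent.

```python
def estimate_suite(
    num_examples: int,
    llm_calls_per_run: int,
    cost_per_llm_call_usd: float,
    seconds_per_run: float,
    parallelism: int = 1,
) -> tuple[float, float]:
    """Return (total_cost_usd, wall_clock_hours) for one pass over an eval suite."""
    total_cost = num_examples * llm_calls_per_run * cost_per_llm_call_usd
    wall_clock_hours = num_examples * seconds_per_run / parallelism / 3600
    return total_cost, wall_clock_hours


# Assumed inputs: 10 LLM calls per run at $0.05 each, 60 seconds per run.
for n in (100, 1_000, 10_000):
    cost, hours = estimate_suite(n, 10, 0.05, 60)
    print(f"{n:>6} examples: ${cost:,.0f}, {hours:.1f} h sequential")
```

Parallelism shortens wall-clock time but not cost, so the dollar figure is usually what bounds how large the per-commit suite can be.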
Human Evaluation at Scale: Slow, Expensive, Necessary
Human evaluation is the gold standard - ultimately, agents serve humans, so human judgment is the closest proxy to real quality. But human evaluation has severe practical limitations:
- Speed: A human annotator evaluates 10-50 agent outputs per hour. An LLM evaluates thousands per minute.
- Cost: Human annotation costs up to roughly $10 per example, depending on complexity and expertise required.
- Consistency: Different annotators disagree. Even the same annotator disagrees with themselves 10-20% of the time on subjective tasks.
- Scalability: You cannot run human eval on every commit, every experiment, every model version.
Human evaluation is not optional - but it cannot be your primary evaluation signal. The practical approach:
- Use human evaluation to calibrate automated metrics
- Use human evaluation to validate before major releases
- Use human evaluation to investigate anomalies caught by automated metrics
- Use LLM-as-judge for continuous automated evaluation
- Use production monitoring for real-time signal
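"Calibrating" an automated metric against human judgment usually means measuring how often the two agree on the same examples. A common choice is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below assumes binary pass/fail labels from one human annotator and one LLM judge; the label lists are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled at random with their own base rates.
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(labels_a) | set(labels_b)
    )
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


human = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]  # 1 = acceptable, 0 = not acceptable
judge = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
kappa = cohens_kappa(human, judge)  # raw agreement 0.8, kappa about 0.6
```

If agreement is low, the usual fix is to revise the judge rubric and re-measure on a fresh human-labeled sample, rather than trusting the judge at scale.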
The Evaluation Pyramid
Think of agent evaluation as a pyramid - five layers, each with different characteristics:
Unit tests (base of pyramid): fast, cheap, many. Test individual functions - tool parsers, prompt formatters, response extractors. Run on every commit. Catches implementation bugs.
Integration tests: end-to-end agent runs on a small curated test set (20-50 examples). Real tools, real API calls, real outputs. Run per PR or daily. Catches regressions in agent behavior.
LLM-as-judge: automated quality scoring of agent trajectories and outputs. Scalable - can evaluate hundreds of examples overnight. Run weekly or before releases. Catches quality regressions that unit/integration tests miss.
Human evaluation: periodic structured evaluation by human annotators. Small sample (50-200 examples). Run quarterly or before major releases. The quality ground truth.
Production monitoring: continuous measurement of agent behavior in production. The ultimate real-world signal. Catches failure modes that evals never anticipated.
Each layer informs the layers below. A production anomaly becomes a new integration test. A pattern of human eval failures becomes a new LLM-judge rubric criterion.
Dimensions to Evaluate
No single metric captures agent quality. You need a multi-dimensional evaluation:
| Dimension | Question | Metric Type |
|---|---|---|
| Task completion | Did the agent accomplish the goal? | Binary or graded (0-1) |
| Output quality | Is the final output correct and useful? | Rubric-based score |
| Trajectory efficiency | Was the path to the answer reasonable? | Steps taken / minimum steps |
| Tool precision | Did the agent use the right tools correctly? | Precision/recall |
| Error recovery | When something went wrong, did the agent recover? | Steps to recovery |
| Safety | Did the agent avoid harmful outputs or actions? | Binary policy checks |
| Latency | How long did it take? | Wall-clock time, p50/p95/p99 |
| Cost | How much did it cost? | Total tokens and API costs |
Not all dimensions matter equally for every use case. A coding assistant prioritizes task completion and code correctness. A customer support agent prioritizes user satisfaction and safety. A research agent prioritizes output quality and information coverage.
Define your evaluation dimensions before building your agent, not after.
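Two of the dimensions in the table, tool precision/recall and trajectory efficiency, are easy to compute once a task lists the tools it expects. The sketch below treats tool usage as a set of names (ignoring order and arguments) and inverts the table's "steps taken / minimum steps" ratio so that 1.0 is best; both choices are simplifying assumptions.

```python
def tool_precision_recall(expected: list[str], actual: list[str]) -> tuple[float, float]:
    """Set-based precision and recall of the tools the agent actually called."""
    expected_set, actual_set = set(expected), set(actual)
    hits = len(expected_set & actual_set)
    if actual_set:
        precision = hits / len(actual_set)
    else:
        # Convention: calling no tools is perfect only if none were expected.
        precision = 1.0 if not expected_set else 0.0
    recall = hits / len(expected_set) if expected_set else 1.0
    return precision, recall


def trajectory_efficiency(steps_taken: int, minimum_steps: int) -> float:
    """minimum_steps / steps_taken, capped at 1.0 (higher is better)."""
    if steps_taken <= 0 or minimum_steps <= 0:
        raise ValueError("step counts must be positive")
    return min(1.0, minimum_steps / steps_taken)


precision, recall = tool_precision_recall(
    expected=["web_search", "calculator"],
    actual=["web_search", "web_search", "send_email"],
)
efficiency = trajectory_efficiency(steps_taken=8, minimum_steps=4)
```

In this example the agent used one expected tool and one unexpected one, and missed the calculator, so precision and recall are both 0.5. Because many paths are valid, scores like these work better as diagnostics than as pass/fail gates.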
Building an Evaluation Mindset
The hardest part of agent evaluation is not technical - it is the discipline to define success before you build, and to measure it honestly.
Ask yourself these questions before shipping any agent:
- What does success look like? Write a specific, measurable definition. "The agent helps users" is not a definition. "The agent produces a correct, complete answer to 80% of queries, as judged by a domain expert on a 1-5 scale, with 4 or above counted as success" is a definition.
- What evidence would convince you the agent works? If you ran 100 random production queries and a domain expert rated them, what score would be acceptable? If you cannot answer this, you do not have a success criterion.
- What failure modes are you afraid of? Write them down. Then build tests that specifically probe for them.
- How will you detect regressions? If the agent gets worse after a model update or prompt change, how will you know within 24 hours?
- How will your evaluation signal improve over time? A static eval set degrades. What is the process for keeping it fresh?
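A success definition like the one in the first question can be written down as an executable check, which forces the thresholds to be explicit. The sketch below encodes the example criterion (expert ratings on a 1-5 scale, 4 or above counts as success, 80% target); the ratings are invented for illustration.

```python
def meets_success_criterion(
    ratings: list[int],
    pass_score: int = 4,
    target_rate: float = 0.80,
) -> tuple[bool, float]:
    """Return (criterion_met, success_rate) for a batch of 1-5 expert ratings."""
    if not ratings:
        raise ValueError("need at least one rating")
    success_rate = sum(r >= pass_score for r in ratings) / len(ratings)
    return success_rate >= target_rate, success_rate


expert_ratings = [5, 4, 4, 3, 5, 4, 2, 5, 4, 4]  # hypothetical expert scores
passed, rate = meets_success_criterion(expert_ratings)
```

With these ratings, 8 of 10 answers score 4 or above, so the criterion is just met. On a sample this small, one rating changing would flip the result, which is a good argument for rating more than a handful of examples before declaring success.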
Full Python: Evaluation Harness Skeleton
"""
Agent evaluation harness with configurable metrics.
Run any agent against a task set, collect trajectories, compute scores.
"""
import asyncio
import json
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, Optional
import anthropic
client = anthropic.Anthropic()


# ── Data models ──────────────────────────────────────────────────────────────

class TaskStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    TIMEOUT = "timeout"


@dataclass
class EvalTask:
    """A single evaluation task."""
    task_id: str
    query: str
    expected_output: Optional[str] = None       # may be None for open-ended tasks
    expected_tool_calls: Optional[list] = None  # optional: tools that should be used
    difficulty: str = "medium"                  # easy / medium / hard
    category: str = "general"
    metadata: dict = field(default_factory=dict)


@dataclass
class TrajectoryStep:
    """One step in an agent trajectory."""
    step_number: int
    step_type: str  # "llm_call", "tool_call", "observation"
    input_tokens: int
    output_tokens: int
    tool_name: Optional[str] = None
    tool_input: Optional[dict] = None
    tool_output: Optional[str] = None
    llm_response: Optional[str] = None
    duration_ms: float = 0.0
    error: Optional[str] = None


@dataclass
class EvalResult:
    """Result of evaluating one task."""
    task_id: str
    run_id: str
    status: TaskStatus
    final_output: Optional[str]
    trajectory: list[TrajectoryStep]
    total_input_tokens: int
    total_output_tokens: int
    total_duration_ms: float
    total_tool_calls: int
    error_count: int
    metrics: dict[str, float] = field(default_factory=dict)
    judge_scores: dict[str, float] = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)

    @property
    def total_steps(self) -> int:
        return len(self.trajectory)

    @property
    def estimated_cost_usd(self) -> float:
        # Approximate Claude Sonnet rates in USD per million tokens.
        # These are assumptions: adjust them to the model you actually call.
        input_cost = (self.total_input_tokens / 1_000_000) * 3.0
        output_cost = (self.total_output_tokens / 1_000_000) * 15.0
        return input_cost + output_cost


# ── Agent wrapper ────────────────────────────────────────────────────────────
class TracingAgent:
    """
    Minimal tool-use agent loop that records its own trajectory.
    run() receives a query and returns (final_output, trajectory).
    """

    def __init__(self, tools: list[dict], system_prompt: str, max_steps: int = 20):
        self.tools = tools
        self.system_prompt = system_prompt
        self.max_steps = max_steps  # counts LLM calls and tool calls together

    def run(self, query: str) -> tuple[Optional[str], list[TrajectoryStep]]:
        """Run the agent, returning (final_output, trajectory)."""
        messages = [{"role": "user", "content": query}]
        trajectory: list[TrajectoryStep] = []
        step_number = 0

        while step_number < self.max_steps:
            step_number += 1
            t0 = time.time()

            # LLM call
            response = client.messages.create(
                model="claude-opus-4-6",
                max_tokens=4096,
                system=self.system_prompt,
                tools=self.tools,
                messages=messages,
            )
            duration_ms = (time.time() - t0) * 1000

            step = TrajectoryStep(
                step_number=step_number,
                step_type="llm_call",
                input_tokens=response.usage.input_tokens,
                output_tokens=response.usage.output_tokens,
                llm_response=self._extract_text(response),
                duration_ms=duration_ms,
            )
            trajectory.append(step)

            # Check stop condition
            if response.stop_reason == "end_turn":
                final_text = self._extract_text(response)
                return final_text, trajectory

            # Process tool calls
            if response.stop_reason == "tool_use":
                tool_calls = [b for b in response.content if b.type == "tool_use"]
                if not tool_calls:
                    return self._extract_text(response), trajectory

                # Add assistant turn
                messages.append({"role": "assistant", "content": response.content})

                # Execute tools
                tool_results = []
                for tc in tool_calls:
                    step_number += 1
                    t0 = time.time()
                    result, error = self._execute_tool(tc.name, tc.input)
                    tool_duration = (time.time() - t0) * 1000

                    tool_step = TrajectoryStep(
                        step_number=step_number,
                        step_type="tool_call",
                        input_tokens=0,
                        output_tokens=0,
                        tool_name=tc.name,
                        tool_input=tc.input,
                        tool_output=result,
                        duration_ms=tool_duration,
                        error=error,
                    )
                    trajectory.append(tool_step)

                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": tc.id,
                        "content": result if result else f"Error: {error}",
                    })

                messages.append({"role": "user", "content": tool_results})
                continue

            # Any other stop reason (for example "max_tokens") leaves `messages`
            # unchanged, so looping would only repeat the same call and spend
            # tokens. Treat the run as failed instead.
            return None, trajectory

        return None, trajectory  # Hit max_steps

    def _extract_text(self, response) -> Optional[str]:
        """Return the first text block in the response, if any."""
        for block in response.content:
            if hasattr(block, "text"):
                return block.text
        return None

    def _execute_tool(self, name: str, tool_input: dict) -> tuple[Optional[str], Optional[str]]:
        """Execute a tool and return (result, error). Override in subclasses with real implementations."""
        return f"[Mock result for tool={name} input={tool_input}]", None


# ── Evaluation harness ───────────────────────────────────────────────────────
class EvaluationHarness:
    """
    Runs an agent against a task set and computes configurable metrics.
    """

    def __init__(
        self,
        agent: TracingAgent,
        metrics: list[Callable[[EvalTask, EvalResult], float]],
        timeout_seconds: float = 120.0,
    ):
        self.agent = agent
        self.metrics = metrics
        self.timeout_seconds = timeout_seconds

    def run_single(self, task: EvalTask) -> EvalResult:
        """Run one task and return the result."""
        run_id = str(uuid.uuid4())[:8]
        t0 = time.time()

        try:
            final_output, trajectory = self._run_with_timeout(task.query)
            status = TaskStatus.COMPLETED if final_output else TaskStatus.FAILED
        except TimeoutError:
            # The partial trajectory is lost here; only the status is kept.
            final_output = None
            trajectory = []
            status = TaskStatus.TIMEOUT
        except Exception as exc:
            final_output = None
            trajectory = []
            status = TaskStatus.FAILED
            print(f"Task {task.task_id} failed: {exc}")

        total_duration = (time.time() - t0) * 1000

        result = EvalResult(
            task_id=task.task_id,
            run_id=run_id,
            status=status,
            final_output=final_output,
            trajectory=trajectory,
            total_input_tokens=sum(s.input_tokens for s in trajectory),
            total_output_tokens=sum(s.output_tokens for s in trajectory),
            total_duration_ms=total_duration,
            total_tool_calls=sum(1 for s in trajectory if s.step_type == "tool_call"),
            error_count=sum(1 for s in trajectory if s.error is not None),
        )

        # Compute metrics
        for metric_fn in self.metrics:
            try:
                score = metric_fn(task, result)
                result.metrics[metric_fn.__name__] = score
            except Exception as e:
                print(f"Metric {metric_fn.__name__} failed: {e}")
                # -1.0 is a sentinel for "metric could not be computed".
                result.metrics[metric_fn.__name__] = -1.0

        return result

    def run_suite(self, tasks: list[EvalTask], max_workers: int = 4) -> list[EvalResult]:
        """Run all tasks sequentially. `max_workers` is accepted but currently unused."""
        results = []
        for i, task in enumerate(tasks):
            print(f"Running task {i+1}/{len(tasks)}: {task.task_id}")
            result = self.run_single(task)
            results.append(result)
            print(f"  Status: {result.status.value} | "
                  f"Steps: {result.total_steps} | "
                  f"Cost: ${result.estimated_cost_usd:.4f}")
        return results

    def _run_with_timeout(
        self, query: str
    ) -> tuple[Optional[str], list[TrajectoryStep]]:
        """Run agent with a wall-clock timeout.

        Relies on SIGALRM, so it only works on Unix and in the main thread.
        """
        import signal

        def handler(signum, frame):
            raise TimeoutError()

        previous_handler = signal.signal(signal.SIGALRM, handler)
        # setitimer accepts fractional seconds. alarm(int(...)) would truncate,
        # and a timeout below one second would disable the alarm entirely.
        signal.setitimer(signal.ITIMER_REAL, self.timeout_seconds)
        try:
            return self.agent.run(query)
        finally:
            # Always cancel the timer and restore the previous handler.
            signal.setitimer(signal.ITIMER_REAL, 0)
            signal.signal(signal.SIGALRM, previous_handler)

    def summarize(self, results: list[EvalResult]) -> dict:
        """Compute aggregate statistics across all results."""
        if not results:
            return {}

        completed = [r for r in results if r.status == TaskStatus.COMPLETED]
        completion_rate = len(completed) / len(results)

        # Collect metric names from every result, not only the first one.
        metric_names = dict.fromkeys(name for r in results for name in r.metrics)

        all_metrics = {}
        for metric_name in metric_names:
            # Skip the -1.0 "metric failed" sentinel so it does not skew the mean.
            values = [
                r.metrics[metric_name]
                for r in results
                if metric_name in r.metrics and r.metrics[metric_name] != -1.0
            ]
            if values:
                all_metrics[metric_name] = {
                    "mean": sum(values) / len(values),
                    "min": min(values),
                    "max": max(values),
                }

        return {
            "total_tasks": len(results),
            "completed": len(completed),
            "failed": sum(1 for r in results if r.status == TaskStatus.FAILED),
            "timeout": sum(1 for r in results if r.status == TaskStatus.TIMEOUT),
            "completion_rate": completion_rate,
            "avg_steps": sum(r.total_steps for r in results) / len(results),
            "avg_duration_ms": sum(r.total_duration_ms for r in results) / len(results),
            "avg_cost_usd": sum(r.estimated_cost_usd for r in results) / len(results),
            "total_cost_usd": sum(r.estimated_cost_usd for r in results),
            "metrics": all_metrics,
        }


# ── Built-in metric functions ────────────────────────────────────────────────
def completion_rate_metric(task: EvalTask, result: EvalResult) -> float:
    """1.0 if task completed, 0.0 if failed/timeout."""
    return 1.0 if result.status == TaskStatus.COMPLETED else 0.0


def tool_error_rate_metric(task: EvalTask, result: EvalResult) -> float:
    """Fraction of tool calls that resulted in errors. Lower is better."""
    if result.total_tool_calls == 0:
        return 0.0
    return result.error_count / result.total_tool_calls


def step_count_metric(task: EvalTask, result: EvalResult) -> float:
    """Normalized step count. 1.0 is perfect (1 step), 0.0 is many steps."""
    if not result.trajectory:
        return 0.0
    # Normalize linearly: 1 step = 1.0, 21+ steps = 0.0
    return max(0.0, 1.0 - (result.total_steps - 1) / 20.0)


def cost_efficiency_metric(task: EvalTask, result: EvalResult) -> float:
    """Cost efficiency. $0 = 1.0, $1.00+ = 0.0."""
    cost = result.estimated_cost_usd
    return max(0.0, 1.0 - cost)


# ── Demo ─────────────────────────────────────────────────────────────────────
def demo():
    # Minimal tools for demo
    tools = [
        {
            "name": "web_search",
            "description": "Search the web for information.",
            "input_schema": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        }
    ]

    agent = TracingAgent(
        tools=tools,
        system_prompt="You are a helpful research assistant. Use tools when needed.",
        max_steps=10,
    )

    tasks = [
        EvalTask(
            task_id="task_001",
            query="What is the capital of France?",
            expected_output="Paris",
            difficulty="easy",
            category="factual",
        ),
        EvalTask(
            task_id="task_002",
            query="Explain the transformer architecture in 3 sentences.",
            difficulty="medium",
            category="explanation",
        ),
    ]

    harness = EvaluationHarness(
        agent=agent,
        metrics=[
            completion_rate_metric,
            tool_error_rate_metric,
            step_count_metric,
            cost_efficiency_metric,
        ],
        timeout_seconds=60,
    )

    results = harness.run_suite(tasks)
    summary = harness.summarize(results)

    print("\n── Evaluation Summary ──────────────────────────")
    print(json.dumps(summary, indent=2))

    for result in results:
        print(f"\nTask {result.task_id}:")
        print(f"  Status: {result.status.value}")
        print(f"  Steps: {result.total_steps}")
        print(f"  Cost: ${result.estimated_cost_usd:.4f}")
        print(f"  Metrics: {result.metrics}")


if __name__ == "__main__":
    demo()
Production Engineering Notes
Isolate Your Eval Environment
Never run eval against your production API endpoints or databases. Mistakes in eval can corrupt production data, exhaust rate limits, or trigger real-world actions (emails sent, payments processed). Use sandbox environments with mock tool implementations for eval.
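One possible shape for sandboxed tools is a small registry that maps each tool name to a side-effect-free stand-in and records every call for later inspection. The class and tool names below are hypothetical; the `(result, error)` return shape is chosen to mirror the `_execute_tool` method in the harness skeleton so a subclass could delegate to it.

```python
from typing import Callable, Optional


class SandboxToolRegistry:
    """Maps tool names to side-effect-free stand-ins and logs every call."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[[dict], str]] = {}
        self.calls: list[tuple[str, dict]] = []

    def register(self, name: str, fn: Callable[[dict], str]) -> None:
        self._tools[name] = fn

    def execute(self, name: str, tool_input: dict) -> tuple[Optional[str], Optional[str]]:
        """Return (result, error); exactly one of the two is None."""
        self.calls.append((name, tool_input))
        if name not in self._tools:
            return None, f"Unknown tool: {name}"
        try:
            return self._tools[name](tool_input), None
        except Exception as exc:  # surface tool bugs as errors, not crashes
            return None, str(exc)


sandbox = SandboxToolRegistry()
# The stand-in describes what would have happened instead of doing it.
sandbox.register("send_email", lambda args: f"[sandbox] email to {args['to']} not sent")

result, error = sandbox.execute("send_email", {"to": "user@example.com"})
missing_result, missing_error = sandbox.execute("charge_card", {"amount": 10})
```

Because unregistered tools return an error rather than falling through to a real implementation, a tool with real-world side effects cannot run during eval unless someone deliberately registers it. The call log also doubles as evidence for tool-usage metrics.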
Version Your Eval Sets
Store eval tasks in version control alongside the code they test. When you change the agent, update the eval set. When you find a production failure, add a corresponding eval task. The eval set is a living document.
Track Baselines
Every time you run eval, store the results with the model version, prompt hash, and timestamp. Without baselines, you cannot detect regressions. A score of 87% is meaningless unless you know the previous score was 91%.
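A baseline record only needs a few fields to be useful. The sketch below builds one record per eval run and flags a regression when the new score falls more than a tolerance below the most recent baseline. The field names, the 12-character hash prefix, and the 0.02 tolerance are all assumptions; in practice the records might be appended to a JSONL file or a database table.

```python
import hashlib
import json
from datetime import datetime, timezone


def prompt_hash(prompt: str) -> str:
    """Short, stable fingerprint of the prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


def make_record(score: float, model_version: str, system_prompt: str) -> dict:
    """Bundle an eval score with what is needed to reproduce and compare it."""
    return {
        "score": score,
        "model_version": model_version,
        "prompt_hash": prompt_hash(system_prompt),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def is_regression(history: list[dict], new_record: dict, tolerance: float = 0.02) -> bool:
    """True if the new score is more than `tolerance` below the latest baseline."""
    if not history:
        return False  # nothing to compare against yet
    return new_record["score"] < history[-1]["score"] - tolerance


history = [make_record(0.91, "model-v1", "You are a helpful assistant.")]
candidate = make_record(0.87, "model-v2", "You are a helpful assistant.")
regressed = is_regression(history, candidate)
line = json.dumps(candidate)  # one JSONL line, ready to append to a log
```

The tolerance exists because agent runs are non-deterministic: a score can move slightly between identical runs, so the threshold should be wider than that run-to-run noise.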
Make Eval Fast Enough to Run Often
If eval takes 8 hours, it will not be run before every release. Design a "fast eval" suite (50 examples, 10 minutes) for daily use and a "comprehensive eval" suite (500 examples, 2 hours) for weekly or pre-release use.
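One way to carve a fast suite out of a comprehensive one is to sample a fixed number of tasks per category, so the small suite still touches every kind of task. This is a sketch under assumptions: tasks are plain dicts with a `category` key, and stratifying by category is one reasonable choice rather than the only one.

```python
import random
from collections import defaultdict


def stratified_sample(tasks: list[dict], per_category: int, seed: int = 0) -> list[dict]:
    """Pick up to `per_category` tasks from each category, reproducibly."""
    by_category: dict[str, list[dict]] = defaultdict(list)
    for task in tasks:
        by_category[task["category"]].append(task)

    rng = random.Random(seed)  # fixed seed keeps the fast suite stable across runs
    picked: list[dict] = []
    for category in sorted(by_category):
        group = by_category[category]
        picked.extend(rng.sample(group, min(per_category, len(group))))
    return picked


comprehensive = (
    [{"task_id": f"fact_{i}", "category": "factual"} for i in range(10)]
    + [{"task_id": f"expl_{i}", "category": "explanation"} for i in range(3)]
)
fast_suite = stratified_sample(comprehensive, per_category=5)
```

Keeping the seed fixed means day-to-day score changes reflect the agent rather than a reshuffled sample. Resampling occasionally, with a new recorded baseline, guards against overfitting to the fast suite.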
:::danger Common Mistake: Evaluating Output, Not Impact
The most common agent evaluation mistake is measuring output quality instead of user impact. An agent that produces a technically correct answer in a format the user cannot use has failed. Always tie evaluation to the user's actual goal, not the agent's intermediate output.
:::

:::warning Distribution Shift is Silent
Your eval score will not drop when your production distribution shifts. The eval set just becomes less representative. Build a pipeline that continuously samples production traces and adds them to your eval set. Without this, your eval score will steadily diverge from your real quality.
:::

:::tip Start With the Failure Cases
The most valuable eval examples are the ones your agent currently fails. When you find a production failure, add it to your eval set immediately. A failure-focused eval set catches regressions more reliably than a balanced one.
:::
Interview Q&A
Q: Why is evaluating agents fundamentally harder than evaluating static models?
A: Three core reasons. First, agents have multi-step trajectories with dependencies between steps - you cannot evaluate each step independently because a small error in step 2 changes everything downstream. Second, most agent tasks have multiple valid outputs and paths, so there is no single ground truth to compare against. Third, agent evaluation is expensive - each run costs real tokens and time - which constrains how many examples you can evaluate and how often. Static models have single-step independent predictions with clear ground truth labels and near-zero evaluation cost.
Q: What is the compound error problem in agent evaluation?
A: When an agent makes a small mistake in an early step, that mistake propagates through all subsequent steps. The final output may be confidently wrong, even though each individual step, evaluated in isolation, looks reasonable. This makes attribution very difficult - you know the final output is wrong, but you cannot easily identify which step caused the failure. It also means step-level evaluation metrics can be misleading: good scores at each step do not guarantee a good final output.
Q: Explain the evaluation pyramid for agents.
A: The evaluation pyramid has five layers, each with different characteristics. At the base, unit tests check individual components (tool parsers, prompt formatters) - fast, cheap, run on every commit. Integration tests run end-to-end agent trajectories on a small curated set - run per PR or daily. LLM-as-judge provides automated quality scoring at scale - run weekly or pre-release. Human evaluation is the highest-quality signal - run quarterly or before major releases. At the top, production monitoring provides continuous real-world signal. Each layer informs the ones below: a production anomaly becomes an integration test, a pattern of human eval failures becomes a judge rubric.
Q: What is a latent failure in agent evaluation, and how do you detect it?
A: A latent failure is when an agent scores well on your evaluation metrics but fails in practice. For example, a support agent might have a 94% "task completion rate" measured by whether it produces a final response - but 31% of those completions might have needed a follow-up correction from the user, and 8% might contain incorrect information delivered confidently. Latent failures arise from a mismatch between your proxy metric and the true goal. Detection requires multi-dimensional evaluation: do not rely on a single metric. Measure task completion, output quality, user correction rate, and escalation rate together. When metrics diverge from each other, investigate.
Q: How would you design an eval strategy for a new agent with no existing eval set?
A: I would start by defining success criteria with domain experts - what does a good output look like? What does a bad one look like? Then collect 50-100 representative production queries (or synthetic ones if production is not available yet), covering typical cases, edge cases, and known difficult scenarios. For each task, I would note the expected properties of a good answer (not a specific answer). I would run the agent on all tasks, review the outputs manually to establish a baseline quality score, and identify the top failure modes. Those failure modes become the first regression tests. Then I would add an LLM-as-judge evaluation for scalable automated scoring, calibrated against my manual scores. Finally, I would set up production monitoring to detect drift and continuously add production failures to the eval set.
