How to evaluate multi-step agent trajectories. Task completion, path quality, error recovery, efficiency, and LLM-as-judge. Benchmarks and trajectory scorers.

Evaluation of Long-Horizon Tasks

The Measurement Problem

You have built an agent. It runs for 45 minutes, makes 67 tool calls, processes 12 documents, and produces a final report. Was it good?

The question is harder than it looks. The final output might be excellent even if the agent took 20 unnecessary steps. Or the output might be technically correct but miss the user's actual intent. Or the agent may have recovered brilliantly from two failures mid-run, demonstrating robust behavior - or failed to recover from a third, demonstrating fragility.

Standard evaluation metrics from NLP (BLEU, ROUGE, accuracy) do not capture any of this. They evaluate a single output. They have no model of the process that produced it.

Evaluating long-horizon agents requires measuring trajectories - the full sequence of states, actions, and observations - not just the final output.


Why Standard Evaluation Fails

The specific problems with binary evaluation for agent trajectories:

The multiple valid paths problem: for most real tasks, many different action sequences lead to the correct answer. Binary success treats them all as equal, so an agent that reaches the goal in 5 steps scores no better than one that needs 8.

The partial progress problem: an agent that completes 9 of 10 subtasks before failing at the last one has done useful work. Binary failure scores ignore this entirely.

The recovery credit problem: two agents both hit a failure at step 6. Agent A recovers gracefully and completes the task, while agent B restarts from scratch and wastes resources before it finishes. Binary success scores give both the same rating.

The efficiency problem: an agent that completes a task with 10 tool calls is meaningfully better than one that completes the same task with 50 calls, but pure binary evaluation misses this.
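A small sketch makes these problems concrete. The `Run` record and its fields are hypothetical, chosen only for illustration; they are not part of the evaluator built later in this lesson.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical summary of one agent run (illustrative fields only)."""
    name: str
    succeeded: bool
    subtasks_done: int
    subtasks_total: int
    tool_calls: int


def binary_score(run: Run) -> float:
    # Binary evaluation: 1.0 on success, 0.0 otherwise.
    return 1.0 if run.succeeded else 0.0


def partial_credit(run: Run) -> float:
    # Fraction of subtasks finished, regardless of final success.
    return run.subtasks_done / run.subtasks_total


runs = [
    Run("lean", succeeded=True, subtasks_done=10, subtasks_total=10, tool_calls=10),
    Run("wasteful", succeeded=True, subtasks_done=10, subtasks_total=10, tool_calls=50),
    Run("almost", succeeded=False, subtasks_done=9, subtasks_total=10, tool_calls=12),
    Run("lost", succeeded=False, subtasks_done=1, subtasks_total=10, tool_calls=12),
]

for run in runs:
    print(
        f"{run.name:9s} binary={binary_score(run):.1f} "
        f"partial={partial_credit(run):.1f} calls={run.tool_calls}"
    )
```

Under binary scoring, "lean" and "wasteful" are indistinguishable, and so are "almost" and "lost". The extra columns are what a trajectory-level evaluation has to recover.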

Evaluation Dimensions

1. Task Completion

Did the agent achieve the goal? This has two variants:

Binary completion: yes or no. Simple but loses all nuance.

Partial credit: what percentage of the goal was achieved? Requires decomposing the task into subtasks and checking each independently.

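def is_similar(expected: str, achieved: str) -> bool:
    """Placeholder matcher (an assumption, not a fixed API): case-insensitive equality.
    Swap in fuzzy or embedding-based similarity for real use."""
    return expected.strip().lower() == achieved.strip().lower()


def compute_task_completion(goal_subtasks: list[str], achieved: list[str]) -> float:
    """
    Partial credit completion score.
    Returns 0.0–1.0 where 1.0 = all subtasks achieved.
    """
    if not goal_subtasks:
        return 1.0
    # Count each goal subtask that matches at least one achieved item
    matched = sum(1 for task in goal_subtasks if any(is_similar(task, a) for a in achieved))
    return matched / len(goal_subtasks)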

2. Efficiency

How many steps/tokens/dollars did it cost?

Step efficiency: steps taken / minimum possible steps (from critical path)
Token efficiency: total tokens used / estimated minimum tokens
Cost efficiency: actual cost / estimated minimum cost for the task

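With these ratios, 1.0 means optimal and anything above 1.0 means wasted work. The full implementation later in this lesson reports the inverse (estimated minimum / actual, capped at 1.0), so its efficiency scores fall in 0.0–1.0 and higher is better.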
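As a rough sketch, the three ratios reduce to one helper. The function names and the example numbers are assumptions for illustration; `normalized_efficiency` shows the inverted, capped form for cases where a 0.0–1.0 score is more convenient.

```python
def efficiency_ratio(actual: float, minimum: float) -> float:
    """actual / minimum: 1.0 is optimal, larger values mean wasted work."""
    if minimum <= 0:
        raise ValueError("minimum must be positive")
    return actual / minimum


def normalized_efficiency(actual: float, minimum: float) -> float:
    """minimum / actual, capped at 1.0: higher is better."""
    if actual <= 0:
        return 1.0  # nothing was spent, so nothing was wasted
    return min(1.0, minimum / actual)


# Hypothetical run: 50 tool calls where 10 would do, 12,000 tokens vs. 5,000
step_ratio = efficiency_ratio(50, 10)
token_ratio = efficiency_ratio(12_000, 5_000)
step_score = normalized_efficiency(50, 10)

print(f"step ratio={step_ratio:.1f}, token ratio={token_ratio:.1f}, step score={step_score:.2f}")
```

The estimated minimum is the weak point of both forms: if it is a guess, the ratio inherits that uncertainty, so treat the scores as relative comparisons between runs of the same task.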

3. Path Quality

Were the intermediate steps sensible? This requires either:

A reference trajectory (gold standard) to compare against
An LLM judge that scores the coherence and sensibility of each step

Path quality captures things binary completion misses: did the agent take an unnecessarily roundabout route? Did it redo completed work? Did it explore dead ends it should have recognized quickly?
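When a reference trajectory exists, one simple way to compare against it is sequence similarity over action names. The sketch below uses `difflib.SequenceMatcher` from the standard library; the action names are hypothetical, and the repeated-action count is only a rough proxy for redone work.

```python
from collections import Counter
from difflib import SequenceMatcher


def path_similarity(actual: list[str], reference: list[str]) -> float:
    """Similarity in [0, 1] between two action-name sequences."""
    return SequenceMatcher(None, actual, reference).ratio()


def repeated_actions(actual: list[str]) -> dict[str, int]:
    """Actions that appear more than once, a rough signal of redone work."""
    return {name: count for name, count in Counter(actual).items() if count > 1}


# Hypothetical reference (gold) path and an agent path that re-reads a file
reference = ["read_file", "search_web", "write_file", "run_tests"]
actual = ["read_file", "search_web", "read_file", "write_file", "run_tests"]

print(f"similarity: {path_similarity(actual, reference):.2f}")
print(f"repeated:   {repeated_actions(actual)}")
```

A low similarity does not prove the path was bad, since many valid paths exist. This kind of score is best used alongside an LLM judge rather than as a replacement for one.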

4. Error Recovery

When failures occurred, how did the agent handle them?

Recognized the failure and replanned (good)
Retried with the same strategy (neutral)
Ignored the failure and continued as if it succeeded (bad)
Crashed without any recovery (worst)
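The four behaviors above can be sketched as a small classifier. Everything here is an assumption for illustration: the `acknowledged_error` flag (whether the agent's reasoning referenced the failure) would have to come from your own trajectory annotations, and the numeric scores only preserve the ordering of the list.

```python
from enum import Enum
from typing import Optional


class RecoveryBehavior(Enum):
    REPLANNED = "replanned"
    RETRIED_SAME = "retried_same"
    IGNORED = "ignored"
    CRASHED = "crashed"


# Illustrative values: only the ordering comes from the list above.
RECOVERY_SCORES = {
    RecoveryBehavior.REPLANNED: 1.0,
    RecoveryBehavior.RETRIED_SAME: 0.5,
    RecoveryBehavior.IGNORED: 0.2,
    RecoveryBehavior.CRASHED: 0.0,
}


def classify_recovery(
    failed_action: str,
    next_action: Optional[str],
    acknowledged_error: bool,
) -> RecoveryBehavior:
    """Classify what the agent did immediately after a failed action."""
    if next_action is None:
        return RecoveryBehavior.CRASHED       # run ended at the failure
    if next_action == failed_action:
        return RecoveryBehavior.RETRIED_SAME  # same strategy again
    if not acknowledged_error:
        return RecoveryBehavior.IGNORED       # moved on as if it had succeeded
    return RecoveryBehavior.REPLANNED         # noticed the failure and changed course


behavior = classify_recovery(
    failed_action="run_command:pytest --cov",
    next_action="run_command:pip install pytest-asyncio",
    acknowledged_error=True,
)
print(behavior.value, RECOVERY_SCORES[behavior])
```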

5. Human Alignment

Does the final output match what the user actually wanted? This often diverges from "did it technically succeed at the stated goal." A coding agent that writes syntactically correct but poorly structured code technically succeeds but fails on alignment.

Full Implementation: Trajectory Evaluator

"""
trajectory_evaluator.py
Multi-dimensional trajectory evaluation for long-horizon agent tasks.

Scores:
  - Task completion (partial credit)
  - Efficiency (steps, tokens, cost)
  - Path quality (LLM judge)
  - Error recovery
  - Human alignment (LLM judge)

Requirements:
    pip install openai pydantic
"""

from __future__ import annotations

import json
import textwrap
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional

from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()


# ─── Trajectory Data Models ───────────────────────────────────────────────────

class ActionType(str, Enum):
    TOOL_CALL = "tool_call"
    LLM_REASONING = "llm_reasoning"
    HUMAN_INPUT = "human_input"
    ERROR = "error"
    REPLAN = "replan"


class TrajectoryStep(BaseModel):
    """A single step in the agent trajectory."""
    step_number: int
    action_type: ActionType
    action_name: str
    arguments: dict[str, Any] = Field(default_factory=dict)
    observation: Optional[str] = None    # Result/response
    error: Optional[str] = None
    tokens_used: int = 0
    cost_usd: float = 0.0
    duration_ms: float = 0.0
    timestamp: float = Field(default_factory=time.time)
    reasoning: Optional[str] = None     # Agent's reasoning at this step
    was_necessary: Optional[bool] = None  # For manual annotation


class AgentTrajectory(BaseModel):
    """Complete record of an agent run."""
    run_id: str
    goal: str
    steps: list[TrajectoryStep]
    final_output: Optional[str] = None
    task_succeeded: Optional[bool] = None  # Ground truth
    total_tokens: int = 0
    total_cost_usd: float = 0.0
    total_duration_seconds: float = 0.0
    failure_points: list[int] = Field(default_factory=list)  # Step indices of failures
    recovery_points: list[int] = Field(default_factory=list)  # Steps where recovery happened
    metadata: dict[str, Any] = Field(default_factory=dict)

    @property
    def total_steps(self) -> int:
        return len(self.steps)

    @property
    def error_count(self) -> int:
        return sum(1 for s in self.steps if s.error)

    @property
    def recovery_count(self) -> int:
        return len(self.recovery_points)

    def get_step_sequence(self) -> str:
        """Human-readable step sequence for LLM evaluation."""
        lines = []
        for step in self.steps:
            line = f"Step {step.step_number}: {step.action_type.value} - {step.action_name}"
            if step.arguments:
                args_str = ", ".join(f"{k}={str(v)[:30]}" for k, v in list(step.arguments.items())[:3])
                line += f"({args_str})"
            if step.error:
                line += f" → ERROR: {step.error[:50]}"
            elif step.observation:
                line += f" → {step.observation[:60]}"
            lines.append(line)
        return "\n".join(lines)


# ─── Evaluation Scores ────────────────────────────────────────────────────────

@dataclass
class CompletionScore:
    binary: bool
    partial_credit: float          # 0.0–1.0
    completed_subtasks: list[str]
    incomplete_subtasks: list[str]
    notes: str = ""


@dataclass
class EfficiencyScore:
    step_efficiency: float         # min_steps / actual_steps, capped at 1.0 (higher = more efficient)
    token_efficiency: float        # 0.0–1.0 (higher = more efficient)
    cost_efficiency: float         # 0.0–1.0 (higher = more efficient)
    unnecessary_steps: int         # Steps that could be eliminated
    redundant_calls: int           # Same tool+args called twice unnecessarily
    notes: str = ""


@dataclass
class PathQualityScore:
    overall: float                 # 0.0–1.0
    coherence: float               # Were steps logically ordered?
    relevance: float               # Were steps relevant to the goal?
    groundedness: float            # Did the agent respond correctly to observations?
    notes: str = ""


@dataclass
class RecoveryScore:
    error_count: int
    recovered_count: int
    recovery_quality: float        # 0.0–1.0 (how well did recoveries work?)
    fatal_errors: int              # Errors that should have been recoverable but weren't
    notes: str = ""


@dataclass
class AlignmentScore:
    score: float                   # 0.0–1.0
    intent_match: float            # Did the output match what was intended?
    quality: float                 # Is the output high quality?
    completeness: float            # Is the output complete?
    notes: str = ""


@dataclass
class TrajectoryEvaluationResult:
    trajectory: AgentTrajectory
    completion: CompletionScore
    efficiency: EfficiencyScore
    path_quality: PathQualityScore
    recovery: RecoveryScore
    alignment: AlignmentScore
    composite_score: float         # Weighted aggregate
    grade: str                     # A/B/C/D/F

    def to_report(self) -> str:
        lines = [
            f"\n{'='*60}",
            f"TRAJECTORY EVALUATION REPORT",
            f"{'='*60}",
            f"Run ID:     {self.trajectory.run_id}",
            f"Goal:       {self.trajectory.goal[:60]}",
            f"Steps:      {self.trajectory.total_steps}",
            f"Errors:     {self.trajectory.error_count}",
            f"Cost:       ${self.trajectory.total_cost_usd:.4f}",
            f"Duration:   {self.trajectory.total_duration_seconds:.1f}s",
            f"",
            f"SCORES",
            f"{'─'*40}",
            f"Task completion:    {self.completion.partial_credit:.0%}  (binary: {'✅' if self.completion.binary else '❌'})",
            f"Efficiency:         {self.efficiency.token_efficiency:.0%}",
            f"Path quality:       {self.path_quality.overall:.0%}",
            f"Error recovery:     {self.recovery.recovery_quality:.0%}",
            f"Human alignment:    {self.alignment.score:.0%}",
            f"",
            f"COMPOSITE SCORE:    {self.composite_score:.0%}  [{self.grade}]",
            f"{'='*60}",
        ]
        if self.completion.incomplete_subtasks:
            lines.extend(["", "Incomplete subtasks:"])
            for t in self.completion.incomplete_subtasks:
                lines.append(f"  ✗ {t}")
        if self.path_quality.notes:
            lines.extend(["", f"Path notes: {self.path_quality.notes}"])
        if self.alignment.notes:
            lines.extend(["", f"Alignment notes: {self.alignment.notes}"])
        return "\n".join(lines)


# ─── LLM-Based Evaluators ─────────────────────────────────────────────────────

PATH_QUALITY_PROMPT = textwrap.dedent("""
You are an expert evaluator of AI agent trajectories.

Goal: {goal}

Agent trajectory (step sequence):
{step_sequence}

Final output: {final_output}

Evaluate the PROCESS (not just the outcome) on three dimensions:

1. **Coherence** (0.0-1.0): Were steps logically ordered? Did each step build on previous results?
2. **Relevance** (0.0-1.0): Were all steps relevant to the goal? Were there unnecessary detours?
3. **Groundedness** (0.0-1.0): Did the agent correctly interpret and use tool results?

Respond with JSON:
{{
  "coherence": 0.85,
  "relevance": 0.72,
  "groundedness": 0.90,
  "overall": 0.82,
  "notes": "Agent made good use of search results but took two unnecessary clarification steps"
}}
""").strip()

ALIGNMENT_PROMPT = textwrap.dedent("""
You are evaluating whether an AI agent's output matches the user's actual intent.

Original goal: {goal}
Final output: {final_output}

Evaluate on three dimensions:

1. **Intent match** (0.0-1.0): Does the output address what the user actually wanted?
2. **Quality** (0.0-1.0): Is the output high quality and well-executed?
3. **Completeness** (0.0-1.0): Is the output complete, or are parts missing?

Respond with JSON:
{{
  "intent_match": 0.90,
  "quality": 0.75,
  "completeness": 0.85,
  "score": 0.83,
  "notes": "Output correctly addresses the goal but the documentation section is thin"
}}
""").strip()


def evaluate_path_quality(trajectory: AgentTrajectory, model: str = "gpt-4o") -> PathQualityScore:
    """Use LLM-as-judge to evaluate the quality of the agent's reasoning path."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": PATH_QUALITY_PROMPT.format(
                goal=trajectory.goal,
                step_sequence=trajectory.get_step_sequence(),
                final_output=trajectory.final_output or "(none)",
            ),
        }],
        temperature=0.1,
        response_format={"type": "json_object"},
    )
    data = json.loads(response.choices[0].message.content)
    return PathQualityScore(
        overall=data["overall"],
        coherence=data["coherence"],
        relevance=data["relevance"],
        groundedness=data["groundedness"],
        notes=data.get("notes", ""),
    )


def evaluate_alignment(trajectory: AgentTrajectory, model: str = "gpt-4o") -> AlignmentScore:
    """Use LLM-as-judge to evaluate alignment between output and user intent."""
    if not trajectory.final_output:
        return AlignmentScore(score=0.0, intent_match=0.0, quality=0.0, completeness=0.0)

    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": ALIGNMENT_PROMPT.format(
                goal=trajectory.goal,
                final_output=trajectory.final_output,
            ),
        }],
        temperature=0.1,
        response_format={"type": "json_object"},
    )
    data = json.loads(response.choices[0].message.content)
    return AlignmentScore(
        score=data["score"],
        intent_match=data["intent_match"],
        quality=data["quality"],
        completeness=data["completeness"],
        notes=data.get("notes", ""),
    )


# ─── Deterministic Evaluators ─────────────────────────────────────────────────

def evaluate_completion(
    trajectory: AgentTrajectory,
    expected_subtasks: list[str],
) -> CompletionScore:
    """
    Check which expected subtasks were completed.
    Uses keyword matching - for production, use embedding similarity.
    """
    if not expected_subtasks:
        # No expected subtasks defined - use binary success flag
        return CompletionScore(
            binary=trajectory.task_succeeded or False,
            partial_credit=1.0 if trajectory.task_succeeded else 0.0,
            completed_subtasks=[],
            incomplete_subtasks=[],
        )

    step_text = trajectory.get_step_sequence().lower()
    final_text = (trajectory.final_output or "").lower()
    full_text = step_text + " " + final_text

    completed = []
    incomplete = []
    for subtask in expected_subtasks:
        # Simple keyword matching - replace with embedding similarity in production
        keywords = [w.lower() for w in subtask.split() if len(w) > 3]
        # Require at least one keyword; otherwise a subtask made only of short words always matches
        if keywords and sum(1 for kw in keywords if kw in full_text) >= len(keywords) * 0.6:
            completed.append(subtask)
        else:
            incomplete.append(subtask)

    partial_credit = len(completed) / len(expected_subtasks)
    binary = partial_credit >= 0.9  # at least 90% of subtasks = success

    return CompletionScore(
        binary=binary,
        partial_credit=partial_credit,
        completed_subtasks=completed,
        incomplete_subtasks=incomplete,
    )


def evaluate_efficiency(
    trajectory: AgentTrajectory,
    estimated_min_steps: int,
    estimated_min_tokens: int,
    estimated_min_cost: float,
) -> EfficiencyScore:
    """Compute efficiency scores relative to estimated minimums."""
    actual_steps = trajectory.total_steps
    actual_tokens = trajectory.total_tokens
    actual_cost = trajectory.total_cost_usd

    # Step efficiency: estimated minimum / actual, capped at 1.0 (1.0 = no extra steps)
    step_efficiency = min(1.0, estimated_min_steps / max(actual_steps, 1))

    # Token efficiency
    token_efficiency = min(1.0, estimated_min_tokens / max(actual_tokens, 1))

    # Cost efficiency
    cost_efficiency = min(1.0, estimated_min_cost / max(actual_cost, 0.0001))

    # Count obviously unnecessary steps (same action+args called twice)
    seen_calls: set[str] = set()
    redundant = 0
    for step in trajectory.steps:
        if step.action_type == ActionType.TOOL_CALL:
            call_key = f"{step.action_name}:{json.dumps(step.arguments, sort_keys=True)}"
            if call_key in seen_calls:
                redundant += 1
            seen_calls.add(call_key)

    unnecessary = max(0, actual_steps - estimated_min_steps)

    return EfficiencyScore(
        step_efficiency=step_efficiency,
        token_efficiency=token_efficiency,
        cost_efficiency=cost_efficiency,
        unnecessary_steps=unnecessary,
        redundant_calls=redundant,
    )


def evaluate_recovery(trajectory: AgentTrajectory) -> RecoveryScore:
    """Evaluate how well the agent handled errors."""
    errors = [s for s in trajectory.steps if s.error]
    error_count = len(errors)

    if error_count == 0:
        return RecoveryScore(
            error_count=0,
            recovered_count=0,
            recovery_quality=1.0,
            fatal_errors=0,
            notes="No errors encountered",
        )

    # Count recoveries recorded on the trajectory: a recovery is when an error step
    # is followed by eventual success. Cap at error_count so fatal never goes negative.
    recovered = min(len(trajectory.recovery_points), error_count)
    fatal = error_count - recovered

    # Recovery quality: what fraction of errors were recovered from?
    recovery_rate = recovered / max(error_count, 1)

    # Bonus for clean recovery (trajectory still completed)
    if trajectory.task_succeeded and recovered > 0:
        recovery_quality = min(1.0, recovery_rate * 1.1)  # Small bonus for completing despite errors
    else:
        recovery_quality = recovery_rate * 0.8  # Penalize if task failed despite "recovery"

    return RecoveryScore(
        error_count=error_count,
        recovered_count=recovered,
        recovery_quality=recovery_quality,
        fatal_errors=fatal,
        notes=f"{recovered}/{error_count} errors recovered; {fatal} fatal",
    )


# ─── Main Evaluator ───────────────────────────────────────────────────────────

class TrajectoryEvaluator:
    """
    Full trajectory evaluator combining LLM-judge and deterministic metrics.
    """

    # Weights for composite score
    WEIGHTS = {
        "completion": 0.35,
        "efficiency": 0.15,
        "path_quality": 0.20,
        "recovery": 0.15,
        "alignment": 0.15,
    }

    def __init__(
        self,
        model: str = "gpt-4o",
        expected_subtasks: Optional[list[str]] = None,
        min_steps: int = 10,
        min_tokens: int = 5000,
        min_cost: float = 0.05,
    ):
        self.model = model
        self.expected_subtasks = expected_subtasks or []
        self.min_steps = min_steps
        self.min_tokens = min_tokens
        self.min_cost = min_cost

    def evaluate(self, trajectory: AgentTrajectory) -> TrajectoryEvaluationResult:
        """Run full evaluation on a trajectory."""
        print(f"\n📊 Evaluating trajectory {trajectory.run_id[:12]}...")
        print(f"   Steps: {trajectory.total_steps}, Errors: {trajectory.error_count}")

        # Deterministic evaluations (fast)
        completion = evaluate_completion(trajectory, self.expected_subtasks)
        efficiency = evaluate_efficiency(trajectory, self.min_steps, self.min_tokens, self.min_cost)
        recovery = evaluate_recovery(trajectory)

        # LLM-based evaluations (slower, more nuanced)
        print("   Running LLM-based path quality evaluation...")
        path_quality = evaluate_path_quality(trajectory, self.model)

        print("   Running LLM-based alignment evaluation...")
        alignment = evaluate_alignment(trajectory, self.model)

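        # Composite score (weighted sum; the weights total 1.0).
        # Only token efficiency feeds the efficiency term; step and cost
        # efficiency are computed but not weighted here.
        composite = (
            completion.partial_credit * self.WEIGHTS["completion"] +
            efficiency.token_efficiency * self.WEIGHTS["efficiency"] +
            path_quality.overall * self.WEIGHTS["path_quality"] +
            recovery.recovery_quality * self.WEIGHTS["recovery"] +
            alignment.score * self.WEIGHTS["alignment"]
        )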

        grade = self._grade(composite)

        result = TrajectoryEvaluationResult(
            trajectory=trajectory,
            completion=completion,
            efficiency=efficiency,
            path_quality=path_quality,
            recovery=recovery,
            alignment=alignment,
            composite_score=composite,
            grade=grade,
        )

        print(result.to_report())
        return result

    def _grade(self, score: float) -> str:
        if score >= 0.90: return "A"
        if score >= 0.80: return "B"
        if score >= 0.70: return "C"
        if score >= 0.60: return "D"
        return "F"

    def compare_trajectories(
        self,
        trajectories: list[AgentTrajectory],
    ) -> list[TrajectoryEvaluationResult]:
        """Compare multiple agent runs on the same goal."""
        results = [self.evaluate(t) for t in trajectories]
        results.sort(key=lambda r: r.composite_score, reverse=True)

        print("\n" + "=" * 60)
        print("TRAJECTORY COMPARISON")
        print("=" * 60)
        for i, result in enumerate(results, 1):
            print(
                f"{i}. [{result.grade}] {result.trajectory.run_id[:12]} "
                f"- {result.composite_score:.0%} "
                f"({result.trajectory.total_steps} steps, "
                f"${result.trajectory.total_cost_usd:.4f})"
            )
        return results


# ─── Demo ─────────────────────────────────────────────────────────────────────

def create_sample_trajectory(
    run_id: str,
    goal: str,
    succeed: bool = True,
    n_errors: int = 0,
) -> AgentTrajectory:
    """Create a synthetic trajectory for demonstration."""
    steps = [
        TrajectoryStep(
            step_number=1, action_type=ActionType.LLM_REASONING,
            action_name="plan", tokens_used=500, cost_usd=0.001,
            reasoning="Breaking down the goal into subtasks",
            observation="Plan created with 8 steps",
        ),
        TrajectoryStep(
            step_number=2, action_type=ActionType.TOOL_CALL,
            action_name="read_file", arguments={"path": "src/main.py"},
            tokens_used=200, cost_usd=0.0004,
            observation="FastAPI app with 3 routes found",
        ),
        TrajectoryStep(
            step_number=3, action_type=ActionType.TOOL_CALL,
            action_name="search_web",
            arguments={"query": "FastAPI testing best practices 2024"},
            tokens_used=300, cost_usd=0.0006,
            observation="Retrieved pytest-asyncio and httpx patterns",
        ),
    ]

    if n_errors > 0:
        steps.append(TrajectoryStep(
            step_number=4, action_type=ActionType.ERROR,
            action_name="run_command",
            arguments={"cmd": "pytest --cov"},
            tokens_used=100, cost_usd=0.0002,
            error="ModuleNotFoundError: pytest-asyncio not installed",
        ))
        steps.append(TrajectoryStep(
            step_number=5, action_type=ActionType.TOOL_CALL,
            action_name="run_command",
            arguments={"cmd": "pip install pytest-asyncio"},
            tokens_used=100, cost_usd=0.0002,
            observation="Successfully installed pytest-asyncio",
            reasoning="Installing missing dependency to resolve error",
        ))

    steps.extend([
        TrajectoryStep(
            step_number=len(steps)+1, action_type=ActionType.TOOL_CALL,
            action_name="write_file",
            arguments={"path": "tests/test_main.py", "content": "...test code..."},
            tokens_used=800, cost_usd=0.0016,
            observation="Test file written with 12 test cases",
        ),
        TrajectoryStep(
            step_number=len(steps)+2, action_type=ActionType.TOOL_CALL,
            action_name="run_command",
            arguments={"cmd": "pytest --cov=src tests/"},
            tokens_used=200, cost_usd=0.0004,
            observation="12 passed, coverage: 87%",
        ),
    ])

    total_tokens = sum(s.tokens_used for s in steps)
    total_cost = sum(s.cost_usd for s in steps)

    return AgentTrajectory(
        run_id=run_id,
        goal=goal,
        steps=steps,
        final_output="Tests written with 87% coverage. All 12 tests pass." if succeed else None,
        task_succeeded=succeed,
        total_tokens=total_tokens,
        total_cost_usd=total_cost,
        total_duration_seconds=45.0,
        failure_points=[3] if n_errors > 0 else [],
        recovery_points=[4] if n_errors > 0 else [],
    )


def demo_evaluation():
    evaluator = TrajectoryEvaluator(
        model="gpt-4o",
        expected_subtasks=[
            "Read existing code",
            "Identify testing patterns",
            "Write test file",
            "Run tests",
            "Achieve 80% coverage",
        ],
        min_steps=5,
        min_tokens=1500,
        min_cost=0.003,
    )

    goal = "Add unit tests to our FastAPI application with at least 80% code coverage"

    # Evaluate a trajectory with error recovery
    traj = create_sample_trajectory("run_001", goal, succeed=True, n_errors=1)
    result = evaluator.evaluate(traj)


if __name__ == "__main__":
    demo_evaluation()
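To see how the composite score and grade come together without calling an LLM, here is a worked example that reuses the weights and grade thresholds from `TrajectoryEvaluator`. The five dimension scores are made up for illustration.

```python
# Same weights as TrajectoryEvaluator.WEIGHTS
WEIGHTS = {
    "completion": 0.35,
    "efficiency": 0.15,
    "path_quality": 0.20,
    "recovery": 0.15,
    "alignment": 0.15,
}


def composite(scores: dict[str, float]) -> float:
    """Weighted sum of the five dimension scores."""
    return sum(scores[name] * weight for name, weight in WEIGHTS.items())


def grade(score: float) -> str:
    """Same thresholds as TrajectoryEvaluator._grade."""
    for threshold, letter in ((0.90, "A"), (0.80, "B"), (0.70, "C"), (0.60, "D")):
        if score >= threshold:
            return letter
    return "F"


# Hypothetical dimension scores for one run
scores = {
    "completion": 0.80,
    "efficiency": 0.60,
    "path_quality": 0.82,
    "recovery": 1.00,
    "alignment": 0.83,
}

# 0.80*0.35 + 0.60*0.15 + 0.82*0.20 + 1.00*0.15 + 0.83*0.15
#   = 0.28 + 0.09 + 0.164 + 0.15 + 0.1245 = 0.8085
total = composite(scores)
print(f"{total:.4f} -> {grade(total)}")
```

Because completion carries 35% of the weight, a run with weak efficiency can still land a B. If that is not the behavior you want, adjust the weights rather than the thresholds.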

Benchmarks for Long-Horizon Tasks

| Benchmark | Task Type | Key Metric | Notes |
| --- | --- | --- | --- |
| GAIA | General assistant tasks | Completion rate | 3 levels of difficulty; 466 tasks |
| WebArena | Web navigation | Task success | Real browser, self-hosted realistic websites |
| tau-bench | Tool-use agent tasks | pass^k (success consistency) | Simulated users; tests reliability across repeated trials |
| AgentBench | Multi-domain tasks | Normalized score | 8 different environments |
| SWE-bench | Software engineering | Patch success % | Real GitHub issues |
| MINT | Multi-turn interaction | Completion + efficiency | Interactive coding tasks |

The key insight from these benchmarks: agent success rates drop sharply with task complexity. On GAIA Level 3 (hard) tasks, even top models score below 40%. Long-horizon tasks remain an open research problem.

The Oracle Problem

Ground truth for long-horizon tasks is hard to define. For a coding task, multiple valid implementations exist. For a research task, multiple valid conclusions exist. For a business analysis, multiple valid recommendations exist.

Strategies for handling the oracle problem:

Subtask-level evaluation: decompose the goal into concrete subtasks that each have objective outcomes. "Wrote a test file" is checkable even if the specific tests vary.
LLM-as-judge with rubric: provide an explicit scoring rubric to the LLM judge, not just "is this good?"
Human reference trajectories: have human experts complete the same tasks and use their trajectories as reference. Measure deviation from reference.
Functional testing: for code tasks, run tests. The tests are the oracle.
Multi-judge consensus: use multiple LLMs as judges, take the average or majority vote.
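Multi-judge consensus is the easiest of these strategies to sketch. The snippet assumes each judge has already returned a score in 0.0–1.0; the pass threshold of 0.7 and the example scores are illustrative choices, not values from any benchmark.

```python
from statistics import mean


def average_score(judge_scores: list[float]) -> float:
    """Mean of the judges' scores."""
    if not judge_scores:
        raise ValueError("need at least one judge score")
    return mean(judge_scores)


def majority_pass(judge_scores: list[float], threshold: float = 0.7) -> bool:
    """True when more than half of the judges score at or above the threshold."""
    if not judge_scores:
        raise ValueError("need at least one judge score")
    passes = sum(1 for score in judge_scores if score >= threshold)
    return passes * 2 > len(judge_scores)


# Hypothetical scores from three different judge models for one trajectory
judge_scores = [0.82, 0.78, 0.35]

print(f"average:       {average_score(judge_scores):.2f}")
print(f"majority pass: {majority_pass(judge_scores)}")
```

The two aggregations can disagree, as they do here: one harsh judge drags the average below the threshold while the majority still passes. Logging both, and flagging trajectories where they diverge, is a cheap way to find cases worth human review.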

Production Notes

:::warning LLM-as-Judge Biases

LLM judges have known biases: they favor verbose outputs over concise ones, they prefer outputs that sound confident, and they are inconsistent across runs. Always use structured rubrics, run judges multiple times and average, and validate your judge against human evaluations on a held-out set before trusting it in production.

:::

:::danger Evaluating Your Evaluator

The hardest part of trajectory evaluation is knowing whether your evaluator is accurate. Before deploying a trajectory evaluator, validate it: run it on 20–30 trajectories that humans have manually scored, and check the correlation. If the evaluator disagrees with humans more than 20% of the time, it needs calibration.

:::
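A minimal validation check might look like the sketch below. It assumes "disagreement" means the evaluator and the human reach different pass/fail verdicts at a shared threshold (0.7 here, an arbitrary choice), and it computes Pearson correlation by hand to stay dependency-free. The scores are invented, and a real check would use the 20–30 trajectories suggested above rather than six.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (spread_x * spread_y)


def disagreement_rate(evaluator: list[float], human: list[float], threshold: float = 0.7) -> float:
    """Fraction of trajectories where the pass/fail verdicts differ."""
    if len(evaluator) != len(human) or not human:
        raise ValueError("need two equal-length, non-empty lists")
    mismatches = sum((e >= threshold) != (h >= threshold) for e, h in zip(evaluator, human))
    return mismatches / len(human)


# Hypothetical composite scores for the same six trajectories
human_scores = [0.90, 0.40, 0.80, 0.20, 0.75, 0.60]
evaluator_scores = [0.85, 0.50, 0.90, 0.30, 0.65, 0.55]

rate = disagreement_rate(evaluator_scores, human_scores)
needs_calibration = rate > 0.20
print(f"correlation={pearson(evaluator_scores, human_scores):.2f}, "
      f"disagreement={rate:.0%}, needs calibration: {needs_calibration}")
```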

Regression testing: once you have a trajectory evaluator, use it for regression testing. Every time you change the agent, run it on a held-out benchmark set and compare scores to the previous version. An evaluation score decrease of >5% should block deployment.
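A deployment gate along these lines can be a few lines of code. This sketch reads the 5% rule as a relative drop in the mean composite score across the benchmark set, which is one interpretation; an absolute drop of 0.05 would be an equally defensible choice. The function name and scores are hypothetical.

```python
from statistics import mean


def should_block_deploy(
    baseline_scores: list[float],
    candidate_scores: list[float],
    max_relative_drop: float = 0.05,
) -> bool:
    """True when the candidate's mean score fell by more than max_relative_drop."""
    baseline = mean(baseline_scores)
    candidate = mean(candidate_scores)
    if baseline <= 0:
        raise ValueError("baseline mean must be positive")
    relative_drop = (baseline - candidate) / baseline
    return relative_drop > max_relative_drop


# Hypothetical composite scores on the same three held-out tasks
previous_version = [0.80, 0.90, 0.70]
small_regression = [0.80, 0.88, 0.70]
large_regression = [0.70, 0.80, 0.70]

print(should_block_deploy(previous_version, small_regression))
print(should_block_deploy(previous_version, large_regression))
```

Because LLM-judge scores vary between runs, it helps to score each version several times before comparing means, so the gate reacts to real regressions and not to judge noise.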

Interview Questions and Answers

Q: Why is evaluating agent trajectories fundamentally different from evaluating a single LLM response?

A: A single response evaluation is a point measurement - you check whether the output is correct, relevant, and high quality. A trajectory evaluation is a path measurement - you must assess whether the sequence of actions was reasonable, whether the agent made good decisions at branching points, and whether it recovered appropriately from failures. The key differences: trajectories have multiple valid paths (no single "correct" answer), they have partial credit (completing 8 of 10 subtasks is meaningfully better than completing 3), and they have emergent quality properties like efficiency and error recovery that do not appear in single-output evaluation.

Q: What is LLM-as-judge and what are its limitations?

A: LLM-as-judge uses a language model to score outputs - you show the LLM the agent's trajectory and output, ask it to score on specific dimensions, and use those scores as evaluation metrics. It scales well and handles nuanced quality dimensions that rule-based metrics cannot. Limitations: (1) position bias - LLMs tend to prefer the first or last item when comparing options; (2) verbosity bias - longer answers score higher even when shorter answers are better; (3) inconsistency - the same evaluation run twice may give different scores; (4) self-preference - a GPT-4 judge will slightly favor GPT-4 outputs. I mitigate these with structured rubrics, running multiple evaluation passes, and validating against human ground truth.

Q: How do you measure efficiency in agent trajectories?

A: Three dimensions: step efficiency (actual steps / estimated minimum steps - how much extra work was done?), token efficiency (actual tokens / estimated minimum tokens - was the agent verbose or concise?), and cost efficiency (actual cost / estimated minimum cost). I also count redundant tool calls (same action with same arguments called twice) as a signal of poor planning. Efficiency matters because an agent that achieves the same outcome with half the resources is strictly better in production.

Q: What benchmarks exist for evaluating long-horizon agents and what do their results show?

A: The key benchmarks are GAIA (general tasks, 3 difficulty levels), SWE-bench (real GitHub software engineering issues), WebArena (web navigation), and tau-bench (tool-use with realistic noise). The consistent finding across all of them is that performance degrades significantly with task complexity and horizon length. On GAIA Level 3, state-of-the-art models score below 40%. On SWE-bench, even the best systems resolve under 50% of issues. This tells us long-horizon tasks remain an open problem - current agents are much better at short, well-defined tasks than at complex, multi-step ones.

Q: How do you handle the oracle problem - the fact that many valid agent paths exist for the same goal?

A: I decompose the evaluation into components that can be assessed independently of which specific path was taken. For completion, I check whether each required outcome was achieved, not whether a specific method was used. For quality, I use LLM judges with rubrics that assess the output quality independently of how it was produced. For efficiency, I estimate minimum reasonable steps based on the task structure and measure relative to that. For some task types (code), functional tests serve as the oracle - if the tests pass, the outcome is correct regardless of implementation. The key insight is to evaluate outcomes and quality rather than path fidelity.
