What is GAIA benchmark?

GAIA tests general-purpose agents on real-world tasks requiring web search, file reading, code execution, and multi-step reasoning. Learn the task structure, scoring, SOTA analysis, and how to build GAIA-style evaluations.

GAIA Benchmark

The Test That Humbles Agents

You have spent weeks tuning your agent. It handles your internal test cases with 90% accuracy. You feel good about it.

Then you try GAIA.

GAIA - General AI Assistants - tests agents on real-world tasks that seem simple but require precise multi-step reasoning, actual web browsing, real file reading, and cross-source fact verification. Tasks like: "What is the total number of points scored in the Super Bowl games played in New Orleans?" or "In the Wikipedia article about the Eiffel Tower, what is the 7th word of the third paragraph?" These questions have exact, verifiable answers. They require real tool use, real information retrieval, and careful reasoning. They are hard.

In 2024, the best systems scored around 30% on Level 3 tasks. In the original GAIA paper, GPT-4 equipped with plugins answered 15% of the questions correctly; human respondents answered 92%.

GAIA exists precisely to measure this gap - and to provide a rigorous, meaningful target for agent improvement.

Why GAIA Was Created

The Problem With Simple Benchmarks

Before GAIA, common agent benchmarks had a serious problem: the best approaches involved pattern matching, prompt tricks, or memorized answers, rather than genuine multi-step reasoning. Models would score highly not by being good agents but by being good at the specific format of the benchmark.

GAIA was designed by researchers at Meta, Hugging Face, and AutoGPT (Mialon et al., 2023) to resist this. The design principles:

Real answers from real sources. Not questions whose answers appear prominently in training data - questions requiring actual retrieval and computation.
Multi-step necessity. Beyond the easiest tasks, no single search or tool call is sufficient. Task length ranges from one or two steps at Level 1 to 30+ steps at Level 3.
Exact answers. Not subjective quality - exact match with normalization. The answer is either right or wrong.
Diverse tools. Different tasks require different combinations of web search, file reading, code execution, image analysis, and multi-hop reasoning.

The result is a benchmark where strong performance genuinely indicates a capable general-purpose agent.

GAIA Task Structure

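GAIA contains 466 tasks in total: 165 in the public validation set (answers included) and the rest in a test set whose answers are kept private. Tasks are organized into three levels of difficulty: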

Level 1 Examples

Simple retrieval with one or two steps:

"What is the population of Iceland according to its Wikipedia article?"
"In the paper 'Attention Is All You Need', what year was it published?"
"Calculate 15% of 847."

These require at most one or two tool calls (a web search, a file read, or a quick calculation) plus extraction of the answer. A capable agent should score 70-80% on Level 1.

Level 2 Examples

Multi-source, multi-hop, cross-referencing:

"What is the sum of the populations of the three countries that border France that are not in the G7?"
"Looking at the publicly available budget spreadsheet for the city of Austin, Texas, what was the total capital expenditure in 2022 in millions?"
"According to the IMDB page for the movie released in 1994 that shares its name with an Eminem album, who directed it?"

These require 5–15 steps: finding sources, reading them, computing derived quantities, cross-referencing facts.

Level 3 Examples

Complex, adversarial, requiring precise reasoning and many tools:

"Find the Wikipedia article that describes the first joint space mission between the US and the Soviet Union. In that article, what is the third sentence of the 'Mission profile' section, and how many words does that sentence contain?"
Multi-file analysis tasks where an agent must read, parse, and synthesize information from attached files (PDFs, spreadsheets, images)

Level 3 tests the full depth of agent capability. As of 2025, state-of-the-art agents score 25–35%.

What GAIA Tests

Capability	Description	Required for Level
Web search	Finding information via search engine	1, 2, 3
Web navigation	Following links, reading pages	2, 3
File reading	PDF, spreadsheet, image parsing	2, 3
Code execution	Running Python for computation	2, 3
Multi-hop reasoning	Chaining facts across sources	2, 3
Arithmetic	Exact numerical computation	1, 2, 3
Fact verification	Checking claims against sources	2, 3
Multi-modal	Image and document understanding	3

GAIA's diversity is one of its strengths. An agent that is good at web search but poor at file reading will score poorly. An agent that is good at retrieval but poor at arithmetic will miss computational questions. High scores require genuine breadth.

GAIA Scoring

Exact Match with Normalization

GAIA uses exact match as the primary scoring criterion, with normalization to handle surface-form variation:

Strip punctuation: Remove trailing periods, commas, parentheses
Normalize numbers: "1,234" == "1234" (thousands separators are ignored)
Normalize units: "15 km" == "15 kilometers"
Case-insensitive: "Paris" == "paris"
Article normalization: "The United States" == "United States"

After normalization, the answer either matches or it does not. No partial credit in the standard GAIA evaluation.

Score Computation

$\text{GAIA Score} = \frac{\text{Correct answers}}{\text{Total tasks}}$

Separate scores are reported for each level, plus an overall score. A competitive result requires:

Level 1: 70%+
Level 2: 45%+
Level 3: 25%+
Overall: 50%+
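
As a minimal sketch of the score formula, the snippet below computes per-level and overall accuracy from a list of `(level, is_correct)` pairs. The function name `gaia_scores` and the counts are made up for illustration; they are not part of any official GAIA tooling.

```python
from collections import defaultdict

def gaia_scores(outcomes):
    """Compute per-level and overall accuracy from (level, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, ok in outcomes:
        totals[level] += 1
        correct[level] += int(ok)
    if not totals:
        raise ValueError("no outcomes to score")
    scores = {f"level_{lvl}": correct[lvl] / totals[lvl] for lvl in sorted(totals)}
    # Overall = correct answers / total tasks, pooled across all levels
    scores["overall"] = sum(correct.values()) / sum(totals.values())
    return scores

# Hypothetical run: Level 1 8/10, Level 2 3/6, Level 3 1/4
outcomes = (
    [(1, True)] * 8 + [(1, False)] * 2
    + [(2, True)] * 3 + [(2, False)] * 3
    + [(3, True)] * 1 + [(3, False)] * 3
)
scores = gaia_scores(outcomes)
print({name: f"{value:.0%}" for name, value in scores.items()})
```

Because the overall score pools all tasks, levels with more tasks carry more weight. In this made-up run the overall score is 60%, while a plain average of the three level scores would be about 52%.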

Current SOTA: 2024-2025

Model/System	Level 1	Level 2	Level 3	Overall
Human baseline	94.4%	91.7%	90.5%	92.0%
Best open-source agents (2025)	77%	55%	32%	55%
GPT-4o + advanced tools (2024)	71%	48%	25%	50%
Claude agents (2024)	68%	45%	22%	46%
GPT-4 + browsing (early 2024)	45%	28%	12%	30%
GPT-4 alone (no tools)	15%	8%	4%	10%

Key observations:

Tools are essential - GPT-4 alone scores 10%, with tools 30%+.
The human-AI gap is still enormous at Level 3 (90.5% vs 32%).
Progress is real - scores improved from 15% to 50%+ overall in 18 months.
Level 3 is the active frontier - most improvement opportunity remains there.

What Makes GAIA Hard

Answers Don't Appear in Top Search Results

For many GAIA tasks, the answer is not in the first search result. The agent must follow links, read secondary sources, and synthesize. An agent that stops at the first plausible-sounding answer fails.

Multi-Hop Chains Break Under Pressure

A 5-hop reasoning chain succeeds only about 44% of the time if each hop is 85% reliable (0.85^5 ≈ 0.44), and a 10-hop chain only about 20% of the time. Even small errors at each step compound. Level 3 tasks require 10+ hops.
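
The compounding can be checked directly. This sketch assumes every hop succeeds independently with the same probability, which is a simplification; `chain_success` is a name introduced here for illustration.

```python
def chain_success(p: float, hops: int) -> float:
    """Probability that every hop succeeds, assuming independent hops."""
    return p ** hops

for hops in (1, 5, 10, 20):
    print(f"{hops:>2} hops at 85% per hop -> {chain_success(0.85, hops):.0%}")
```

At 85% per hop, the chain drops to roughly 44% at 5 hops and roughly 20% at 10 hops.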

Precision Matters

"The population is about 350,000" fails if the answer is 348,271. GAIA rewards precision, not approximate reasoning. Agents that round, estimate, or paraphrase instead of looking up exact values score poorly.

Adversarial Phrasing

GAIA questions are designed to test exact comprehension. "What is the 7th word of the third paragraph?" requires counting. "Which countries border France that are NOT in the G7?" requires set logic. These phrasings trip up agents that pattern-match rather than reason carefully.
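
Both kinds of question reduce to deterministic operations once the text has been retrieved, which is why handing them to code is usually safer than letting the model count or filter in its head. A minimal sketch follows; the article text is placeholder prose (not the real Wikipedia article), the neighbour list is deliberately incomplete, and `nth_word` is a hypothetical helper.

```python
def nth_word(text: str, paragraph: int, word: int) -> str:
    """Return the word at 1-based position `word` of the 1-based `paragraph`.

    Paragraphs are assumed to be separated by blank lines.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    words = paragraphs[paragraph - 1].split()
    return words[word - 1]

# Placeholder text standing in for a retrieved article
article = (
    "First paragraph here.\n\n"
    "Second one.\n\n"
    "The tower was built as the centerpiece of the fair."
)
print(nth_word(article, paragraph=3, word=7))  # centerpiece

# "NOT in the G7" is a set difference
neighbours = {"Belgium", "Germany", "Italy", "Spain"}  # illustrative subset only
g7 = {"Canada", "France", "Germany", "Italy", "Japan", "United Kingdom", "United States"}
print(sorted(neighbours - g7))  # ['Belgium', 'Spain']
```

An agent with an `execute_python` tool can run this kind of snippet on the retrieved text instead of estimating the count.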

GAIA vs Other Benchmarks

Benchmark	Focus	Task Type	Answer Format	Tool Diversity
GAIA	General-purpose agents	Real-world research	Exact match	High (web + file + code)
WebArena	Web navigation	GUI interaction	Task success	Low (web only)
SWE-bench	Coding agents	GitHub issues	Test pass rate	Low (code only)
τ-bench	Tool use	API calls	Output match	Medium
AgentBench	General agents	8 environments	Environment-specific	High
MMLU	Knowledge	Multiple choice	Label match	None

When to use GAIA: when evaluating a general-purpose assistant that must search, compute, and reason. Not appropriate for specialized coding, web navigation, or pure knowledge retrieval agents.

When to use SWE-bench: when evaluating coding agents specifically. See the next lesson.

When to use WebArena: when evaluating agents that navigate real websites.

Dataset Access

GAIA is available on the HuggingFace Hub:

from datasets import load_dataset

# The dataset is gated: accept its terms on the Hub and authenticate
# (for example with `huggingface-cli login`) before loading.
# Public validation set (answers provided)
dataset = load_dataset("gaia-benchmark/GAIA", "2023_all")

# Splits
validation = dataset["validation"]   # 165 tasks, answers public
test = dataset["test"]               # answers private, submit to leaderboard

# Task structure
example = validation[0]
print(example.keys())
# Fields include 'task_id', 'Question', 'Level', 'Final answer',
# 'file_name', 'file_path', and 'Annotator Metadata'

print(f"Level: {example['Level']}")
print(f"Question: {example['Question'][:200]}")
print(f"Expected answer: {example['Final answer']}")

The validation set includes answers, making it suitable for offline development and iteration. The test set requires submission to the leaderboard for scoring.
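
Before running an agent, it helps to know how the tasks split by level and which ones come with attachments. The sketch below works on plain dictionaries so it runs without downloading anything. The rows are hypothetical stand-ins shaped like GAIA records, and it assumes `Level` may arrive as a string and that `file_name` is empty when a task has no attachment.

```python
from collections import Counter

# Hypothetical rows shaped like GAIA records (not real benchmark tasks)
rows = [
    {"task_id": "a1", "Question": "placeholder question 1", "Level": "1", "file_name": ""},
    {"task_id": "b2", "Question": "placeholder question 2", "Level": "2", "file_name": "budget.xlsx"},
    {"task_id": "c3", "Question": "placeholder question 3", "Level": "1", "file_name": ""},
]

def count_by_level(rows):
    """Count tasks per difficulty level; int() tolerates '1' as well as 1."""
    return Counter(int(row["Level"]) for row in rows)

def tasks_with_files(rows):
    """Return ids of tasks that reference an attached file."""
    return [row["task_id"] for row in rows if row.get("file_name")]

print(count_by_level(rows))
print(tasks_with_files(rows))
```

The same two functions can be called on a loaded split, since iterating over a `datasets` split yields one dictionary per row.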

Running GAIA Locally

Setup

"""
GAIA evaluation harness - run your agent on GAIA tasks and compute scores.
"""

import json
import re
import time
from dataclasses import dataclass
from typing import Optional
import anthropic

try:
    from datasets import load_dataset
    HAS_DATASETS = True
except ImportError:
    HAS_DATASETS = False
    print("Install datasets: pip install datasets")

client = anthropic.Anthropic()


# ── Answer normalization ───────────────────────────────────────────────────────

def normalize_answer(answer: str) -> str:
    """
    Normalize an answer for GAIA exact-match scoring.
    Handles numbers, units, punctuation, and case.
    """
    if not answer:
        return ""

    answer = answer.strip()

    # Remove trailing punctuation
    answer = re.sub(r'[.,;:!?]+$', '', answer)

    # Lowercase
    answer = answer.lower()

    # Remove articles at start
    answer = re.sub(r'^(the|a|an)\s+', '', answer)

    # Normalize number formatting
    # "1,234" -> "1234"
    answer = re.sub(r'(\d),(\d)', r'\1\2', answer)

    # Normalize units (kilometers, km, etc.)
    unit_map = {
        r'\bkilometers?\b': 'km',
        r'\bmeters?\b': 'm',
        r'\bmillions?\b': 'million',
        r'\bbillions?\b': 'billion',
        r'\s*\bpercent\b': '%',  # also drops the space so "15 percent" -> "15%"
    }
    for pattern, replacement in unit_map.items():
        answer = re.sub(pattern, replacement, answer)

    # Strip whitespace
    answer = ' '.join(answer.split())

    return answer


def answers_match(predicted: str, expected: str) -> bool:
    """Check if two answers match after normalization."""
    return normalize_answer(predicted) == normalize_answer(expected)


# ── GAIA task representation ───────────────────────────────────────────────────

@dataclass
class GAIATask:
    task_id: str
    question: str
    level: int
    expected_answer: str
    file_name: Optional[str] = None
    metadata: Optional[dict] = None


@dataclass
class GAIAResult:
    task_id: str
    level: int
    question: str
    expected_answer: str
    predicted_answer: Optional[str]
    correct: bool
    steps_taken: int
    total_tokens: int
    duration_seconds: float
    trajectory_summary: list[str]


# ── GAIA-capable agent ─────────────────────────────────────────────────────────

class GAIAAgent:
    """
    Agent capable of handling GAIA tasks.
    In production, replace mock tools with real implementations.
    """

    def __init__(self, max_steps: int = 30):
        self.max_steps = max_steps
        self.tools = self._build_tools()

    def _build_tools(self) -> list[dict]:
        return [
            {
                "name": "web_search",
                "description": "Search the web for factual information. Use specific queries.",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string", "description": "Specific search query"},
                        "num_results": {"type": "integer", "default": 5},
                    },
                    "required": ["query"],
                },
            },
            {
                "name": "read_webpage",
                "description": "Read the full content of a webpage given its URL.",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "url": {"type": "string", "description": "Full URL to read"},
                    },
                    "required": ["url"],
                },
            },
            {
                "name": "read_file",
                "description": "Read a file (PDF, CSV, TXT, XLSX) and extract its text content.",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "file_path": {"type": "string"},
                        "sheet_name": {"type": "string", "description": "For Excel files"},
                    },
                    "required": ["file_path"],
                },
            },
            {
                "name": "execute_python",
                "description": "Execute Python code and return the output. Use for calculations.",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "code": {"type": "string", "description": "Python code to execute"},
                    },
                    "required": ["code"],
                },
            },
        ]

    def run(self, task: GAIATask) -> tuple[Optional[str], list[str], int, int]:
        """
        Run the agent on a GAIA task.
        Returns (predicted_answer, trajectory_summary, steps_taken, total_tokens).
        """
        system = """You are a precise research agent solving GAIA benchmark tasks.

Rules:
1. Use tools to find exact, verifiable information. Never guess.
2. For numerical answers, compute exactly - do not round unless the question says to.
3. When counting words, paragraphs, or items, count carefully.
4. Provide your final answer as: FINAL ANSWER: [exact answer]
5. The answer should be concise - a number, name, date, or short phrase.
6. If you need a file, use read_file. If you need web information, use web_search first.
"""

        messages = [{"role": "user", "content": task.question}]
        trajectory = []
        total_tokens = 0
        steps = 0

        for _ in range(self.max_steps):
            steps += 1
            response = client.messages.create(
                model="claude-opus-4-6",
                max_tokens=4096,
                system=system,
                tools=self.tools,
                messages=messages,
            )
            total_tokens += response.usage.input_tokens + response.usage.output_tokens

            text = next((b.text for b in response.content if hasattr(b, "text")), "")

            if text:
                trajectory.append(f"Step {steps} [LLM]: {text[:100]}...")

            # Check for final answer
            if "FINAL ANSWER:" in text:
                answer = text.split("FINAL ANSWER:")[-1].strip()
                # Clean up the answer
                answer = answer.split('\n')[0].strip()
                return answer, trajectory, steps, total_tokens

            if response.stop_reason != "tool_use":
                # No explicit marker and no tool call (end_turn, max_tokens, ...):
                # return the raw text instead of resending identical messages
                return text.strip(), trajectory, steps, total_tokens

            if response.stop_reason == "tool_use":
                messages.append({"role": "assistant", "content": response.content})
                tool_results = []

                for block in response.content:
                    if block.type == "tool_use":
                        result = self._execute_tool(block.name, block.input, task)
                        trajectory.append(
                            f"Step {steps} [Tool:{block.name}]: {str(result)[:80]}..."
                        )
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": str(result),
                        })

                messages.append({"role": "user", "content": tool_results})

        return None, trajectory, steps, total_tokens

    def _execute_tool(self, name: str, tool_input: dict, task: GAIATask) -> str:
        """Execute a tool. Replace with real implementations."""
        if name == "web_search":
            query = tool_input.get("query", "")
            # In production: call real search API (Serper, Tavily, etc.)
            return f"[MOCK] Search results for '{query}': ..."

        elif name == "read_webpage":
            url = tool_input.get("url", "")
            # In production: use requests + BeautifulSoup or playwright
            return f"[MOCK] Content of {url}: ..."

        elif name == "read_file":
            file_path = tool_input.get("file_path", task.file_name or "")
            # In production: use PyMuPDF, openpyxl, pandas
            return f"[MOCK] Content of {file_path}: ..."

        elif name == "execute_python":
            code = tool_input.get("code", "")
            try:
                # WARNING: In production, use a sandboxed execution environment
                import io, contextlib
                stdout = io.StringIO()
                with contextlib.redirect_stdout(stdout):
                    exec(code, {"__builtins__": __builtins__})  # noqa: S102
                return stdout.getvalue() or "Code executed successfully (no output)"
            except Exception as e:
                return f"Error: {e}"

        return f"Unknown tool: {name}"


# ── GAIA evaluation runner ─────────────────────────────────────────────────────

class GAIAEvaluator:
    """Runs an agent on GAIA tasks and computes scores."""

    def __init__(self, agent: GAIAAgent):
        self.agent = agent

    def evaluate_tasks(self, tasks: list[GAIATask]) -> list[GAIAResult]:
        results = []
        for i, task in enumerate(tasks):
            print(f"Task {i+1}/{len(tasks)} (Level {task.level}): {task.question[:60]}...")
            t0 = time.time()

            predicted, trajectory, steps, tokens = self.agent.run(task)
            duration = time.time() - t0
            correct = answers_match(predicted or "", task.expected_answer)

            result = GAIAResult(
                task_id=task.task_id,
                level=task.level,
                question=task.question,
                expected_answer=task.expected_answer,
                predicted_answer=predicted,
                correct=correct,
                steps_taken=steps,
                total_tokens=tokens,
                duration_seconds=duration,
                trajectory_summary=trajectory,
            )
            results.append(result)

            status = "CORRECT" if correct else "WRONG"
            print(f"  [{status}] Expected: {task.expected_answer!r} | Got: {predicted!r}")

        return results

    def score_report(self, results: list[GAIAResult]) -> dict:
        if not results:
            return {}

        report = {"overall": {}, "by_level": {}}

        # Overall
        correct = sum(1 for r in results if r.correct)
        report["overall"] = {
            "score": correct / len(results),
            "correct": correct,
            "total": len(results),
            "avg_steps": sum(r.steps_taken for r in results) / len(results),
            "avg_tokens": sum(r.total_tokens for r in results) / len(results),
            "avg_duration_s": sum(r.duration_seconds for r in results) / len(results),
        }

        # By level
        for level in [1, 2, 3]:
            level_results = [r for r in results if r.level == level]
            if level_results:
                level_correct = sum(1 for r in level_results if r.correct)
                report["by_level"][f"level_{level}"] = {
                    "score": level_correct / len(level_results),
                    "correct": level_correct,
                    "total": len(level_results),
                }

        return report

    def print_report(self, report: dict):
        print("\n── GAIA Evaluation Report ──────────────────────")
        overall = report["overall"]
        print(f"Overall: {overall['score']:.1%} ({overall['correct']}/{overall['total']})")
        print(f"Avg steps: {overall['avg_steps']:.1f} | "
              f"Avg tokens: {overall['avg_tokens']:,.0f} | "
              f"Avg time: {overall['avg_duration_s']:.1f}s")

        print("\nBy Level:")
        for level_key, level_data in report.get("by_level", {}).items():
            print(f"  {level_key}: {level_data['score']:.1%} "
                  f"({level_data['correct']}/{level_data['total']})")


# ── Building GAIA-style tasks for your domain ──────────────────────────────────

def create_domain_specific_gaia_tasks(domain: str) -> list[GAIATask]:
    """
    Template for creating GAIA-style tasks for a specific domain.
    Designed to test the same capabilities GAIA tests: web search,
    multi-hop reasoning, exact answer extraction.
    """
    templates = {
        "finance": [
            GAIATask(
                task_id="fin_001",
                question="What was Apple's total revenue in fiscal year 2023 according to their 10-K filing?",
                level=1,
                expected_answer="383.285 billion",
            ),
            GAIATask(
                task_id="fin_002",
                question="What is the sum of the market caps of the FAANG companies as of the most recent quarter?",
                level=2,
                expected_answer="...",  # to be computed
            ),
        ],
        "engineering": [
            GAIATask(
                task_id="eng_001",
                question="According to the Python 3.12 documentation, how many new type parameter syntaxes were introduced?",
                level=1,
                expected_answer="3",
            ),
        ],
    }
    return templates.get(domain, [])


# ── Demo ───────────────────────────────────────────────────────────────────────

def demo():
    """Demo with a small set of synthetic GAIA-style tasks."""
    tasks = [
        GAIATask(
            task_id="demo_001",
            question="What is 17 multiplied by 34, plus 128?",
            level=1,
            expected_answer="706",
        ),
        GAIATask(
            task_id="demo_002",
            question="If a country has a GDP of $2.5 trillion and spends 4.2% on education, "
                     "how much is the education budget in billions?",
            level=1,
            expected_answer="105",
        ),
    ]

    agent = GAIAAgent(max_steps=10)
    evaluator = GAIAEvaluator(agent)

    results = evaluator.evaluate_tasks(tasks)
    report = evaluator.score_report(results)
    evaluator.print_report(report)


if __name__ == "__main__":
    demo()

GAIA-Style Task Design Principles

When building domain-specific GAIA-style benchmarks, follow these principles:

Principle 1: Require actual retrieval. The answer must not appear in the model's training data. Link it to a specific source that must be accessed.

Principle 2: Make the answer exact and verifiable. "A number" or "a name" - something that can be objectively checked. Avoid subjective questions.

Principle 3: Require multi-hop. One search result should not contain the answer. The agent must chain across at least 2 sources.

Principle 4: Control for shortcut avoidance. Test that the question cannot be answered by guessing the most common or plausible answer. Use specific numbers, dates, and unusual facts.

Principle 5: Test diverse tools. Ensure the benchmark covers file reading, computation, web search, and multi-modal tasks - not just web search.
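
Principle 4 can be checked mechanically by piloting each task against a model that has no tools. The sketch below is one possible shape for that check, not an established procedure: `find_shortcut_tasks` is a hypothetical helper, the task dictionaries use made-up keys, and `fake_closed_book` stands in for a real tool-free model call.

```python
from typing import Callable

def find_shortcut_tasks(tasks: list[dict], closed_book: Callable[[str], str]) -> list[str]:
    """Return ids of tasks that a tool-free model already answers correctly."""
    flagged = []
    for task in tasks:
        guess = closed_book(task["question"]).strip().lower()
        if guess == task["expected"].strip().lower():
            flagged.append(task["id"])
    return flagged

# Stand-in for a real model call without tools (hypothetical)
def fake_closed_book(question: str) -> str:
    return "Paris" if "capital of France" in question else "unknown"

tasks = [
    {"id": "t1", "question": "What is the capital of France?", "expected": "Paris"},
    {"id": "t2", "question": "What total does row 14 of the attached budget sheet report?",
     "expected": "4812"},
]
print(find_shortcut_tasks(tasks, fake_closed_book))  # ['t1']
```

Flagged tasks are candidates for rewording or removal, since they measure recall rather than tool use.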

Production Engineering Notes

Use GAIA Validation for Development

The GAIA validation set (with public answers) is appropriate for development iteration. Never tune specifically to the validation set - treat it as a proxy for the test set. Report test set scores for public comparisons.

Track Per-Capability Failure Modes

Break your GAIA failures down by required capability: web search, file reading, computation, multi-hop. If you fail 80% of computation tasks but 40% of web search tasks, your bottleneck is clear. Fix the bottleneck before optimizing elsewhere.
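
One way to get this breakdown is to tag each task with the capabilities it needs and compute a failure rate per tag. The tags and results below are hypothetical and hand-assigned; GAIA does not ship this exact labeling, so treat the input format as an assumption.

```python
from collections import defaultdict

def failure_rate_by_capability(results):
    """results: dicts with 'capabilities' (list of tags) and 'correct' (bool).

    A failed task counts against every capability it is tagged with.
    Returns {capability: failure_rate}, worst first.
    """
    attempts = defaultdict(int)
    failures = defaultdict(int)
    for r in results:
        for cap in r["capabilities"]:
            attempts[cap] += 1
            failures[cap] += int(not r["correct"])
    rates = {cap: failures[cap] / attempts[cap] for cap in attempts}
    return dict(sorted(rates.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical evaluation results
results = [
    {"capabilities": ["web_search"], "correct": True},
    {"capabilities": ["web_search", "computation"], "correct": False},
    {"capabilities": ["computation"], "correct": False},
    {"capabilities": ["file_reading"], "correct": True},
    {"capabilities": ["web_search"], "correct": True},
]
rates = failure_rate_by_capability(results)
for cap, rate in rates.items():
    print(f"{cap:<14} {rate:.0%} failed")
```

Multi-tag tasks blur attribution, so the ranking is a starting point for reading trajectories rather than a diagnosis on its own.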

Cost Budgeting

A GAIA Level 3 task can easily require 30 steps and 50,000 tokens. At scale:

115 Level 3 tasks × 50K tokens × $0.015 per 1K tokens ≈ $86 per evaluation run

Plan accordingly. Use Level 1 tasks for rapid iteration, Level 2-3 for pre-release evaluation.
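
The estimate is easy to parameterize. This sketch uses a single blended price per 1K tokens, which is a simplification: real pricing usually differs for input and output tokens. The helper name `eval_cost_usd` is made up for illustration.

```python
def eval_cost_usd(num_tasks: int, tokens_per_task: int, usd_per_1k_tokens: float) -> float:
    """Rough cost of one evaluation run at a blended per-token price."""
    return num_tasks * tokens_per_task / 1000 * usd_per_1k_tokens

# Figures from the estimate above: 115 tasks, 50K tokens each, $0.015 per 1K tokens
print(f"${eval_cost_usd(115, 50_000, 0.015):.2f}")  # $86.25
```

Multiply by the number of runs planned per week to get a development budget, since prompt changes typically trigger a full re-run.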

:::danger Do Not Overfit to the Validation Set
GAIA's validation set is public. It is tempting to tune prompts specifically for validation tasks. This produces inflated scores that do not generalize. Treat the validation set as a general capability probe, not a target to optimize. Any prompt change that improves validation performance should be explainable by a general capability improvement - not by memorizing specific question patterns.
:::

:::warning Contamination Detection
As of 2025, there is evidence that some GAIA validation tasks appear in training data of certain models. When interpreting GAIA scores, always check whether the model has known contamination with the benchmark. HuggingFace provides guidance on checking for overlap with model training sets.
:::

Interview Q&A

Q: What is GAIA and what makes it a good agent benchmark?

A: GAIA (General AI Assistants) is a benchmark developed by researchers at Meta, Hugging Face, and AutoGPT that tests agents on real-world tasks requiring web search, file reading, code execution, and multi-step reasoning. It is a good benchmark for several reasons: answers are exact and verifiable (no subjective scoring), tasks genuinely require multi-step reasoning (cannot be solved by pattern matching), they require diverse tools (web, file, code), and the human baseline is very high (92%), which means there is meaningful room between current agent performance and human-level. The three difficulty levels (Level 1: one or two steps; Level 2: roughly 5–15 steps; Level 3: long chains that can exceed 30 steps) allow granular assessment of capability.

Q: An agent scores 70% on GAIA Level 1 but only 25% on Level 3. What does this tell you about the agent?

A: This gap reveals a specific capability limitation: the agent can handle simple single-hop retrieval and computation (Level 1) but breaks down on complex multi-hop reasoning chains (Level 3). The most likely causes are: context window management failure (too many accumulated steps exceed what the model can reason over effectively), compounding error rates (at 85% success per hop, a 10-hop chain succeeds only about 20% of the time), poor planning (the agent does not decompose Level 3 problems into manageable sub-questions), or precision errors (small inaccuracies at intermediate steps invalidate the final answer). I would investigate by analyzing which step in failed Level 3 trajectories first deviates from the correct path.

Q: How does GAIA differ from MMLU or similar knowledge benchmarks?

A: MMLU tests knowledge recall - it is essentially a multiple-choice test of what information is in a model's weights. GAIA tests agentic capability - the ability to retrieve, process, and reason over information using tools. An LLM with no tool access scores ~10% on GAIA and ~75% on MMLU. This distinction matters for production: if your agent relies on tools to serve users, MMLU scores are almost irrelevant to predicting production quality. GAIA is a much better proxy for general-purpose agent capability.

Q: What is the "multi-hop compound error" problem in GAIA and how do you address it?

A: In GAIA tasks requiring N hops, if each hop has probability p of being correct, the probability of all N hops being correct is $p^N$. At p=0.85 and N=10, success probability is only 20%. This is the compound error problem. Mitigation strategies: intermediate verification (after each hop, verify the extracted fact against the source), explicit uncertainty tracking (the agent tracks confidence per step and re-searches when uncertain), decomposition prompting (breaking the task into clearly named sub-questions and solving each independently), and error recovery (when a late-stage fact contradicts earlier findings, backtrack and re-solve the conflicting hop).
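
To put numbers on the mitigation side, the sketch below inverts $p^N$ to find the per-hop reliability a target success rate demands, and shows what one verified retry per hop would buy. The retry model is idealized: it assumes failures are always detected and that the second attempt is independent of the first. Both function names are introduced here for illustration.

```python
def required_hop_reliability(target: float, hops: int) -> float:
    """Per-hop success rate p such that p ** hops equals the target."""
    return target ** (1 / hops)

def with_one_retry(p: float) -> float:
    """Per-hop success if a detected failure is retried once (idealized)."""
    return 1 - (1 - p) ** 2

# Reliability needed per hop for a 50% chance over a 10-hop chain
print(f"needed per hop: {required_hop_reliability(0.5, 10):.1%}")

# Effect of one verified retry on an 85%-reliable hop, then on a 10-hop chain
p_retry = with_one_retry(0.85)
print(f"per hop with retry: {p_retry:.2%}")
print(f"10-hop chain with retry: {p_retry ** 10:.0%}")
```

Under these assumptions, a 10-hop chain needs about 93% reliability per hop just to reach even odds, and a single retry per hop lifts the 10-hop success rate from roughly 20% to roughly 80%.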

Q: You want to build a GAIA-style benchmark for a specific domain. What are the key design principles?

A: The five key principles are: require actual retrieval (answers must not be in training data, must come from specified sources), use exact verifiable answers (numbers, names, dates - not subjective quality), require multi-hop reasoning (no single tool call should suffice), prevent shortcut answers (questions should not be answerable by the most plausible guess), and cover diverse tools (mix web search, file reading, code execution, and multi-modal tasks). The hardest part is "shortcut avoidance" - pilot the tasks with a strong model and see if it answers correctly without using tools. If it does, the task is too easy for a benchmark.

Advanced: GAIA Failure Analysis Pipeline

Once you have run your agent against GAIA, systematic failure analysis reveals which capabilities to improve next. Here is a complete failure analysis framework:

from dataclasses import dataclass
from typing import Optional
import json
import re
from collections import defaultdict


@dataclass
class GAIAFailureAnalysis:
    instance_id: str
    level: int
    question: str
    expected: str
    predicted: Optional[str]
    status: str  # "wrong_answer", "no_answer", "timeout", "stuck", "format_error"
    steps_taken: int
    failure_step: Optional[int]  # Which step in the trajectory failed
    failure_category: Optional[str]  # "goal_drift", "hallucination", "unit_error", etc.
    trajectory_summary: list[str]


def categorize_failure(
    question: str,
    expected: str,
    predicted: Optional[str],
    trajectory: list[str],
) -> str:
    """
    Heuristically categorize a GAIA failure by type.
    Categories:
    - "no_answer": agent did not produce a final answer
    - "unit_error": correct number, wrong units
    - "precision_error": approximately correct, not exact
    - "wrong_source": retrieved from wrong source
    - "goal_drift": answered a different question
    - "table_misread": likely read a table incorrectly
    - "counting_error": off-by-one or counting mistake
    - "other": unclassified
    """
    if predicted is None:
        return "no_answer"

    pred_clean = predicted.lower().strip()
    exp_clean = expected.lower().strip()

    # Check for unit errors: numbers match but text differs
    pred_nums = re.findall(r'\d+(?:\.\d+)?', pred_clean)
    exp_nums = re.findall(r'\d+(?:\.\d+)?', exp_clean)

    if pred_nums and exp_nums and pred_nums[0] == exp_nums[0]:
        if pred_clean != exp_clean:
            return "unit_error"

    # Compare the leading numbers when both answers contain one
    if pred_nums and exp_nums:
        pred_val = float(pred_nums[0])
        exp_val = float(exp_nums[0])

        # Counting errors: both values are whole numbers that differ by exactly one.
        # Checked first, otherwise 100 vs 101 would be labeled a precision error.
        if pred_val.is_integer() and exp_val.is_integer() and abs(pred_val - exp_val) == 1:
            return "counting_error"

        # Precision errors: within 10% of the expected value but not exact
        if abs(pred_val - exp_val) / max(abs(exp_val), 1e-8) < 0.1:
            return "precision_error"

    # Heuristic: table questions that fail often involve table misreads
    table_keywords = ["table", "column", "row", "spreadsheet", "excel", "csv"]
    if any(kw in question.lower() for kw in table_keywords):
        return "table_misread"

    # Heuristic: if trajectory is long but answer is wrong, likely goal drift
    if len(trajectory) > 15:
        return "goal_drift"

    return "other"


def analyze_failures(
    results: list[dict],
    trajectory_data: dict,  # {instance_id: list[str]}
) -> dict:
    """
    Perform systematic failure analysis on GAIA results.
    Returns actionable breakdown by failure category.
    """
    failures = [r for r in results if not r["correct"]]

    category_counts = defaultdict(int)
    level_category_counts = defaultdict(lambda: defaultdict(int))

    analyses = []
    for r in failures:
        trajectory = trajectory_data.get(r.get("task_id", ""), [])
        category = categorize_failure(
            r["question"],
            r["expected"],
            r.get("predicted"),
            trajectory,
        )
        category_counts[category] += 1
        level_category_counts[r["level"]][category] += 1

        analyses.append(GAIAFailureAnalysis(
            instance_id=r.get("task_id", ""),
            level=r["level"],
            question=r["question"][:100],
            expected=r["expected"],
            predicted=r.get("predicted"),
            # None means no final answer, matching the check in categorize_failure
            status="wrong_answer" if r.get("predicted") is not None else "no_answer",
            steps_taken=r.get("steps", 0),
            failure_step=None,  # Would require more detailed trajectory analysis
            failure_category=category,
            trajectory_summary=trajectory[:5],
        ))

    total_failures = len(failures)
    total_tasks = len(results)

    return {
        "summary": {
            "total_tasks": total_tasks,
            "total_failures": total_failures,
            "failure_rate": total_failures / total_tasks if total_tasks > 0 else 0,
        },
        "failure_categories": {
            cat: {
                "count": count,
                "rate": count / total_failures if total_failures > 0 else 0,
                "recommended_fix": {
                    "no_answer": "Add explicit FINAL ANSWER: format requirement; increase max iterations",
                    "unit_error": "Add unit normalization step to agent reasoning chain",
                    "precision_error": "Require agent to compute exact value, not approximate",
                    "counting_error": "Add explicit verification step for counting tasks",
                    "table_misread": "Use structured table parser instead of raw text extraction",
                    "goal_drift": "Add goal re-statement at each reasoning step",
                    "other": "Manual review required - examine specific failure cases",
                }.get(cat, "Review specific cases"),
            }
            for cat, count in sorted(category_counts.items(), key=lambda x: x[1], reverse=True)
        },
        "by_level": {
            f"level_{level}": dict(cats)
            for level, cats in level_category_counts.items()
        },
    }


def print_failure_report(report: dict):
    """Print a human-readable failure analysis report."""
    summary = report["summary"]
    print(f"\n── GAIA Failure Analysis ────────────────────────")
    print(f"Total tasks:    {summary['total_tasks']}")
    print(f"Total failures: {summary['total_failures']} ({summary['failure_rate']:.1%})")
    print(f"\nFailure categories (most common first):")

    for cat, data in report["failure_categories"].items():
        print(f"\n  {cat}: {data['count']} ({data['rate']:.1%} of failures)")
        print(f"    Recommended fix: {data['recommended_fix']}")

    print(f"\nBy level:")
    for level_key, cats in report["by_level"].items():
        print(f"\n  {level_key}:")
        for cat, count in sorted(cats.items(), key=lambda x: x[1], reverse=True):
            print(f"    {cat}: {count}")

Using Failure Analysis to Guide Improvement

The failure analysis output tells you where to invest improvement effort:

| Dominant Failure Category | Root Cause | Fix |
| --- | --- | --- |
| `no_answer` | Agent times out or gets stuck | Increase max_iterations; add explicit stopping criteria |
| `unit_error` | Agent retrieves right number, wrong unit | Add unit normalization to reasoning chain |
| `precision_error` | Agent rounds or estimates | Require exact computation with Python interpreter |
| `table_misread` | Agent reads wrong row/column | Add structured table parsing tool |
| `goal_drift` | Agent forgets original question | Add goal re-statement to system prompt |
| `counting_error` | Off-by-one in counting tasks | Add explicit verify-by-counting step |

The power of systematic failure analysis: instead of vaguely knowing "the agent is not good at Level 3," you know specifically where it fails. For example, the report might show that 38% of Level 3 failures are goal_drift and 24% are table_misread. Both categories have known fixes that can often be implemented and verified within a week.
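To turn raw failure counts into percentages like the ones in that example, a small helper can normalize each level's counts. The sketch below assumes input shaped like the `by_level` entry returned by `analyze_failures`. The function name `category_shares_by_level` and the counts are hypothetical and are not real GAIA results.

```python
def category_shares_by_level(
    by_level: dict[str, dict[str, int]],
) -> dict[str, dict[str, float]]:
    """Convert per-level failure counts into per-level shares (0.0 to 1.0)."""
    shares = {}
    for level_key, counts in by_level.items():
        total = sum(counts.values())
        if total == 0:
            shares[level_key] = {}
            continue
        # Most common category first, so the top entries are the ones to fix.
        ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
        shares[level_key] = {cat: n / total for cat, n in ranked}
    return shares


# Hypothetical counts, shaped like report["by_level"]; not real GAIA results.
example_by_level = {
    "level_2": {"precision_error": 6, "table_misread": 4, "other": 10},
    "level_3": {"goal_drift": 19, "table_misread": 12, "no_answer": 9, "other": 10},
}

shares = category_shares_by_level(example_by_level)
for level_key, cats in shares.items():
    top_two = ", ".join(f"{cat} {share:.0%}" for cat, share in list(cats.items())[:2])
    print(f"{level_key}: {top_two}")
```

Shares are computed within each level rather than across all failures, so a category that dominates one level is not hidden by a larger number of failures at another level.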

GAIA as a Capability Radar

Thinking of GAIA as a radar chart of agent capabilities, rather than a single score, gives a more complete picture.

The radar reveals a consistent pattern across current systems: single-step retrieval is reliable, multi-step reasoning is inconsistent, and complex planning with backtracking is the frontier. Agent development efforts that target the weakest axis tend to produce the most GAIA score improvement per engineering-hour invested.
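The values for such a radar can be computed as per-capability accuracy. The sketch below assumes each result dict carries a `capabilities` list. That field is hypothetical: it is not part of the result format used above, and you would need to tag tasks yourself. The tag names and example results are invented for illustration.

```python
from collections import defaultdict


def capability_accuracy(results: list[dict]) -> dict[str, float]:
    """Return accuracy per capability tag; each value is one radar axis."""
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for r in results:
        # "capabilities" is an assumed, manually assigned tag list.
        for cap in r.get("capabilities", []):
            attempts[cap] += 1
            if r["correct"]:
                successes[cap] += 1
    return {cap: successes[cap] / attempts[cap] for cap in attempts}


# Invented example results; not real GAIA data.
example_results = [
    {"task_id": "a", "correct": True, "capabilities": ["web_search"]},
    {"task_id": "b", "correct": True, "capabilities": ["web_search", "file_reading"]},
    {"task_id": "c", "correct": False, "capabilities": ["web_search", "multi_step_planning"]},
    {"task_id": "d", "correct": False, "capabilities": ["file_reading", "multi_step_planning"]},
]

radar = capability_accuracy(example_results)
weakest = min(radar, key=radar.get)
for cap, acc in sorted(radar.items(), key=lambda kv: kv[1]):
    print(f"{cap}: {acc:.0%}")
print(f"Weakest axis: {weakest}")
```

A task that needs several capabilities counts toward each of its tags, so a single failure lowers every axis it touches. That is a deliberate simplification; attributing a failure to one capability requires trajectory review.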

GAIA is ultimately not a score to maximize - it is a diagnostic tool. Its value is in telling you exactly where your agent's capabilities break down, with enough granularity to guide targeted improvements. Use it that way, and it will pay dividends across your entire evaluation and improvement workflow.

:::tip Start With Level 2
When first running your agent against GAIA, focus your analysis on Level 2 results. Level 1 is too easy to be diagnostic for capable agents, and Level 3 requires capabilities most agents do not yet have. Level 2 questions, which require 5-15 steps across multiple tools, sit in the range where agent architectural choices (planning strategy, context management, tool design) make the most difference. Improving Level 2 performance from 30% to 50% is achievable in weeks, reveals your most impactful capability gaps, and correlates better with real-world assistant usefulness than either Level 1 or Level 3 scores.
:::

Relationship to Other Evaluation Methods

GAIA is one part of a complete agent evaluation strategy. Here is how it fits with the other techniques in this module:

| Method | What It Measures | When to Use |
| --- | --- | --- |
| GAIA | General multi-step reasoning with diverse tools | Overall capability assessment, SOTA comparison |
| SWE-bench Verified | Coding-specific bug fixing | Coding agent evaluation |
| Trajectory Evaluation | Step efficiency, backtracking, cost | Any agent, continuous CI/CD monitoring |
| LLM Judge | Holistic quality on open-ended tasks | Customer support, research, writing agents |
| Human Evaluation | Ground truth quality signal | Calibration, safety review, novel task types |
| Production Monitoring | Real-world performance on actual users | After deployment, continuous quality tracking |

No single method is sufficient. GAIA tells you whether your agent can handle complex real-world tasks - but it does not tell you whether it is efficient (trajectory evaluation), safe (human evaluation), or performing well on your specific user base (production monitoring). Build the full stack.

The investment in GAIA evaluation is justified by what it reveals: systematic capability gaps that internal testing consistently misses, because internal tests are written by the same engineers who built the agent and therefore target the capabilities the agent already has. GAIA tests capabilities the agent might not have, on task types nobody on your team thought to include in the test suite. That is its irreplaceable value.

:::note Benchmark Literacy
When reading AI research papers or vendor claims, always ask: which benchmark? At which difficulty level? With which tools? By which scoring method? A system that reports "60% on GAIA" without specifying whether that is the full benchmark, the validation set, Level 1 only, or a cherry-picked subset is reporting a number that cannot be interpreted. The benchmark details matter as much as the score. This applies to GAIA, SWE-bench, and every other evaluation framework.
:::
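One way to make that checklist routine is to record every reported score together with its context and flag whatever is missing. The sketch below is illustrative only: the `BenchmarkClaim` class and its field names are hypothetical, chosen to mirror the questions in the note above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BenchmarkClaim:
    """A reported score plus the context needed to interpret it (hypothetical schema)."""
    score: float
    benchmark: Optional[str] = None  # which benchmark?
    split: Optional[str] = None      # e.g. "validation" or "test"
    levels: Optional[str] = None     # e.g. "all" or "1"
    tools: Optional[str] = None      # which tools the agent could use
    scoring: Optional[str] = None    # e.g. "exact match"

    def missing_context(self) -> list[str]:
        """Names of the context fields that were not reported."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# A claim like "60% on GAIA" with no further detail.
vague = BenchmarkClaim(score=0.60, benchmark="GAIA")
print("Ask about:", ", ".join(vague.missing_context()))
```

An empty list from `missing_context()` does not make a claim trustworthy. It only means the minimum context needed to compare it with other results has been stated.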

The ability to critically interpret benchmark claims is as valuable as the ability to run evaluations yourself. When GAIA scores appear in papers, press releases, or vendor pitches, you now have the framework to evaluate what those scores actually mean - and what questions to ask when they do not tell the complete story.

Continue to the next lesson - SWE-bench Verified - for the domain-specific complement to GAIA: rigorous evaluation of coding agents on real production issues.

Every point gained on GAIA Level 3 represents a genuine capability improvement that will transfer to real production tasks. Measure it carefully, understand what drives it, and let the data guide your next iteration.
