GAIA Benchmark
The Test That Humbles Agents
You have spent weeks tuning your agent. It handles your internal test cases with 90% accuracy. You feel good about it.
Then you try GAIA.
GAIA - General AI Assistants - tests agents on real-world tasks that seem simple but require precise multi-step reasoning, actual web browsing, real file reading, and cross-source fact verification. Tasks like: "What is the total number of points scored in the Super Bowl games played in New Orleans?" or "In the Wikipedia article about the Eiffel Tower, what is the 7th word of the third paragraph?" These questions have exact, verifiable answers. They require real tool use, real information retrieval, and careful reasoning. They are hard.
In 2024, the best agent systems scored around 30% on Level 3 tasks. In the original GAIA paper, GPT-4 equipped with plugins answered about 15% of questions correctly. Human respondents: 92%.
GAIA exists precisely to measure this gap - and to provide a rigorous, meaningful target for agent improvement.
Why GAIA Was Created
The Problem With Simple Benchmarks
Before GAIA, common agent benchmarks had a serious problem: the best approaches involved pattern matching, prompt tricks, or memorized answers, rather than genuine multi-step reasoning. Models would score highly not by being good agents but by being good at the specific format of the benchmark.
GAIA was designed by researchers at Meta, HuggingFace, and AutoGPT (Mialon et al., 2023) to resist this. The design principles:
- Real answers from real sources. Not questions whose answers appear prominently in training data - questions requiring actual retrieval and computation.
- Multi-step necessity. No single search or tool call is sufficient. Tasks are structured to require 3–30+ steps.
- Exact answers. Not subjective quality - exact match with normalization. The answer is either right or wrong.
- Diverse tools. Different tasks require different combinations of web search, file reading, code execution, image analysis, and multi-hop reasoning.
The result is a benchmark where strong performance genuinely indicates a capable general-purpose agent.
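GAIA Task Structure
GAIA contains 466 questions: 165 in the public validation set (answers released) and 301 in the test set (answers withheld for the leaderboard). Tasks are organized into three levels of difficulty:
Level 1 Examples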
Simple retrieval with one or two steps:
- "What is the population of Iceland according to its Wikipedia article?"
- "In the paper 'Attention Is All You Need', what year was it published?"
- "Calculate 15% of 847."
These require a single tool call (web search or file read) plus extraction of the answer. A capable agent should score 70-80% on Level 1.
Level 2 Examples
Multi-source, multi-hop, cross-referencing:
- "What is the sum of the populations of the three countries that border France that are not in the G7?"
- "Looking at the publicly available budget spreadsheet for the city of Austin, Texas, what was the total capital expenditure in 2022 in millions?"
- "According to the IMDB page for the movie released in 1994 that shares its name with an Eminem album, who directed it?"
These require 5–15 steps: finding sources, reading them, computing derived quantities, cross-referencing facts.
Level 3 Examples
Complex, adversarial, requiring precise reasoning and many tools:
- "Find the Wikipedia article that describes the first joint space mission between the US and the Soviet Union. In that article, what is the third sentence of the 'Mission profile' section, and how many words does that sentence contain?"
- Multi-file analysis tasks where an agent must read, parse, and synthesize information from attached files (PDFs, spreadsheets, images)
Level 3 tests the full depth of agent capability. As of 2025, state-of-the-art agents score 25–35%.
What GAIA Tests
| Capability | Description | Required for Level |
|---|---|---|
| Web search | Finding information via search engine | 1, 2, 3 |
| Web navigation | Following links, reading pages | 2, 3 |
| File reading | PDF, spreadsheet, image parsing | 2, 3 |
| Code execution | Running Python for computation | 2, 3 |
| Multi-hop reasoning | Chaining facts across sources | 2, 3 |
| Arithmetic | Exact numerical computation | 1, 2, 3 |
| Fact verification | Checking claims against sources | 2, 3 |
| Multi-modal | Image and document understanding | 3 |
GAIA's diversity is one of its strengths. An agent that is good at web search but poor at file reading will score poorly. An agent that is good at retrieval but poor at arithmetic will miss computational questions. High scores require genuine breadth.
GAIA Scoring
Exact Match with Normalization
GAIA uses exact match as the primary scoring criterion, with normalization to handle surface-form variation:
- Strip punctuation: Remove trailing periods, commas, parentheses
- Normalize numbers: "1,234" == "1234" (thousands separators are removed)
- Normalize units: "15 km" == "15 kilometers"
- Case-insensitive: "Paris" == "paris"
- Article normalization: "The United States" == "United States"
After normalization, the answer either matches or it does not. No partial credit in the standard GAIA evaluation.
Score Computation
Separate scores are reported for each level, plus an overall score. A competitive result requires:
- Level 1: 70%+
- Level 2: 45%+
- Level 3: 25%+
- Overall: 50%+
Current SOTA: 2024-2025
| Model/System | Level 1 | Level 2 | Level 3 | Overall |
|---|---|---|---|---|
| Human baseline | 94.4% | 91.7% | 90.5% | 92.0% |
| Best open-source agents (2025) | 77% | 55% | 32% | 55% |
| GPT-4o + advanced tools (2024) | 71% | 48% | 25% | 50% |
| Claude agents (2024) | 68% | 45% | 22% | 46% |
| GPT-4 + browsing (early 2024) | 45% | 28% | 12% | 30% |
| GPT-4 alone (no tools) | 15% | 8% | 4% | 10% |
Key observations:
- Tools are essential - GPT-4 alone scores 10%, with tools 30%+.
- The human-AI gap is still enormous at Level 3 (90.5% vs 32%).
- Progress is real - scores improved from 15% to 50%+ overall in 18 months.
- Level 3 is the active frontier - most improvement opportunity remains there.
What Makes GAIA Hard
Answers Don't Appear in Top Search Results
For many GAIA tasks, the answer is not in the first search result. The agent must follow links, read secondary sources, and synthesize. An agent that stops at the first plausible-sounding answer fails.
Multi-Hop Chains Break Under Pressure
If each hop is 85% reliable, a 5-hop reasoning chain succeeds only about 44% of the time (0.85^5 ≈ 0.44), and a 10-hop chain only about 20%. Even small errors at each step compound. Level 3 tasks require 10+ hops.
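The compounding effect is easy to check numerically. The sketch below assumes every hop succeeds independently with the same reliability, which is a simplification (real hops are correlated, and some are harder than others). The function name `chain_success_probability` is illustrative, not part of any GAIA tooling.

```python
def chain_success_probability(per_hop_reliability: float, hops: int) -> float:
    """Probability that every hop succeeds, assuming independent hops."""
    if not 0.0 <= per_hop_reliability <= 1.0:
        raise ValueError("per_hop_reliability must be between 0 and 1")
    if hops < 0:
        raise ValueError("hops must be non-negative")
    return per_hop_reliability ** hops


if __name__ == "__main__":
    # Print end-to-end success for chains of increasing length.
    for hops in (1, 5, 10, 20):
        p = chain_success_probability(0.85, hops)
        print(f"{hops:>2} hops at 85% per hop -> {p:.1%} end-to-end")
```

Running it for your own measured per-step reliability gives a rough ceiling on what to expect from long Level 3 trajectories before any error recovery is added.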
Precision Matters
"The population is about 350,000" fails if the answer is 348,271. GAIA rewards precision, not approximate reasoning. Agents that round, estimate, or paraphrase instead of looking up exact values score poorly.
Adversarial Phrasing
GAIA questions are designed to test exact comprehension. "What is the 7th word of the third paragraph?" requires counting. "Which countries border France that are NOT in the G7?" requires set logic. These phrasings trip up agents that pattern-match rather than reason carefully.
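Positional questions like the "7th word of the third paragraph" example are more reliable when the agent delegates the counting to code rather than doing it in its head. A minimal sketch follows, assuming paragraphs are separated by blank lines and words by whitespace. The helper `nth_word_of_paragraph` and the sample text are hypothetical, and a real page would first need its HTML reduced to plain text.

```python
import re


def nth_word_of_paragraph(text: str, paragraph: int, word: int) -> str:
    """Return the word at 1-based position `word` in 1-based `paragraph`.

    Paragraphs are split on blank lines and words on whitespace, so
    punctuation stays attached to the word it follows.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    if not 1 <= paragraph <= len(paragraphs):
        raise IndexError(f"paragraph {paragraph} out of range (1..{len(paragraphs)})")
    words = paragraphs[paragraph - 1].split()
    if not 1 <= word <= len(words):
        raise IndexError(f"word {word} out of range (1..{len(words)})")
    return words[word - 1]


# Hypothetical sample text with three paragraphs.
sample = (
    "First paragraph here.\n\n"
    "Second one is short.\n\n"
    "The third paragraph has exactly eight words total."
)
print(nth_word_of_paragraph(sample, paragraph=3, word=7))
```

Exposing a helper like this through the agent's code-execution tool removes one common source of off-by-one errors, although it does not settle what counts as a "paragraph" on a given page.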
GAIA vs Other Benchmarks
| Benchmark | Focus | Task Type | Answer Format | Tool Diversity |
|---|---|---|---|---|
| GAIA | General-purpose agents | Real-world research | Exact match | High (web + file + code) |
| WebArena | Web navigation | GUI interaction | Task success | Low (web only) |
| SWE-bench | Coding agents | GitHub issues | Test pass rate | Low (code only) |
| τ-bench | Tool use | API calls | Output match | Medium |
| AgentBench | General agents | 8 environments | Environment-specific | High |
| MMLU | Knowledge | Multiple choice | Label match | None |
When to use GAIA: when evaluating a general-purpose assistant that must search, compute, and reason. Not appropriate for specialized coding, web navigation, or pure knowledge retrieval agents.
When to use SWE-bench: when evaluating coding agents specifically. See the next lesson.
When to use WebArena: when evaluating agents that navigate real websites.
Dataset Access
GAIA is available on the HuggingFace Hub:
```python
from datasets import load_dataset

# Public validation set (answers provided).
# Note: the dataset may be gated; if so, accept its terms on the Hub and
# authenticate (e.g. `huggingface-cli login`) before loading.
dataset = load_dataset("gaia-benchmark/GAIA", "2023_all")

# Splits
validation = dataset["validation"]  # 165 tasks (53 / 86 / 26 for Levels 1-3), answers public
test = dataset["test"]              # 301 tasks, answers private; submit to the leaderboard

# Task structure
example = validation[0]
print(example.keys())
# Columns include: 'task_id', 'Question', 'Level', 'Final answer',
# 'file_name', 'Annotator Metadata'

print(f"Level: {example['Level']}")
print(f"Question: {example['Question'][:200]}")
print(f"Expected answer: {example['Final answer']}")
```
The validation set includes answers, making it suitable for offline development and iteration. The test set requires submission to the leaderboard for scoring.
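Before iterating, it helps to know how many tasks you have at each level, since small Level 3 counts make scores noisy. The sketch below works on any iterable of row-like dicts with a `Level` field. The rows shown are hypothetical stand-ins rather than real GAIA records, and `int()` is applied because the level may be stored as a string.

```python
from collections import Counter


def count_tasks_by_level(rows) -> dict:
    """Count tasks per difficulty level, coercing the level to int."""
    counts = Counter(int(row["Level"]) for row in rows)
    return dict(sorted(counts.items()))


# Hypothetical rows shaped like dataset records (not real GAIA tasks).
rows = [
    {"task_id": "a", "Level": "1"},
    {"task_id": "b", "Level": "2"},
    {"task_id": "c", "Level": 1},
    {"task_id": "d", "Level": "3"},
]
print(count_tasks_by_level(rows))
```

The same function should work when passed a loaded split directly, since iterating a split yields one dict per row.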
Running GAIA Locally
Setup
```python
"""
GAIA evaluation harness - run your agent on GAIA tasks and compute scores.
"""
import json
import re
import time
from dataclasses import dataclass
from typing import Optional

import anthropic

try:
    from datasets import load_dataset
    HAS_DATASETS = True
except ImportError:
    HAS_DATASETS = False
    print("Install datasets: pip install datasets")

# The client reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()
```
```python
# ── Answer normalization ───────────────────────────────────────────────────────

def normalize_answer(answer: str) -> str:
    """
    Normalize an answer for GAIA-style exact-match scoring.
    Handles numbers, units, punctuation, and case.
    This is a simplified approximation, not the official leaderboard scorer.
    """
    if not answer:
        return ""

    answer = answer.strip()

    # Remove trailing punctuation
    answer = re.sub(r'[.,;:!?]+$', '', answer)

    # Lowercase
    answer = answer.lower()

    # Remove articles at start
    answer = re.sub(r'^(the|a|an)\s+', '', answer)

    # Normalize number formatting: "1,234" -> "1234"
    answer = re.sub(r'(\d),(\d)', r'\1\2', answer)

    # Normalize units (kilometers, km, etc.)
    unit_map = {
        r'\bkilometers?\b': 'km',
        r'\bmeters?\b': 'm',
        r'\bmillions?\b': 'million',
        r'\bbillions?\b': 'billion',
        # Swallow the preceding space so "15 percent" matches "15%"
        r'\s*\bpercent\b': '%',
    }
    for pattern, replacement in unit_map.items():
        answer = re.sub(pattern, replacement, answer)

    # Collapse runs of whitespace
    answer = ' '.join(answer.split())

    return answer


def answers_match(predicted: str, expected: str) -> bool:
    """Check if two answers match after normalization."""
    return normalize_answer(predicted) == normalize_answer(expected)
```
```python
# ── GAIA task representation ───────────────────────────────────────────────────

@dataclass
class GAIATask:
    task_id: str
    question: str
    level: int
    expected_answer: str
    file_name: Optional[str] = None
    metadata: Optional[dict] = None


@dataclass
class GAIAResult:
    task_id: str
    level: int
    question: str
    expected_answer: str
    predicted_answer: Optional[str]
    correct: bool
    steps_taken: int
    total_tokens: int
    duration_seconds: float
    trajectory_summary: list[str]
```
```python
# ── GAIA-capable agent ─────────────────────────────────────────────────────────

class GAIAAgent:
    """
    Agent capable of handling GAIA tasks.
    In production, replace mock tools with real implementations.
    """

    def __init__(self, max_steps: int = 30):
        self.max_steps = max_steps
        self.tools = self._build_tools()

    def _build_tools(self) -> list[dict]:
        return [
            {
                "name": "web_search",
                "description": "Search the web for factual information. Use specific queries.",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string", "description": "Specific search query"},
                        "num_results": {"type": "integer", "default": 5},
                    },
                    "required": ["query"],
                },
            },
            {
                "name": "read_webpage",
                "description": "Read the full content of a webpage given its URL.",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "url": {"type": "string", "description": "Full URL to read"},
                    },
                    "required": ["url"],
                },
            },
            {
                "name": "read_file",
                "description": "Read a file (PDF, CSV, TXT, XLSX) and extract its text content.",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "file_path": {"type": "string"},
                        "sheet_name": {"type": "string", "description": "For Excel files"},
                    },
                    "required": ["file_path"],
                },
            },
            {
                "name": "execute_python",
                "description": "Execute Python code and return the output. Use for calculations.",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "code": {"type": "string", "description": "Python code to execute"},
                    },
                    "required": ["code"],
                },
            },
        ]

    def run(self, task: GAIATask) -> tuple[Optional[str], list[str], int, int]:
        """
        Run the agent on a GAIA task.
        Returns (predicted_answer, trajectory_summary, steps_taken, total_tokens).
        """
        system = """You are a precise research agent solving GAIA benchmark tasks.
Rules:
1. Use tools to find exact, verifiable information. Never guess.
2. For numerical answers, compute exactly - do not round unless the question says to.
3. When counting words, paragraphs, or items, count carefully.
4. Provide your final answer as: FINAL ANSWER: [exact answer]
5. The answer should be concise - a number, name, date, or short phrase.
6. If you need a file, use read_file. If you need web information, use web_search first.
"""
        messages = [{"role": "user", "content": task.question}]
        trajectory = []
        total_tokens = 0
        steps = 0

        for _ in range(self.max_steps):
            steps += 1
            response = client.messages.create(
                model="claude-opus-4-6",
                max_tokens=4096,
                system=system,
                tools=self.tools,
                messages=messages,
            )
            total_tokens += response.usage.input_tokens + response.usage.output_tokens

            # First text block of the response, if any
            text = next((b.text for b in response.content if hasattr(b, "text")), "")
            if text:
                trajectory.append(f"Step {steps} [LLM]: {text[:100]}...")

            # Check for final answer
            if "FINAL ANSWER:" in text:
                answer = text.split("FINAL ANSWER:")[-1].strip()
                # Keep only the first line after the marker
                answer = answer.split('\n')[0].strip()
                return answer, trajectory, steps, total_tokens

            if response.stop_reason == "end_turn":
                # No explicit marker: fall back to the raw text
                return text.strip(), trajectory, steps, total_tokens

            if response.stop_reason != "tool_use":
                # e.g. "max_tokens": nothing was appended to `messages`, so
                # looping again would only repeat the same request.
                break

            messages.append({"role": "assistant", "content": response.content})
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = self._execute_tool(block.name, block.input, task)
                    trajectory.append(
                        f"Step {steps} [Tool:{block.name}]: {str(result)[:80]}..."
                    )
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(result),
                    })
            messages.append({"role": "user", "content": tool_results})

        # Step budget exhausted or the model stopped without an answer
        return None, trajectory, steps, total_tokens

    def _execute_tool(self, name: str, tool_input: dict, task: GAIATask) -> str:
        """Execute a tool. Replace with real implementations."""
        if name == "web_search":
            query = tool_input.get("query", "")
            # In production: call real search API (Serper, Tavily, etc.)
            return f"[MOCK] Search results for '{query}': ..."

        elif name == "read_webpage":
            url = tool_input.get("url", "")
            # In production: use requests + BeautifulSoup or playwright
            return f"[MOCK] Content of {url}: ..."

        elif name == "read_file":
            file_path = tool_input.get("file_path", task.file_name or "")
            # In production: use PyMuPDF, openpyxl, pandas
            return f"[MOCK] Content of {file_path}: ..."

        elif name == "execute_python":
            code = tool_input.get("code", "")
            try:
                # WARNING: In production, use a sandboxed execution environment
                import io, contextlib
                stdout = io.StringIO()
                with contextlib.redirect_stdout(stdout):
                    exec(code, {"__builtins__": __builtins__})  # noqa: S102
                return stdout.getvalue() or "Code executed successfully (no output)"
            except Exception as e:
                return f"Error: {e}"

        return f"Unknown tool: {name}"
```
```python
# ── GAIA evaluation runner ─────────────────────────────────────────────────────

class GAIAEvaluator:
    """Runs an agent on GAIA tasks and computes scores."""

    def __init__(self, agent: GAIAAgent):
        self.agent = agent

    def evaluate_tasks(self, tasks: list[GAIATask]) -> list[GAIAResult]:
        results = []
        for i, task in enumerate(tasks):
            print(f"Task {i+1}/{len(tasks)} (Level {task.level}): {task.question[:60]}...")
            t0 = time.time()
            predicted, trajectory, steps, tokens = self.agent.run(task)
            duration = time.time() - t0

            correct = answers_match(predicted or "", task.expected_answer)
            result = GAIAResult(
                task_id=task.task_id,
                level=task.level,
                question=task.question,
                expected_answer=task.expected_answer,
                predicted_answer=predicted,
                correct=correct,
                steps_taken=steps,
                total_tokens=tokens,
                duration_seconds=duration,
                trajectory_summary=trajectory,
            )
            results.append(result)

            status = "CORRECT" if correct else "WRONG"
            print(f"  [{status}] Expected: {task.expected_answer!r} | Got: {predicted!r}")
        return results

    def score_report(self, results: list[GAIAResult]) -> dict:
        if not results:
            return {}

        report = {"overall": {}, "by_level": {}}

        # Overall
        correct = sum(1 for r in results if r.correct)
        report["overall"] = {
            "score": correct / len(results),
            "correct": correct,
            "total": len(results),
            "avg_steps": sum(r.steps_taken for r in results) / len(results),
            "avg_tokens": sum(r.total_tokens for r in results) / len(results),
            "avg_duration_s": sum(r.duration_seconds for r in results) / len(results),
        }

        # By level
        for level in [1, 2, 3]:
            level_results = [r for r in results if r.level == level]
            if level_results:
                level_correct = sum(1 for r in level_results if r.correct)
                report["by_level"][f"level_{level}"] = {
                    "score": level_correct / len(level_results),
                    "correct": level_correct,
                    "total": len(level_results),
                }
        return report

    def print_report(self, report: dict):
        print("\n── GAIA Evaluation Report ──────────────────────")
        overall = report["overall"]
        print(f"Overall: {overall['score']:.1%} ({overall['correct']}/{overall['total']})")
        print(f"Avg steps: {overall['avg_steps']:.1f} | "
              f"Avg tokens: {overall['avg_tokens']:,.0f} | "
              f"Avg time: {overall['avg_duration_s']:.1f}s")
        print("\nBy Level:")
        for level_key, level_data in report.get("by_level", {}).items():
            print(f"  {level_key}: {level_data['score']:.1%} "
                  f"({level_data['correct']}/{level_data['total']})")
```
```python
# ── Building GAIA-style tasks for your domain ──────────────────────────────────

def create_domain_specific_gaia_tasks(domain: str) -> list[GAIATask]:
    """
    Template for creating GAIA-style tasks for a specific domain.
    Designed to test the same capabilities GAIA tests: web search,
    multi-hop reasoning, exact answer extraction.
    """
    templates = {
        "finance": [
            GAIATask(
                task_id="fin_001",
                question="What was Apple's total revenue in fiscal year 2023 according to their 10-K filing?",
                level=1,
                expected_answer="383.285 billion",
            ),
            GAIATask(
                task_id="fin_002",
                question="What is the sum of the market caps of the FAANG companies as of the most recent quarter?",
                level=2,
                expected_answer="...",  # to be computed
            ),
        ],
        "engineering": [
            GAIATask(
                task_id="eng_001",
                question="According to the Python 3.12 documentation, how many new type parameter syntaxes were introduced?",
                level=1,
                expected_answer="3",
            ),
        ],
    }
    return templates.get(domain, [])
```
```python
# ── Demo ───────────────────────────────────────────────────────────────────────

def demo():
    """Demo with a small set of synthetic GAIA-style tasks."""
    tasks = [
        GAIATask(
            task_id="demo_001",
            question="What is 17 multiplied by 34, plus 128?",
            level=1,
            expected_answer="706",
        ),
        GAIATask(
            task_id="demo_002",
            question="If a country has a GDP of $2.5 trillion and spends 4.2% on education, "
                     "how much is the education budget in billions?",
            level=1,
            expected_answer="105",
        ),
    ]

    agent = GAIAAgent(max_steps=10)
    evaluator = GAIAEvaluator(agent)
    results = evaluator.evaluate_tasks(tasks)
    report = evaluator.score_report(results)
    evaluator.print_report(report)


if __name__ == "__main__":
    demo()
```
GAIA-Style Task Design Principles
When building domain-specific GAIA-style benchmarks, follow these principles:
Principle 1: Require actual retrieval. The answer must not appear in the model's training data. Link it to a specific source that must be accessed.
Principle 2: Make the answer exact and verifiable. "A number" or "a name" - something that can be objectively checked. Avoid subjective questions.
Principle 3: Require multi-hop. One search result should not contain the answer. The agent must chain across at least 2 sources.
Principle 4: Control for shortcut avoidance. Test that the question cannot be answered by guessing the most common or plausible answer. Use specific numbers, dates, and unusual facts.
Principle 5: Test diverse tools. Ensure the benchmark covers file reading, computation, web search, and multi-modal tasks - not just web search.
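Principle 4 can be partly automated by piloting each candidate task against a model with tools disabled and flagging the ones it already answers correctly. The sketch below is one way to structure that check. `CandidateTask`, `find_shortcut_tasks`, and `stub_model` are hypothetical names, and `stub_model` stands in for a real tool-free model call. The comparison is a plain case-insensitive match, which you may want to replace with your benchmark's normalization.

```python
from typing import Callable, Iterable, NamedTuple


class CandidateTask(NamedTuple):
    task_id: str
    question: str
    expected_answer: str


def find_shortcut_tasks(
    tasks: Iterable[CandidateTask],
    answer_without_tools: Callable[[str], str],
) -> list[str]:
    """Return IDs of tasks that a tool-free model already answers correctly."""
    shortcut_ids = []
    for task in tasks:
        guess = answer_without_tools(task.question)
        if guess.strip().lower() == task.expected_answer.strip().lower():
            shortcut_ids.append(task.task_id)
    return shortcut_ids


def stub_model(question: str) -> str:
    """Stand-in for a real tool-free model call; it only 'knows' one fact."""
    return "Paris" if "capital of France" in question else "unknown"


# Hypothetical candidate tasks for a domain benchmark.
candidates = [
    CandidateTask("geo_001", "What is the capital of France?", "paris"),
    CandidateTask("fin_001", "What was the Q3 line item in the attached budget file?", "1482"),
]
too_easy = find_shortcut_tasks(candidates, stub_model)
print(too_easy)
```

Tasks returned by this check are candidates for rewriting or removal, since they can be answered from model memory alone.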
Production Engineering Notes
Use GAIA Validation for Development
The GAIA validation set (with public answers) is appropriate for development iteration. Never tune specifically to the validation set - treat it as a proxy for the test set. Report test set scores for public comparisons.
Track Per-Capability Failure Modes
Break your GAIA failures down by required capability: web search, file reading, computation, multi-hop. If you fail 80% of computation tasks but 40% of web search tasks, your bottleneck is clear. Fix the bottleneck before optimizing elsewhere.
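A per-capability breakdown like this needs each task tagged with the capabilities it requires. The sketch below assumes you have assigned those tags yourself (the tag names and sample results are hypothetical) and reports the failure rate per tag, worst first. A task with several tags counts toward each of them.

```python
from collections import defaultdict


def failure_rate_by_capability(results) -> dict:
    """Map each capability tag to its failure rate, sorted worst first.

    `results` is an iterable of dicts with a 'capabilities' list and a
    'correct' bool.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for result in results:
        for capability in result["capabilities"]:
            totals[capability] += 1
            if not result["correct"]:
                failures[capability] += 1
    rates = {cap: failures[cap] / totals[cap] for cap in totals}
    return dict(sorted(rates.items(), key=lambda item: item[1], reverse=True))


# Hypothetical tagged results.
sample_results = [
    {"capabilities": ["web_search"], "correct": True},
    {"capabilities": ["web_search", "computation"], "correct": False},
    {"capabilities": ["computation"], "correct": False},
    {"capabilities": ["file_reading"], "correct": True},
    {"capabilities": ["web_search"], "correct": True},
]
for capability, rate in failure_rate_by_capability(sample_results).items():
    print(f"{capability}: {rate:.0%} failed")
```

Because multi-tag tasks are counted under every tag, a failure caused by one capability also raises the rate of the others it is paired with. Treat the ranking as a pointer for where to read trajectories, not as a precise attribution.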
Cost Budgeting
A GAIA Level 3 task can easily require 30 steps and 50,000 tokens. At scale:
- For example, 115 Level 3-style tasks × 50K tokens × $15 per million tokens ≈ $86 per evaluation run
Plan accordingly. Use Level 1 tasks for rapid iteration, Level 2-3 for pre-release evaluation.
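A small estimator makes it easy to redo this arithmetic for your own task counts and pricing. The sketch below assumes a single blended price per million tokens. The $15 figure in the usage example is an assumption, chosen because it reproduces the roughly $86 estimate above, and real pricing usually differs between input and output tokens. The function name is illustrative.

```python
def estimate_run_cost(
    num_tasks: int,
    tokens_per_task: int,
    usd_per_million_tokens: float,
) -> float:
    """Estimate the cost in USD of one evaluation run at a blended token price."""
    total_tokens = num_tasks * tokens_per_task
    return total_tokens / 1_000_000 * usd_per_million_tokens


# 115 tasks at 50K tokens each, at an assumed blended rate of $15 per million tokens.
cost = estimate_run_cost(num_tasks=115, tokens_per_task=50_000, usd_per_million_tokens=15.0)
print(f"Estimated cost per run: ${cost:.2f}")
```

Multiply the result by the number of runs you expect per week to decide how often a full Level 3 pass is affordable.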
:::danger Do Not Overfit to the Validation Set GAIA's validation set is public. It is tempting to tune prompts specifically for validation tasks. This produces inflated scores that do not generalize. Treat the validation set as a general capability probe, not a target to optimize. Any prompt change that improves validation performance should be explainable by a general capability improvement - not by memorizing specific question patterns. :::
:::warning Contamination Detection As of 2025, there is evidence that some GAIA validation tasks appear in training data of certain models. When interpreting GAIA scores, always check whether the model has known contamination with the benchmark. HuggingFace provides guidance on checking for overlap with model training sets. :::
Interview Q&A
Q: What is GAIA and what makes it a good agent benchmark?
A: GAIA (General AI Assistants) is a benchmark developed by researchers at Meta, HuggingFace, and AutoGPT that tests agents on real-world tasks requiring web search, file reading, code execution, and multi-step reasoning. It is a good benchmark for several reasons: answers are exact and verifiable (no subjective scoring), tasks genuinely require multi-step reasoning (cannot be solved by pattern matching), they require diverse tools (web, file, code), and the human baseline is very high (92%) which means there is meaningful room between current agent performance and human-level. The three difficulty levels (Level 1: one or two steps; Level 2: roughly 5–15 steps; Level 3: long chains of 10 or more steps) allow granular assessment of capability.
Q: An agent scores 70% on GAIA Level 1 but only 25% on Level 3. What does this tell you about the agent?
A: This gap reveals a specific capability limitation: the agent can handle simple single-hop retrieval and computation (Level 1) but breaks down on complex multi-hop reasoning chains (Level 3). The most likely causes are: context window management failure (too many accumulated steps exceed what the model can reason over effectively), compounding error rates (85% success per hop compounds to roughly 20% success over 10 hops), poor planning (the agent does not decompose Level 3 problems into manageable sub-questions), or precision errors (small inaccuracies at intermediate steps invalidate the final answer). I would investigate by analyzing which step in failed Level 3 trajectories first deviates from the correct path.
Q: How does GAIA differ from MMLU or similar knowledge benchmarks?
A: MMLU tests knowledge recall - it is essentially a multiple-choice test of what information is in a model's weights. GAIA tests agentic capability - the ability to retrieve, process, and reason over information using tools. An LLM with no tool access scores ~10% on GAIA and ~75% on MMLU. This distinction matters for production: if your agent relies on tools to serve users, MMLU scores are almost irrelevant to predicting production quality. GAIA is a much better proxy for general-purpose agent capability.
Q: What is the "multi-hop compound error" problem in GAIA and how do you address it?
A: In GAIA tasks requiring N hops, if each hop has probability p of being correct, the probability of all N hops being correct is $p^N$ (assuming the hops fail independently). At p=0.85 and N=10, success probability is only about 20%. This is the compound error problem. Mitigation strategies: intermediate verification (after each hop, verify the extracted fact against the source), explicit uncertainty tracking (the agent tracks confidence per step and re-searches when uncertain), decomposition prompting (breaking the task into clearly named sub-questions and solving each independently), and error recovery (when a late-stage fact contradicts earlier findings, backtrack and re-solve the conflicting hop).
Q: You want to build a GAIA-style benchmark for a specific domain. What are the key design principles?
A: The five key principles are: require actual retrieval (answers must not be in training data, must come from specified sources), use exact verifiable answers (numbers, names, dates - not subjective quality), require multi-hop reasoning (no single tool call should suffice), prevent shortcut answers (questions should not be answerable by the most plausible guess), and cover diverse tools (mix web search, file reading, code execution, and multi-modal tasks). The hardest part is "shortcut avoidance" - pilot the tasks with a strong model and see if it answers correctly without using tools. If it does, the task is too easy for a benchmark.
Advanced: GAIA Failure Analysis Pipeline
Once you have run your agent against GAIA, systematic failure analysis reveals which capabilities to improve next. Here is a complete failure analysis framework:
```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class GAIAFailureAnalysis:
    instance_id: str
    level: int
    question: str
    expected: str
    predicted: Optional[str]
    status: str  # "wrong_answer", "timeout", "stuck", "format_error"
    steps_taken: int
    failure_step: Optional[int]  # Which step in the trajectory failed
    failure_category: Optional[str]  # "goal_drift", "hallucination", "unit_error", etc.
    trajectory_summary: list[str]
```
```python
def categorize_failure(
    question: str,
    expected: str,
    predicted: Optional[str],
    trajectory: list[str],
) -> str:
    """
    Heuristically categorize a GAIA failure by type.

    Categories:
    - "no_answer": agent did not produce a final answer
    - "unit_error": correct number, wrong units
    - "precision_error": approximately correct, not exact
    - "wrong_source": retrieved from wrong source (not detected automatically)
    - "goal_drift": answered a different question
    - "table_misread": likely read a table incorrectly
    - "counting_error": off-by-one or counting mistake
    - "other": unclassified
    """
    if predicted is None:
        return "no_answer"

    pred_clean = predicted.lower().strip()
    exp_clean = expected.lower().strip()

    # Check for unit errors: first numbers match but the text differs
    pred_nums = re.findall(r'\d+(?:\.\d+)?', pred_clean)
    exp_nums = re.findall(r'\d+(?:\.\d+)?', exp_clean)
    if pred_nums and exp_nums and pred_nums[0] == exp_nums[0]:
        if pred_clean != exp_clean:
            return "unit_error"

    if pred_nums and exp_nums:
        # The regex only matches valid decimal literals, so float() is safe here
        pred_val = float(pred_nums[0])
        exp_val = float(exp_nums[0])

        # Off-by-one on whole numbers is checked first; otherwise large counts
        # (e.g. 100 vs 101) would be swallowed by the 10% tolerance below.
        if pred_val.is_integer() and exp_val.is_integer() and abs(pred_val - exp_val) == 1:
            return "counting_error"

        # Precision errors: predicted is within 10% but not exact
        if abs(pred_val - exp_val) / max(abs(exp_val), 1e-8) < 0.1:
            return "precision_error"

    # Heuristic: table questions that fail often involve table misreads
    table_keywords = ["table", "column", "row", "spreadsheet", "excel", "csv"]
    if any(kw in question.lower() for kw in table_keywords):
        return "table_misread"

    # Heuristic: if trajectory is long but answer is wrong, likely goal drift
    if len(trajectory) > 15:
        return "goal_drift"

    return "other"
```
```python
# Suggested next step for each failure category
RECOMMENDED_FIXES = {
    "no_answer": "Add explicit FINAL ANSWER: format requirement; increase max iterations",
    "unit_error": "Add unit normalization step to agent reasoning chain",
    "precision_error": "Require agent to compute exact value, not approximate",
    "counting_error": "Add explicit verification step for counting tasks",
    "table_misread": "Use structured table parser instead of raw text extraction",
    "goal_drift": "Add goal re-statement at each reasoning step",
    "other": "Manual review required - examine specific failure cases",
}


def analyze_failures(
    results: list[dict],
    trajectory_data: dict,  # {task_id: list[str]}
) -> dict:
    """
    Perform systematic failure analysis on GAIA results.
    Returns actionable breakdown by failure category.

    Each result dict needs "correct", "question", "expected", and "level";
    "task_id", "predicted", and "steps" are optional.
    """
    failures = [r for r in results if not r["correct"]]
    category_counts = defaultdict(int)
    level_category_counts = defaultdict(lambda: defaultdict(int))
    analyses = []

    for r in failures:
        trajectory = trajectory_data.get(r.get("task_id", ""), [])
        category = categorize_failure(
            r["question"],
            r["expected"],
            r.get("predicted"),
            trajectory,
        )
        category_counts[category] += 1
        level_category_counts[r["level"]][category] += 1

        # Per-task records, kept for manual inspection
        analyses.append(GAIAFailureAnalysis(
            instance_id=r.get("task_id", ""),
            level=r["level"],
            question=r["question"][:100],
            expected=r["expected"],
            predicted=r.get("predicted"),
            status="wrong_answer" if r.get("predicted") else "no_answer",
            steps_taken=r.get("steps", 0),
            failure_step=None,  # Would require more detailed trajectory analysis
            failure_category=category,
            trajectory_summary=trajectory[:5],
        ))

    total_failures = len(failures)
    total_tasks = len(results)

    return {
        "summary": {
            "total_tasks": total_tasks,
            "total_failures": total_failures,
            "failure_rate": total_failures / total_tasks if total_tasks > 0 else 0,
        },
        "failure_categories": {
            cat: {
                "count": count,
                "rate": count / total_failures if total_failures > 0 else 0,
                "recommended_fix": RECOMMENDED_FIXES.get(cat, "Review specific cases"),
            }
            for cat, count in sorted(category_counts.items(), key=lambda x: x[1], reverse=True)
        },
        "by_level": {
            f"level_{level}": dict(cats)
            for level, cats in level_category_counts.items()
        },
    }
```
```python
def print_failure_report(report: dict):
    """Print a human-readable failure analysis report."""
    summary = report["summary"]
    print("\n── GAIA Failure Analysis ────────────────────────")
    print(f"Total tasks: {summary['total_tasks']}")
    print(f"Total failures: {summary['total_failures']} ({summary['failure_rate']:.1%})")

    print("\nFailure categories (most common first):")
    for cat, data in report["failure_categories"].items():
        print(f"\n  {cat}: {data['count']} ({data['rate']:.1%} of failures)")
        print(f"    Recommended fix: {data['recommended_fix']}")

    print("\nBy level:")
    for level_key, cats in report["by_level"].items():
        print(f"\n  {level_key}:")
        for cat, count in sorted(cats.items(), key=lambda x: x[1], reverse=True):
            print(f"    {cat}: {count}")
```
Using Failure Analysis to Guide Improvement
The failure analysis output tells you where to invest improvement effort:
| Dominant Failure Category | Root Cause | Fix |
|---|---|---|
| `no_answer` | Agent times out or gets stuck | Increase `max_steps`; add explicit stopping criteria |
| `unit_error` | Agent retrieves right number, wrong unit | Add unit normalization to reasoning chain |
| `precision_error` | Agent rounds or estimates | Require exact computation with Python interpreter |
| `table_misread` | Agent reads wrong row/column | Add structured table parsing tool |
| `goal_drift` | Agent forgets original question | Add goal re-statement to system prompt |
| `counting_error` | Off-by-one in counting tasks | Add explicit verify-by-counting step |
The power of systematic failure analysis: instead of vaguely knowing "the agent is not good at Level 3," you might learn, for example, that 38% of Level 3 failures are goal_drift and 24% are table_misread. These two categories have known fixes that can be implemented and verified in a week.
GAIA as a Capability Radar
Thinking of GAIA as a radar chart of agent capabilities - not a single score - gives a more complete picture.
The radar reveals a consistent pattern across all current systems: single-step retrieval is reliable, multi-step reasoning is inconsistent, and complex planning with backtracking is the frontier. Agent development efforts that target the "weak" category produce the most GAIA score improvement per engineering-hour invested.
GAIA is ultimately not a score to maximize - it is a diagnostic tool. Its value is in telling you exactly where your agent's capabilities break down, with enough granularity to guide targeted improvements. Use it that way, and it will pay dividends across your entire evaluation and improvement workflow.
:::tip Start With Level 2
When first running your agent against GAIA, focus your analysis on Level 2 results. Level 1 is too easy to be diagnostic for capable agents, and Level 3 requires capabilities most agents do not yet have. Level 2 questions - requiring 5-15 steps across multiple tools - are exactly in the range where agent architectural choices (planning strategy, context management, tool design) make the most difference. Improving Level 2 performance from 30% to 50% is achievable in weeks; it reveals your most impactful capability gaps; and it correlates better with real-world assistant usefulness than either Level 1 or Level 3 scores.
:::
Further Reading
- Mialon et al. (2023), "GAIA: a benchmark for General AI Assistants" - the original paper defining the benchmark and analyzing initial model performance
- The GAIA leaderboard on HuggingFace Hub (`gaia-benchmark/GAIA`) - current state-of-the-art scores and submission guidelines
- Lesson 04 of this module - SWE-bench Verified, the complementary benchmark for coding-specific agent capability
- Lesson 05 - LLM-as-Agent-Judge, for evaluating agents on tasks where exact-match is not the right scoring approach
:::tip Running GAIA for the First Time
Start with the GAIA validation set (answers are public), run your agent on Level 1 only (53 of the 165 validation tasks), and measure your baseline. Level 1 is where all production-grade agents should score above 60%. If you score below 50% on Level 1, there is a fundamental tool use or instruction-following problem to fix before addressing Level 2 or 3. Use the failure analysis framework in this lesson to identify whether the failures cluster around a specific category - and fix that category before re-running. This systematic approach typically improves Level 1 scores by 15-20 percentage points in the first week of iteration.
:::
Relationship to Other Evaluation Methods
GAIA is one part of a complete agent evaluation strategy. Here is how it fits with the other techniques in this module:
| Method | What It Measures | When to Use |
|---|---|---|
| GAIA | General multi-step reasoning with diverse tools | Overall capability assessment, SOTA comparison |
| SWE-bench Verified | Coding-specific bug fixing | Coding agent evaluation |
| Trajectory Evaluation | Step efficiency, backtracking, cost | Any agent, continuous CI/CD monitoring |
| LLM Judge | Holistic quality on open-ended tasks | Customer support, research, writing agents |
| Human Evaluation | Ground truth quality signal | Calibration, safety review, novel task types |
| Production Monitoring | Real-world performance on actual users | After deployment, continuous quality tracking |
No single method is sufficient. GAIA tells you whether your agent can handle complex real-world tasks - but it does not tell you whether it is efficient (trajectory evaluation), safe (human evaluation), or performing well on your specific user base (production monitoring). Build the full stack.
The investment in GAIA evaluation is justified by what it reveals: systematic capability gaps that internal testing consistently misses, because internal tests are written by the same engineers who built the agent and therefore target the capabilities the agent already has. GAIA tests capabilities the agent might not have, on task types nobody on your team thought to include in the test suite. That is its irreplaceable value.
:::note Benchmark Literacy
When reading AI research papers or vendor claims, always ask: which benchmark? at which difficulty level? with which tools? by which scoring method? A system that reports "60% on GAIA" without specifying whether that is the full benchmark, the validation set, Level 1 only, or a cherry-picked subset is reporting a number that cannot be interpreted. The benchmark details matter as much as the score. This applies to GAIA, SWE-bench, and every other evaluation framework.
:::
The ability to critically interpret benchmark claims is as valuable as the ability to run evaluations yourself. When GAIA scores appear in papers, press releases, or vendor pitches, you now have the framework to evaluate what those scores actually mean - and what questions to ask when they do not tell the complete story.
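The questions to ask of any reported score (which benchmark, which split and level, which tools, which scoring method) can be turned into a small checklist. The sketch below is one possible way to record a claim and list what it leaves unstated. The `BenchmarkClaim` class and its field names are hypothetical, not part of GAIA or any leaderboard format.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class BenchmarkClaim:
    """A reported score plus the context needed to interpret it."""
    score: float
    benchmark: Optional[str] = None  # e.g. "GAIA"
    split: Optional[str] = None      # e.g. "validation" or "test"
    level: Optional[str] = None      # e.g. "all levels" or "Level 1"
    tools: Optional[str] = None      # e.g. "web search + code execution"
    scoring: Optional[str] = None    # e.g. "exact match with normalization"

    def missing_context(self) -> List[str]:
        """Names of the context fields the claim leaves unspecified."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# "60% on GAIA" with nothing else specified
claim = BenchmarkClaim(score=0.60, benchmark="GAIA")
print(claim.missing_context())  # ['split', 'level', 'tools', 'scoring']
```

An empty list from `missing_context()` does not make a claim trustworthy; it only means the basic questions have been answered and the comparison can begin.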
Continue to the next lesson - SWE-bench Verified - for the domain-specific complement to GAIA: rigorous evaluation of coding agents on real production issues.
A point gained on GAIA Level 3 usually reflects a genuine capability improvement, and such improvements tend to transfer to real production tasks. Measure it carefully, understand what drives it, and let the data guide your next iteration.
