
Benchmarks: WebArena and OSWorld

What 15% Really Means

When WebArena was released in 2023, the best agents achieved about 15% on its 812 web tasks. Headlines ran: "AI Still Can't Browse the Web." Tech commentators called the result "disappointing."

They were wrong to be disappointed.

The tasks in WebArena are not simple. "Find the GitLab issue with the most comments in the project X and add a 'needs-review' label to it." "On the shopping site, find all products with a 4-star average rating under $50 and add the cheapest one to cart." "Post a reply to the most recent thread in the Support forum that asks about shipping, with the shipping policy copied from the FAQ page."

These tasks require navigating multiple pages, reading and understanding content, making decisions, and taking multi-step actions - all without making a single mistake that breaks the task. A human takes 2–4 minutes per task. An agent at 15% solved roughly 120 of the 812 tasks correctly. That is about 120 tasks a human would otherwise have spent time on, now automated.

Understanding what benchmarks actually measure, how they are constructed, and how current SOTA systems perform gives you the grounding to evaluate your own computer use agents realistically. That is what this lesson covers.



WebArena: The Web Navigation Standard

WebArena was released in July 2023 by researchers at Carnegie Mellon University. It is the most widely cited web agent benchmark and the one most teams use to track progress.

Task Construction

WebArena tasks are constructed to be:

  • Realistic: Based on actual workflows users perform on these types of sites
  • Reproducible: Run against locally hosted versions of the sites (not live internet) to ensure consistency
  • Unambiguous: Each task has a specific, checkable answer

Tasks are templated from a set of base tasks with varied parameters. For example: "Find [product_name] on [shop_site] and check if it's available in [color]." The templates are filled with concrete values from the seeded test data.
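A minimal sketch of how a templated task like this could be expanded into concrete tasks. The template mirrors the example above; the `fill_template` helper and the seed values are hypothetical illustrations, not WebArena's actual tooling or data.

```python
from itertools import product

# Template from the text, rewritten with Python format fields.
TEMPLATE = "Find {product_name} on {shop_site} and check if it's available in {color}."

# Hypothetical seed values standing in for the benchmark's seeded test data.
SEED_VALUES = {
    "product_name": ["the canvas backpack", "the steel water bottle"],
    "shop_site": ["the shopping site"],
    "color": ["blue", "black"],
}


def fill_template(template: str, values: dict[str, list[str]]) -> list[str]:
    """Return one concrete task per combination of parameter values."""
    names = list(values)
    tasks = []
    for combo in product(*(values[name] for name in names)):
        tasks.append(template.format(**dict(zip(names, combo))))
    return tasks


tasks = fill_template(TEMPLATE, SEED_VALUES)
for task in tasks:
    print(task)
```

Because every parameter comes from data seeded into the test environment, each generated task has a known answer that can be checked automatically.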

Evaluation Method

Evaluation is functional: it asks whether the agent accomplished the task, not whether it took the expected steps.

Three evaluation types:

  1. Exact match: The agent's final output matches the expected answer exactly
  2. URL match: The agent navigated to the expected URL
  3. Program verification: A verifier script checks the site's database to confirm the action was taken (e.g., was the item added to cart? Was the comment posted?)

This makes WebArena rigorous - the agent cannot fake success by writing the right text without actually performing the action.
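The three evaluation types can be sketched as small checker functions. This is an illustration under stated assumptions, not WebArena's evaluator: the function names, the `cart_items` table, and the `shop.local` URLs are hypothetical, and an in-memory SQLite database stands in for the site's real backend.

```python
import sqlite3
from urllib.parse import urlparse


def exact_match(answer: str, expected: str) -> bool:
    """Exact match: compare the agent's final answer, ignoring outer whitespace."""
    return answer.strip() == expected.strip()


def url_match(final_url: str, expected_url: str) -> bool:
    """URL match: compare host and path, ignoring a trailing slash."""
    got, want = urlparse(final_url), urlparse(expected_url)
    return (got.netloc, got.path.rstrip("/")) == (want.netloc, want.path.rstrip("/"))


def cart_contains(db: sqlite3.Connection, user_id: int, sku: str) -> bool:
    """Program verification: query the site's state instead of trusting the agent."""
    row = db.execute(
        "SELECT COUNT(*) FROM cart_items WHERE user_id = ? AND sku = ?",
        (user_id, sku),
    ).fetchone()
    return row[0] > 0


# Stand-in for the shopping site's database after the agent has finished.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cart_items (user_id INTEGER, sku TEXT)")
db.execute("INSERT INTO cart_items VALUES (1, 'SKU-123')")

print(exact_match(" 42 ", "42"))
print(url_match("http://shop.local/cart/", "http://shop.local/cart"))
print(cart_contains(db, user_id=1, sku="SKU-123"))
print(cart_contains(db, user_id=1, sku="SKU-999"))
```

The last check is the one that cannot be satisfied by text alone: an agent that merely claims to have added the item still fails, because the row is missing from the database.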

WebArena Results: Current SOTA

| System | Score | Notes |
| --- | --- | --- |
| Random agent | 1.1% | Baseline |
| GPT-4 + text-only | 8.6% | Original paper baseline |
| Claude 3.5 Sonnet (2024) | ~39% | With best prompting |
| SWE-agent on WebArena | ~18% | Specialized agent |
| Human | 78.24% | Human upper bound |

Notable patterns in the results:

  • Shopping tasks: agents perform best (~25–35%)
  • GitLab tasks: medium performance (~20–30%)
  • Multi-site tasks: worst performance (~10–15%) - require cross-site reasoning
  • Long-horizon tasks (>10 steps): performance drops significantly
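Breakdowns like these come from grouping per-task outcomes by category. A small sketch, using made-up records rather than real benchmark results; the `success_by_category` helper is a hypothetical name.

```python
from collections import defaultdict

# Hypothetical per-task outcomes: (site category, passed?)
records = [
    ("shopping", True), ("shopping", False), ("shopping", True),
    ("gitlab", True), ("gitlab", False), ("gitlab", False), ("gitlab", False),
    ("multi-site", False), ("multi-site", False),
]


def success_by_category(records: list[tuple[str, bool]]) -> dict[str, float]:
    """Return the fraction of passed tasks for each category."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for category, passed in records:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


rates = success_by_category(records)
for category, rate in rates.items():
    print(f"{category:<12} {rate:.0%}")
```

Reporting per-category rates alongside the overall score makes it easier to see whether an improvement is broad or confined to one site.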

WebArena Lite

WebArena Lite is a 165-task subset selected for:

  • Shorter completion time
  • Faster evaluation
  • Use in development and ablation studies

Most teams use Lite for iterating on agent design and the full WebArena for final evaluation.


OSWorld: Desktop and Application Benchmarks

WebArena tests web navigation exclusively. OSWorld (released April 2024, led by researchers at the University of Hong Kong) extends evaluation to desktop environments - the full OS, including desktop applications.

OSWorld Setup

OSWorld provides:

  • A VMware virtual machine with Ubuntu and pre-installed applications
  • 369 tasks across multiple application domains
  • Snapshot-based reproducibility: each task starts from a clean state
  • A diverse set of observation types (screenshot only, accessibility tree, both)

OSWorld Task Domains

| Domain | Tasks | Examples |
| --- | --- | --- |
| OS file management | 59 | Move files, change permissions, create archives |
| Web browser | 68 | Chrome with complex workflows, extensions |
| Productivity (LibreOffice) | 97 | Writer, Calc, Impress |
| Multimedia | 28 | VLC, GIMP, video editing |
| Multi-app workflows | 117 | Tasks spanning 2–4 applications |

OSWorld Results

| System | Score (screenshot only) | Score (a11y tree) |
| --- | --- | --- |
| Random baseline | 0.5% | 0.5% |
| GPT-4V (2024) | 5.6% | 11.2% |
| Claude 3.5 Sonnet | 22.0% | 25.1% |
| Human | 72.4% | 72.4% |

The "accessibility tree" observation gives the agent structured information about UI elements (their roles, labels, positions) in addition to the screenshot. Agents with a11y tree access significantly outperform screenshot-only agents, but screenshot-only agents are more general (they work on any application, not just ones that expose accessibility data).
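To make the accessibility-tree observation concrete, here is a sketch of how such a tree might be flattened into text for a model prompt. The `A11yNode` structure and the output format are assumptions for illustration; OSWorld's actual observation format may differ.

```python
from dataclasses import dataclass, field


@dataclass
class A11yNode:
    """One UI element: its role, label, and on-screen bounds (x, y, width, height)."""
    role: str
    label: str = ""
    bounds: tuple[int, int, int, int] = (0, 0, 0, 0)
    children: list["A11yNode"] = field(default_factory=list)


def flatten(node: A11yNode, depth: int = 0) -> list[str]:
    """Depth-first walk that emits one indented text line per element."""
    lines = [f"{'  ' * depth}[{node.role}] {node.label!r} @ {node.bounds}"]
    for child in node.children:
        lines.extend(flatten(child, depth + 1))
    return lines


# Hypothetical "Save As" dialog.
tree = A11yNode("window", "Save As", (0, 0, 800, 600), [
    A11yNode("text_field", "File name", (40, 80, 400, 30)),
    A11yNode("button", "Save", (680, 540, 80, 30)),
])

lines = flatten(tree)
print("\n".join(lines))
```

With this text in the prompt, the agent can read a button's label and position directly instead of inferring them from pixels, which is one plausible reason the a11y-tree scores are higher.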

Hardest tasks in OSWorld:

  • LibreOffice Calc formula tasks (~8% success)
  • Multi-app workflows requiring state transfer between apps (~12% success)
  • Multimedia editing (precise control of visual elements) (~5% success)

Easiest tasks in OSWorld:

  • Simple file management (move, copy, rename) (~45% success)
  • Web browsing with simple navigation (~35% success)

ScreenSpot: GUI Grounding

ScreenSpot is specifically a GUI grounding benchmark - it tests whether models can identify and locate UI elements from natural language descriptions, without requiring task completion.

What ScreenSpot Tests

Given a screenshot and a text description like "the button to submit the form" or "the search bar at the top," ScreenSpot asks: at what pixel coordinates is this element?

This tests the fundamental capability that all computer use agents depend on: visual grounding - mapping language to screen coordinates.
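A minimal sketch of how grounding accuracy could be scored, assuming a prediction counts as correct when the predicted click point falls inside the target element's bounding box. The `(left, top, right, bottom)` box format and the function names are assumptions for illustration, not the benchmark's official scoring code.

```python
Point = tuple[float, float]
Box = tuple[float, float, float, float]  # left, top, right, bottom (assumed format)


def point_in_box(point: Point, box: Box) -> bool:
    """True if the predicted click lands inside the element's bounding box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions: list[Point], boxes: list[Box]) -> float:
    """Fraction of predictions that hit their target element."""
    if not predictions:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, boxes))
    return hits / len(predictions)


# Hypothetical predictions and ground-truth boxes for three instructions.
predictions = [(110, 50), (300, 400), (20, 20)]
boxes = [(100, 40, 180, 70), (0, 0, 50, 50), (10, 10, 40, 40)]

accuracy = grounding_accuracy(predictions, boxes)
print(f"Grounding accuracy: {accuracy:.1%}")
```

Small targets such as icons leave little margin for error under this kind of scoring, since a miss of a few pixels counts the same as a miss of several hundred.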

ScreenSpot Design

  • 1,272 instructions over more than 600 screenshots from mobile, desktop, and web interfaces
  • Diverse UI types: web, mobile apps, desktop apps
  • Element types: buttons, inputs, icons, links, menus
  • Platform coverage: iOS, Android, macOS, Windows, web

ScreenSpot Results

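| Model | Mobile | Desktop | Web | Overall |
| --- | --- | --- | --- | --- |
| Chance | 0.04% | 0.03% | 0.10% | 0.06% |
| GPT-4V | 22.6% | 20.3% | 18.8% | 20.4% |
| Claude 3.5 Sonnet | 45.2% | 38.7% | 52.9% | 45.5% |
| SeeClick (specialized) | 53.4% | 35.5% | 28.3% | 41.5% |
| CogAgent (specialized) | 67.0% | 74.2% | 70.4% | 70.4% |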

Key insight: a specialized GUI grounding model can outperform general vision-language models on this task. CogAgent leads in every column, while SeeClick beats Claude 3.5 Sonnet on mobile but trails it on desktop and web. This suggests that future computer use agents will benefit from specialized grounding components.


Mind2Web: Real-World Web Diversity

Mind2Web tests web navigation across a much larger set of real websites (137 sites and more than 2,000 tasks), rather than WebArena's 5 pre-configured local sites.

What Makes Mind2Web Different

  • Real websites: not locally hosted mock environments
  • 2,350 tasks across 137 real websites from 31 categories
  • Three evaluation splits: cross-task (new tasks on familiar sites), cross-website (unseen sites in familiar domains), and cross-domain (unseen domains)
  • Tests generalization: can an agent trained on some sites navigate unfamiliar ones?

Mind2Web Evaluation

Rather than functional success (did it work?), Mind2Web evaluates:

  • Element accuracy: Did the agent click the right element?
  • Operation F1: How well did the planned actions match the reference actions?
  • Step success rate: What fraction of individual steps were correct?

This makes it easier to evaluate without running actual browser sessions for every test.
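A simplified sketch of these step-level metrics, computed offline by comparing a predicted trajectory against a reference one. The `Step` structure, the token-level F1, and the rule that a step succeeds only when both element and operation match are assumptions about how such metrics are commonly computed, not the official Mind2Web scoring script.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Step:
    element_id: str   # which element the action targets
    operation: str    # e.g. "CLICK" or "TYPE pizza"


def token_f1(pred: str, ref: str) -> float:
    """Token-level F1 between predicted and reference operation strings."""
    pred_tokens, ref_tokens = pred.split(), ref.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def score(predicted: list[Step], reference: list[Step]) -> dict[str, float]:
    """Compare trajectories step by step (assumes they are aligned one-to-one)."""
    n = len(reference)
    if n == 0:
        return {"element_accuracy": 0.0, "operation_f1": 0.0, "step_success_rate": 0.0}
    element_hits = [p.element_id == r.element_id for p, r in zip(predicted, reference)]
    op_f1 = [token_f1(p.operation, r.operation) for p, r in zip(predicted, reference)]
    step_ok = [hit and f1 == 1.0 for hit, f1 in zip(element_hits, op_f1)]
    return {
        "element_accuracy": sum(element_hits) / n,
        "operation_f1": sum(op_f1) / n,
        "step_success_rate": sum(step_ok) / n,
    }


reference = [Step("search_box", "TYPE pizza"), Step("search_btn", "CLICK"), Step("result_1", "CLICK")]
predicted = [Step("search_box", "TYPE pasta"), Step("search_btn", "CLICK"), Step("result_2", "CLICK")]

metrics = score(predicted, reference)
for name, value in metrics.items():
    print(f"{name:<20} {value:.2f}")
```

In this example only the middle step is fully correct, so the step success rate is one third even though element accuracy is two thirds.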

Mind2Web Insights

The cross-website split is the most important: it tests whether agents can generalize to unfamiliar site layouts. Performance typically drops 15–25% from the cross-task split (familiar sites) to the cross-website split, indicating that agents still rely heavily on layout patterns seen during training.


What the Benchmark Numbers Actually Mean

Understanding benchmarks requires understanding their limitations.

1. Benchmark tasks are not randomly sampled from all possible tasks. WebArena tasks were designed by researchers to be representative but tractable. Tasks that are trivially easy (navigate to the homepage) or impossibly hard (solve a complex multi-step business problem) are excluded. This selection bias means benchmark performance does not directly map to real-world performance across all possible tasks.

2. Benchmark environments differ from production environments. WebArena uses locally hosted clones of websites that look like real sites but have controlled data. Production sites have more variation, more edge cases, and more unpredictable states. Agents that score 35% on WebArena often achieve 20–25% on equivalent real-world tasks.

3. Prompt engineering significantly affects results. A 10–15% difference in WebArena score is achievable purely through prompt engineering, without changing the underlying model. Published numbers vary based on the prompting approach used.

4. Partial credit is not reflected in binary scores. If an agent completes 8 of 10 steps correctly but fails the final step, it gets zero credit for that task. As a result, an agent that nearly completes every task can score the same as one that fails at the first step.

5. Latency and cost are not captured. A system that achieves 40% in 2 minutes and one that achieves 40% in 20 minutes score identically. For production use, latency matters enormously.
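The partial-credit limitation in point 4 is easy to see with a few numbers. In this sketch, `mean_progress` is a hypothetical supplementary metric (not part of WebArena or OSWorld) and the two agents' runs are made up.

```python
# Each run is (steps completed correctly, total steps in the task).
agent_a = [(8, 10), (9, 10), (7, 8)]   # nearly finishes every task
agent_b = [(0, 10), (1, 10), (0, 8)]   # fails almost immediately


def binary_success_rate(runs: list[tuple[int, int]]) -> float:
    """Benchmark-style score: a task counts only if every step succeeded."""
    return sum(done == total for done, total in runs) / len(runs)


def mean_progress(runs: list[tuple[int, int]]) -> float:
    """Average fraction of steps completed, as a rough partial-credit signal."""
    return sum(done / total for done, total in runs) / len(runs)


for name, runs in [("agent_a", agent_a), ("agent_b", agent_b)]:
    print(f"{name}: binary={binary_success_rate(runs):.0%} "
          f"progress={mean_progress(runs):.0%}")
```

Both agents score 0% on the binary metric, yet one is a few fixes away from succeeding and the other is not. Tracking a progress signal alongside the binary score helps tell these cases apart.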


Where Agents Excel and Fail

Based on benchmark analysis and production experience:

Where agents reliably succeed:

  • Standard login flows (username + password, no 2FA)
  • Form submission with well-labeled fields
  • Navigation through paginated content with "Next" buttons
  • Extracting data from structured tables and lists
  • Following step-by-step instructions on consistent layouts

Where agents reliably fail:

  • CAPTCHA solving (reCAPTCHA v2 image selection, hCaptcha)
  • Dynamic UIs that re-render elements mid-task
  • Tasks requiring precise spatial judgment (resizing, dragging sliders)
  • Long tasks (>30 steps) - errors compound and context fills
  • Sites using anti-bot measures that block headless browsers

Building Your Own Evaluation Suite

Benchmarks like WebArena test general capability. For your specific use case, you need a custom evaluation suite.

```python
"""
agent_evaluator.py

Build and run a custom evaluation suite for your computer use agent.
Tracks success rate, latency, cost, and qualitative failure analysis.
"""

import json
import random
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Optional


@dataclass
class EvalTask:
    """A single evaluation task."""
    task_id: str
    description: str            # Natural language task description
    start_url: str
    success_criteria: dict      # What constitutes success
    expected_steps: int = 10    # Rough expected step count
    timeout_seconds: int = 120
    tags: list = field(default_factory=list)


@dataclass
class EvalResult:
    """Result of running an evaluation task."""
    task_id: str
    success: bool
    steps_taken: int
    duration_seconds: float
    estimated_cost_usd: float
    failure_reason: Optional[str] = None
    final_screenshot_path: Optional[str] = None
    notes: str = ""


class AgentEvaluator:
    """
    Evaluates a computer use agent against a task suite.
    Tracks metrics and generates evaluation reports.
    """

    def __init__(self, agent_factory: Callable, output_dir: str = "/tmp/eval"):
        """
        agent_factory: Callable that returns a configured agent instance.
        """
        self.agent_factory = agent_factory
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.results: list[EvalResult] = []

    def create_shopping_tasks(self) -> list[EvalTask]:
        """Example tasks for an e-commerce shopping agent evaluation."""
        return [
            EvalTask(
                task_id="shop-001",
                description="Find the cheapest laptop with at least 8GB RAM",
                start_url="https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops",
                success_criteria={
                    "type": "extracted_data",
                    "required_fields": ["name", "price"],
                    "validation": lambda data: data.get("price", 0) > 0,
                },
                expected_steps=8,
                tags=["extraction", "comparison"],
            ),
            EvalTask(
                task_id="shop-002",
                description="Find all tablets priced under $300 and count them",
                start_url="https://webscraper.io/test-sites/e-commerce/allinone/computers/tablets",
                success_criteria={
                    "type": "count",
                    "expected_range": (1, 50),
                },
                expected_steps=5,
                tags=["extraction", "counting"],
            ),
            EvalTask(
                task_id="shop-003",
                description="Navigate to the phones section and find the item with the most reviews",
                start_url="https://webscraper.io/test-sites/e-commerce/allinone",
                success_criteria={
                    "type": "navigation_success",
                    "url_contains": "phones",
                },
                expected_steps=6,
                tags=["navigation", "comparison"],
            ),
        ]

    def run_task(self, task: EvalTask) -> EvalResult:
        """Run a single evaluation task and return the result."""
        print(f"\nRunning: {task.task_id} - {task.description[:60]}")
        start_time = time.time()

        # Fresh agent per task so state cannot leak between tasks.
        agent = self.agent_factory()

        try:
            result = agent.run(
                task.description,
                max_steps=task.expected_steps * 3,  # Allow 3x expected
            )

            duration = time.time() - start_time
            success = self._check_success(result, task.success_criteria)

            # Estimate cost (rough: $0.01/step)
            estimated_cost = result.get("steps", 0) * 0.01

            eval_result = EvalResult(
                task_id=task.task_id,
                success=success,
                steps_taken=result.get("steps", 0),
                duration_seconds=duration,
                estimated_cost_usd=estimated_cost,
                failure_reason=None if success else result.get("result", "Unknown"),
            )

        except Exception as e:
            # Any crash in the agent or the checker counts as a failed task.
            duration = time.time() - start_time
            eval_result = EvalResult(
                task_id=task.task_id,
                success=False,
                steps_taken=0,
                duration_seconds=duration,
                estimated_cost_usd=0.0,
                failure_reason=f"Exception: {e}",
            )

        self.results.append(eval_result)
        status = "PASS" if eval_result.success else "FAIL"
        print(f"  Result: {status} | Steps: {eval_result.steps_taken} | "
              f"Time: {eval_result.duration_seconds:.1f}s | "
              f"Cost: ${eval_result.estimated_cost_usd:.3f}")

        return eval_result

    def _check_success(self, agent_result: dict, criteria: dict) -> bool:
        """Check if agent result meets success criteria."""
        criteria_type = criteria.get("type")
        # The agent may return "data": None on failure, so normalize to a dict.
        data = agent_result.get("data") or {}

        if criteria_type == "extracted_data":
            if not data:
                return False
            validation_fn = criteria.get("validation")
            if validation_fn and not validation_fn(data):
                return False
            required_fields = criteria.get("required_fields", [])
            return all(name in data for name in required_fields)

        elif criteria_type == "navigation_success":
            # If the agent reports its final URL (an optional "url" key, assumed
            # here), also require the expected substring; otherwise trust its flag.
            final_url = agent_result.get("url")
            expected = criteria.get("url_contains")
            if final_url and expected and expected not in final_url:
                return False
            return agent_result.get("success", False)

        elif criteria_type == "count":
            count = data.get("count", 0)
            min_count, max_count = criteria.get("expected_range", (0, 1000))
            return min_count <= count <= max_count

        return agent_result.get("success", False)

    def run_suite(self, tasks: list[EvalTask]) -> dict:
        """Run all tasks and return aggregate metrics."""
        print(f"\nStarting evaluation suite: {len(tasks)} tasks")
        print("=" * 60)

        for task in tasks:
            self.run_task(task)

        return self.generate_report()

    def generate_report(self) -> dict:
        """Generate evaluation report."""
        if not self.results:
            return {"error": "No results"}

        total = len(self.results)
        successful = sum(1 for r in self.results if r.success)
        success_rate = successful / total

        avg_steps = sum(r.steps_taken for r in self.results) / total
        avg_duration = sum(r.duration_seconds for r in self.results) / total
        total_cost = sum(r.estimated_cost_usd for r in self.results)

        report = {
            "summary": {
                "total_tasks": total,
                "successful": successful,
                "failed": total - successful,
                "success_rate": f"{success_rate:.1%}",
                "avg_steps_per_task": f"{avg_steps:.1f}",
                "avg_duration_seconds": f"{avg_duration:.1f}",
                "total_estimated_cost": f"${total_cost:.3f}",
            },
            "results": [
                {
                    "task_id": r.task_id,
                    "success": r.success,
                    "steps": r.steps_taken,
                    "duration": f"{r.duration_seconds:.1f}s",
                    "cost": f"${r.estimated_cost_usd:.3f}",
                    "failure_reason": r.failure_reason,
                }
                for r in self.results
            ],
            "failure_analysis": [
                {
                    "task_id": r.task_id,
                    "reason": r.failure_reason,
                }
                for r in self.results if not r.success
            ],
        }

        # Save report
        report_file = self.output_dir / "eval_report.json"
        report_file.write_text(json.dumps(report, indent=2))
        print(f"\nReport saved: {report_file}")

        print("\n" + "=" * 60)
        print("EVALUATION SUMMARY")
        print("=" * 60)
        for k, v in report["summary"].items():
            print(f"  {k:<35} {v}")

        return report


# --- Metrics to Track ---

def compute_benchmark_metrics(results: list[EvalResult]) -> dict:
    """
    Compute standard benchmark metrics comparable to WebArena/OSWorld.
    """
    total = len(results)
    if total == 0:
        return {}

    success_rate = sum(r.success for r in results) / total

    # By step count bucket (measures task difficulty handling)
    easy = [r for r in results if r.steps_taken <= 5]
    medium = [r for r in results if 5 < r.steps_taken <= 15]
    hard = [r for r in results if r.steps_taken > 15]

    return {
        "overall_success_rate": success_rate,
        "easy_tasks_success_rate": (
            sum(r.success for r in easy) / len(easy) if easy else None
        ),
        "medium_tasks_success_rate": (
            sum(r.success for r in medium) / len(medium) if medium else None
        ),
        "hard_tasks_success_rate": (
            sum(r.success for r in hard) / len(hard) if hard else None
        ),
        "avg_steps": sum(r.steps_taken for r in results) / total,
        "avg_cost_per_task": sum(r.estimated_cost_usd for r in results) / total,
        # Divide by at least 1 so a run with no successes does not crash.
        "cost_per_successful_task": (
            sum(r.estimated_cost_usd for r in results) /
            max(sum(r.success for r in results), 1)
        ),
        "total_results": total,
    }


if __name__ == "__main__":
    # Example: evaluate a simple agent
    # In practice, replace with your actual agent
    def mock_agent_factory():
        class MockAgent:
            def run(self, task, max_steps=30):
                # Simulate 60% success rate
                success = random.random() < 0.6
                steps = random.randint(3, max_steps // 2)
                return {
                    "success": success,
                    "steps": steps,
                    "data": {"name": "Test", "price": 99.99} if success else None,
                    "result": "completed" if success else "failed",
                }
        return MockAgent()

    evaluator = AgentEvaluator(
        agent_factory=mock_agent_factory,
        output_dir="/tmp/agent_eval",
    )

    tasks = evaluator.create_shopping_tasks()
    report = evaluator.run_suite(tasks)

    # Bucketed metrics for comparison across runs.
    metrics = compute_benchmark_metrics(evaluator.results)
    print(json.dumps(metrics, indent=2))
```

The Benchmark Gap: Lab vs Production

One of the most important things to understand about computer use benchmarks is the gap between benchmark performance and production performance.

Reasons for the gap:

  1. Environment control: Benchmarks run against controlled, stable test environments. Production sites change layout, run A/B tests, have different data, and behave differently at different times of day.

  2. Task selection bias: Benchmark tasks are chosen to be feasible. Production tasks include all the edge cases, ambiguous instructions, and impossible requests that benchmarks exclude.

  3. Error recovery: Benchmark scoring is binary (pass/fail). In production, partial completion has value. An agent that completes 8 of 10 steps correctly may still provide significant value even if it fails the task by benchmark criteria.

  4. Infrastructure differences: Benchmarks assume a clean, fast internet connection. Production environments may have slow page loads, intermittent connectivity, and CDN-related rendering differences.

Rule of thumb: Expect production success rates to be roughly 30–50% lower, in relative terms, than benchmark success rates on comparable tasks. An agent scoring 35% on WebArena might achieve around 20–25% on real production web tasks of similar difficulty.
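Reading the rule of thumb as a relative reduction rather than a subtraction of percentage points (an assumption about the intended meaning), the arithmetic works out as follows. The `production_estimate` helper is a hypothetical name.

```python
def production_estimate(benchmark_rate: float, relative_drop: float) -> float:
    """Scale a benchmark success rate down by a relative drop (0.3 means 30% lower)."""
    return benchmark_rate * (1 - relative_drop)


benchmark_rate = 0.35
pessimistic = production_estimate(benchmark_rate, 0.50)
optimistic = production_estimate(benchmark_rate, 0.30)

print(f"Benchmark: {benchmark_rate:.1%}")
print(f"Expected in production: {pessimistic:.1%} to {optimistic:.1%}")
```

For a 35% benchmark score this gives a range of about 17.5% to 24.5%, so the 20–25% figure quoted above sits toward the optimistic end.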


:::warning Benchmark Score Gaming

Some published results achieve high benchmark scores through approaches that do not generalize to production:

  1. Dataset leakage: If the benchmark tasks appear in the model's training data, performance is artificially inflated
  2. Environment-specific tuning: Prompts and few-shot examples tuned specifically for WebArena's 5 sites won't generalize
  3. Human-assisted trajectories: Some approaches use human demonstration data that would not be available in production

When evaluating a system for production use, run it on your own task suite against your actual target sites - not just on WebArena or OSWorld.

:::


Interview Questions and Answers

Q: What is WebArena and what makes it a good benchmark for computer use agents?

A: WebArena is a benchmark containing 812 realistic web tasks across 5 locally hosted websites (e-commerce, GitLab, Reddit, CMS, OpenStreetMap). It's good because: (1) tasks are functional - evaluated by whether the action actually succeeded, not whether the right steps were taken; (2) it uses reproducible local environments so results are consistent across evaluations; (3) tasks require multi-step reasoning (not just single clicks); (4) it includes diverse task types (search, navigate, create, compare, extract). The main limitation is that 5 local sites don't capture the full diversity of the real internet.

Q: Compare WebArena and OSWorld - what does each test that the other doesn't?

A: WebArena focuses exclusively on web browser navigation across 5 specific websites. OSWorld tests a broader range: desktop applications (LibreOffice, GIMP, VLC), OS-level file management, and multi-application workflows that span several programs. OSWorld is harder partly because desktop applications have more diverse UI patterns and less predictable layouts than web pages. OSWorld also uses actual VMs rather than locally hosted web servers, making it closer to real production environments. Use WebArena for evaluating web-specific agents, OSWorld for evaluating general desktop automation capability.

Q: What is ScreenSpot and why does it matter for computer use agent development?

A: ScreenSpot is a GUI grounding benchmark that tests a single capability: given a screenshot and a text description of a UI element, can the model identify the correct pixel coordinates? It's important because grounding is the foundational skill that all computer use agents depend on - every click, every form fill, every scroll requires accurate grounding. Unlike WebArena (which measures end-to-end task success), ScreenSpot isolates this specific capability, making it useful for diagnosing why an agent fails: if grounding accuracy is poor, improving the navigation logic won't help.

Q: Current SOTA on WebArena is about 39%. Does that mean computer use agents are only useful 39% of the time?

A: Not at all. First, WebArena tasks are designed to be challenging - multi-step, requiring perfect execution. Real production deployments often have simpler, more structured tasks where agents perform better. Second, the 61% failure rate includes partial completions (agent completed 8 of 10 steps, scored 0%). In production, partial completion often still provides value. Third, WebArena performance has improved from about 15% at release to ~39% in roughly 18 months - the trajectory is strongly upward. Finally, production deployment filters tasks: use agents for tasks they reliably handle (structured forms, consistent layouts) and humans for tasks they fail (ambiguous instructions, CAPTCHA-heavy sites).

Q: How would you build a custom evaluation suite for a computer use agent deployed on a specific company's internal tooling?

A: Steps: (1) Document 50–100 representative tasks that agents will be asked to perform, with varying complexity (easy: 5 steps, hard: 20+ steps). (2) Record example human executions to establish ground truth. (3) Define success criteria for each task - what state must the system be in at the end? This may require a verification script that checks database state or page content. (4) Create reproducible starting conditions - database snapshots, page states, pre-loaded data. (5) Run evaluations regularly (weekly or per-release) and track success rate over time. (6) Analyze failures by category: navigation errors, extraction errors, form fill errors. Use failure analysis to prioritize improvements. Track not just success rate but also cost-per-task and duration-per-task for ROI analysis.
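Step (6) can start as something very simple. The sketch below buckets free-text failure reasons into the categories named in the answer using keyword rules; the keywords and sample failure messages are hypothetical and would need tuning against your own logs.

```python
from collections import Counter

# Hypothetical keyword rules; the first category with a matching keyword wins.
CATEGORY_KEYWORDS = {
    "navigation": ["not found", "wrong page", "timeout"],
    "extraction": ["missing field", "parse"],
    "form_fill": ["validation", "required field"],
}


def categorize(reason: str) -> str:
    """Map a free-text failure reason to a coarse category."""
    lowered = reason.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "other"


# Made-up failure reasons, as might be collected from an evaluation report.
failures = [
    "Element not found: Submit button",
    "Form validation error on 'email'",
    "Could not parse price",
    "Timeout waiting for page load",
    "Unexpected modal dialog",
]

counts = Counter(categorize(reason) for reason in failures)
for category, n in counts.most_common():
    print(f"{category:<12} {n}")
```

A growing "other" bucket is a signal that the rules need a new category, which is often where the most useful improvement ideas come from.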

© 2026 EngineersOfAI. All rights reserved.