SWE-bench and Evaluation
Imagine you have built a coding agent. It seems to work well on the examples you tried. Your colleague's agent also seems to work well on different examples. How do you know which one is better? How do you know if either one is actually good?
This is not a rhetorical question. Until 2023, there was no rigorous answer. Coding agents were evaluated informally - "here's a demo where it fixed this bug" - and benchmark claims were almost impossible to compare across systems.
SWE-bench changed that. Published in October 2023 by researchers at Princeton University and the University of Chicago, SWE-bench created a standardized, reproducible evaluation framework for software engineering agents. For the first time, you could compare Devin to Claude to GPT-4 on the same 2,294 real-world tasks with a single number.
Understanding SWE-bench is not just academic. It tells you what current agents can and cannot do, where they fail, and how to think about building better evaluation for your own use case.
Why Evaluation Matters More Than You Think
When Cognition AI announced Devin in March 2024, the headline was "first AI software engineer." The number was 13.86% on SWE-bench.
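That number drew immediate scrutiny, because it was hard to compare with earlier results. Cognition ran Devin on a randomly chosen 25% of the test set (570 of the 2,294 tasks) and let it work as an unassisted agent. The strongest baseline in the original paper, Claude 2, resolved 1.96% unassisted and 4.80% when handed the exact files to edit. Part of any gap between such numbers comes from evaluation methodology: which instances were selected, what help the system was given, whether the agent had internet access during evaluation, and how Docker environments were configured.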
This is not a story about one company exaggerating results. It is a story about evaluation methodology being as important as the agent itself. When evaluation is ambiguous, everyone measures something different and the numbers are not comparable.
SWE-bench's value is that it pins down every variable: the exact instances, the exact test suites, the exact evaluation conditions. When Claude scores 57% and another system scores 40%, you know those numbers are measuring the same thing.
The SWE-bench Paper: Origins and Design
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan published SWE-bench in October 2023.
Their core insight was elegant: GitHub pull requests are already labeled with the problem they solve. The PR description says "this fixes the bug where X happens when Y." The PR adds or modifies test cases that verify the fix. And the original failing tests define what "fixed" means.
This means you can automatically collect thousands of verified software engineering tasks from public repositories, without having to write any evaluation criteria manually.
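To make this concrete, here is a minimal sketch of what one such task record might carry. The field names follow the published SWE-bench dataset (`instance_id`, `repo`, `base_commit`, `problem_statement`, `patch`, `test_patch`, `FAIL_TO_PASS`, `PASS_TO_PASS`); the `TaskInstance` class, the `agent_view` helper, and every value below are illustrative placeholders, not a real dataset row.

```python
from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    """One benchmark task derived from a merged pull request (illustrative)."""
    instance_id: str          # unique task identifier
    repo: str                 # "owner/name" on GitHub
    base_commit: str          # commit the fix was based on (pre-fix state)
    problem_statement: str    # issue text shown to the agent
    patch: str                # the human-written fix (hidden from the agent)
    test_patch: str           # tests added or changed by the PR (hidden)
    FAIL_TO_PASS: list[str] = field(default_factory=list)  # must flip to passing
    PASS_TO_PASS: list[str] = field(default_factory=list)  # must keep passing


def agent_view(task: TaskInstance) -> dict:
    """Return only the fields an agent is allowed to see."""
    return {
        "repo": task.repo,
        "base_commit": task.base_commit,
        "problem_statement": task.problem_statement,
    }


# Placeholder values, not a real dataset row.
example = TaskInstance(
    instance_id="example__project-123",
    repo="example/project",
    base_commit="0000000",
    problem_statement="parse_date() raises ValueError on ISO strings with a 'Z' suffix.",
    patch="(gold diff omitted)",
    test_patch="(test diff omitted)",
    FAIL_TO_PASS=["tests/test_dates.py::test_parse_z_suffix"],
    PASS_TO_PASS=["tests/test_dates.py::test_parse_basic"],
)
print(sorted(agent_view(example)))
```

The split between what the agent sees and what stays hidden is the whole point: the reference fix and its tests exist only to grade the result.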
Dataset Construction
The team scraped GitHub for merged pull requests from popular Python repositories that:
- Added or modified test files
- Had a clear issue reference or bug description
- Could be replicated in a Docker container
From this process, they identified 12 repositories and 2,294 tasks:
| Repository | Domain | Tasks |
|---|---|---|
| django/django | Web framework | 850 |
| scikit-learn/scikit-learn | Machine learning | 229 |
| matplotlib/matplotlib | Visualization | 184 |
| sympy/sympy | Symbolic math | 386 |
| astropy/astropy | Astronomy | 95 |
| pytest-dev/pytest | Testing framework | 119 |
| pylint-dev/pylint | Code analysis | 57 |
| pydata/xarray | Labeled arrays | 110 |
| sphinx-doc/sphinx | Documentation | 187 |
| psf/requests | HTTP library | 44 |
| pallets/flask | Web framework | 11 |
| mwaskom/seaborn | Statistical visualization | 22 |
The Evaluation Protocol
For each task, the agent receives:
- The repository state before the fix
- The issue description (what the bug is)
- (In some variants) The failing test
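The agent must modify the repository so that the held-out tests pass: the tests that the original fix made pass (fail-to-pass) plus the tests that were already passing and must not break (pass-to-pass). In the standard setting the agent never sees these tests; it must work out what to change from the issue description alone.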
This is exactly what a human developer does: read a bug report, find the relevant code, fix it, and verify it works.
SWE-bench Architecture
Binary Scoring
SWE-bench uses binary scoring: each task is either resolved (all tests pass) or unresolved (any test fails). There is no partial credit.
This is intentional. Real software engineering also has binary outcomes: the code works or it doesn't. A fix that almost passes tests is not a fix.
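The scoring rule can be sketched in a few lines. This assumes test outcomes have already been collected into a dictionary; the function name `is_resolved` and the status strings are hypothetical, while the split into fail-to-pass and pass-to-pass tests mirrors how the SWE-bench dataset labels its tests.

```python
def is_resolved(
    fail_to_pass: list[str],
    pass_to_pass: list[str],
    outcomes: dict[str, str],
) -> bool:
    """Binary scoring: every required test must end in the 'passed' state.

    outcomes maps test ID -> status string such as "passed" or "failed".
    A test missing from outcomes (for example, it never ran) counts as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return bool(required) and all(outcomes.get(t) == "passed" for t in required)


outcomes = {"test_fix": "passed", "test_old_a": "passed", "test_old_b": "failed"}
print(is_resolved(["test_fix"], ["test_old_a"], outcomes))                # True
print(is_resolved(["test_fix"], ["test_old_a", "test_old_b"], outcomes))  # False
```

The second call fails even though the target test passes, because one previously passing test regressed. There is no score between the two outcomes.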
SWE-bench Variants
SWE-bench Lite
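300 tasks sampled from a filtered pool of the full 2,294: the authors first removed tasks that are harder to evaluate cheaply, such as those whose reference fix edits more than one file. Used for faster evaluation during development - running the full benchmark requires building 2,294 Docker containers and can take hours even on fast hardware.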
SWE-bench Verified
The most important variant for current evaluation. Released in August 2024 by OpenAI in collaboration with the original SWE-bench authors.
The problem with the original SWE-bench: many of the 2,294 tasks have issues that make them unsuitable for evaluation. Some tasks are underspecified. Some require information not provided in the issue. Some have environment-specific failures that have nothing to do with the agent's code changes.
SWE-bench Verified addresses this by having human software engineers manually review and verify 500 tasks. Each task was labeled:
- Is the issue description complete and unambiguous?
- Can this be solved from the information given?
- Are the tests actually testing what the issue describes?
Only tasks that passed human review are included. The result is a cleaner, fairer subset that better reflects real-world software engineering difficulty.
When you see benchmark claims, always check: is this SWE-bench, SWE-bench Lite, or SWE-bench Verified? The numbers are not comparable across variants.
SWE-bench Multilingual and Multi-SWE-bench
Two separate benchmarks published in 2025 address the Python-only limitation: SWE-bench Multilingual, from the SWE-bench authors, and Multi-SWE-bench, from ByteDance. Both draw tasks from popular open-source repositories, and between them they cover languages including:
- Java
- TypeScript and JavaScript
- Go
- Rust
- C and C++
This matters because Python-specialized agents perform differently on statically-typed languages. These benchmarks reveal whether agents have learned general software engineering skills or just Python-specific patterns.
Current SOTA: What the Numbers Mean
| System | Benchmark | Score | Date |
|---|---|---|---|
| Claude Sonnet (Anthropic) | SWE-bench Verified | ~57% | 2025 |
| Claude 3.5 Sonnet (Anthropic) | SWE-bench Verified | ~49% | 2024 |
| GPT-4o (OpenAI) | SWE-bench Verified | ~38% | 2024 |
| OpenHands (Community) | SWE-bench Verified | ~35% | 2025 |
| SWE-agent (Princeton) | SWE-bench (full) | ~12% | 2024 |
| Devin (Cognition) | SWE-bench (random 25% subset) | 13.86% | 2024 |
What does 57% actually mean?
It means that given a real GitHub bug report, with no human assistance, Claude autonomously fixed the bug such that the original test suite passes - more than half the time.
To put this in human terms: imagine hiring a junior developer and giving them 100 bug reports. If they resolved 57 of them independently in under an hour each, you would be pleased with their performance. That is what this number represents.
The 43% it cannot solve is equally informative. That represents:
- Tasks requiring complex multi-file architectural changes
- Tasks where the issue description is ambiguous
- Tasks in unfamiliar domains or codebases
- Tasks requiring knowledge that is not in the code (external APIs, deployment environments)
- Tasks where the fix requires non-obvious domain expertise
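A headline number also carries sampling noise, because it is measured on a few hundred tasks. The sketch below uses the normal approximation to the binomial to put a rough 95% interval around a resolve rate. The count of 285 out of 500 is a hypothetical figure chosen to match 57% on a 500-task set, and `resolve_rate_interval` is an illustrative helper, not part of any benchmark tooling.

```python
import math


def resolve_rate_interval(resolved: int, total: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (rate, low, high) using the normal approximation to the binomial."""
    if total <= 0:
        raise ValueError("total must be positive")
    rate = resolved / total
    half_width = z * math.sqrt(rate * (1 - rate) / total)
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width)


rate, low, high = resolve_rate_interval(resolved=285, total=500)
print(f"{rate:.1%} (95% CI roughly {low:.1%} to {high:.1%})")
```

On 500 tasks the interval is about four points wide on each side, so a one- or two-point gap between two systems is within the noise.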
Failure Mode Analysis
Understanding where agents fail is as important as knowing their score. Analysis of SWE-bench trajectories reveals consistent patterns:
Failure Type 1: Incorrect Diagnosis
The agent reads the issue, finds the symptom location, and fixes the symptom rather than the cause. The test still fails because the root cause is elsewhere.
Example: Issue says "function returns None when called with empty list." Agent adds a None check at the call site. But the actual bug is in a utility function three calls deep that corrupts the return value.
Frequency: ~25% of failures.
Failure Type 2: Correct Fix, Wrong Location
The agent understands what needs to change but modifies the wrong file or wrong function. The logic change is correct; the location is wrong.
Example: Issue involves a Django model. Agent fixes the view function (correct logic) but the bug is actually in the model's save() method.
Frequency: ~15% of failures.
Failure Type 3: Context Exhaustion
The agent runs out of context window before completing the task. It fills context with irrelevant files, repeated tool calls, or verbose reasoning, and then cannot proceed.
Frequency: ~20% of failures on complex, multi-file tasks.
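One common mitigation is to meter what goes into the context. The following is a minimal sketch, assuming a crude four-characters-per-token estimate in place of a real tokenizer; the `ContextBudget` class and its parameters are hypothetical and not taken from any particular agent framework.

```python
class ContextBudget:
    """Track approximate token use so tool output cannot silently fill the window."""

    def __init__(self, max_tokens: int, reserve_tokens: int = 0, chars_per_token: int = 4):
        self.max_tokens = max_tokens
        self.reserve_tokens = reserve_tokens    # kept free for the final edit and answer
        self.chars_per_token = chars_per_token  # crude heuristic, not a tokenizer
        self.used_tokens = 0

    def estimate(self, text: str) -> int:
        """Approximate token count, rounded up."""
        return -(-len(text) // self.chars_per_token)

    @property
    def remaining(self) -> int:
        return max(0, self.max_tokens - self.reserve_tokens - self.used_tokens)

    def admit(self, text: str, per_item_cap: int = 2000) -> str:
        """Truncate text to the per-item cap and remaining budget, then record its cost."""
        allowed_tokens = min(per_item_cap, self.remaining)
        allowed_chars = allowed_tokens * self.chars_per_token
        if len(text) > allowed_chars:
            text = text[:allowed_chars] + "\n[... truncated ...]"
        self.used_tokens += self.estimate(text)
        return text


budget = ContextBudget(max_tokens=1000, reserve_tokens=200)
shown = budget.admit("x" * 10_000, per_item_cap=100)
print(len(shown), budget.remaining)
```

Capping each tool result and reserving room for the final edit means one oversized file read costs a bounded amount instead of ending the run.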
Failure Type 4: Test Suite Blindness
The agent makes a change that fixes the described behavior but breaks other tests it did not run. The final evaluation fails because unrelated tests regressed.
Frequency: ~10% of failures.
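Catching this before submission comes down to comparing test outcomes before and after the change. Here is a small sketch under the assumption that both runs have been parsed into dictionaries of test ID to status; `find_regressions` and the status strings are illustrative names.

```python
def find_regressions(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return tests that passed before the change but no longer pass after it."""
    return sorted(
        test_id
        for test_id, status in before.items()
        if status == "passed" and after.get(test_id) != "passed"
    )


before = {"test_parse": "passed", "test_format": "passed", "test_bug": "failed"}
after = {"test_parse": "passed", "test_format": "failed", "test_bug": "passed"}
print(find_regressions(before, after))  # ['test_format']
```

A test that disappears from the second run is treated as a regression too, since a deleted or uncollected test can hide a break.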
Failure Type 5: Edit Execution Errors
The agent's edit command fails - old_str not found due to whitespace differences, encoding issues, or the file being different than expected. The agent gives up or produces a corrupted file.
Frequency: ~8% of failures.
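A more forgiving edit tool removes much of this failure class. The sketch below is one possible design, not the behavior of any specific agent's tool: `apply_edit` first tries an exact, unique match, then falls back to a line-by-line comparison that ignores leading and trailing whitespace, and otherwise raises an error the agent can act on.

```python
def apply_edit(content: str, old_str: str, new_str: str) -> str:
    """Replace exactly one occurrence of old_str, tolerating indentation drift.

    Raises ValueError with an actionable message instead of corrupting the file.
    """
    count = content.count(old_str)
    if count == 1:
        return content.replace(old_str, new_str, 1)
    if count > 1:
        raise ValueError(f"old_str matches {count} locations; add surrounding lines")

    # Fallback: compare line by line, ignoring leading/trailing whitespace.
    lines = content.splitlines(keepends=True)
    wanted = [ln.strip() for ln in old_str.splitlines()]
    if not wanted:
        raise ValueError("old_str is empty")
    hits = [
        i for i in range(len(lines) - len(wanted) + 1)
        if [ln.strip() for ln in lines[i:i + len(wanted)]] == wanted
    ]
    if len(hits) != 1:
        raise ValueError(f"old_str not found ({len(hits)} whitespace-insensitive matches)")
    start = hits[0]
    # new_str is inserted as given, so the caller supplies the final indentation.
    replacement = new_str if new_str.endswith("\n") else new_str + "\n"
    return "".join(lines[:start]) + replacement + "".join(lines[start + len(wanted):])


src = "def f(x):\n    if x is None:\n        return 0\n    return x\n"
print(apply_edit(src, "if x is None:\n    return 0", "    if x is None:\n        return -1"))
```

Refusing ambiguous matches matters as much as tolerating whitespace: a silent edit in the wrong place is worse than an error.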
Failure Type 6: Scope Creep
The agent decides to refactor, improve, or "clean up" code beyond what the task requires. These changes introduce unintended side effects.
Frequency: ~7% of failures.
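Scope creep tends to show up as patch size, which is cheap to measure. This sketch counts files and changed lines in a unified diff and flags patches above thresholds. The threshold values are arbitrary placeholders to tune for your own codebase, and both function names are hypothetical.

```python
def patch_stats(patch: str) -> dict:
    """Count files touched and lines added/removed in a git-style unified diff."""
    files, added, removed = set(), 0, 0
    for line in patch.splitlines():
        if line.startswith("diff --git "):
            files.add(line.split(" b/", 1)[-1])
        elif line.startswith("+") and not line.startswith("+++"):
            added += 1
        elif line.startswith("-") and not line.startswith("---"):
            removed += 1
    return {"files": len(files), "added": added, "removed": removed}


def looks_like_scope_creep(patch: str, max_files: int = 3, max_changed_lines: int = 60) -> bool:
    """Flag patches that exceed (placeholder) size thresholds for a bug fix."""
    stats = patch_stats(patch)
    return stats["files"] > max_files or stats["added"] + stats["removed"] > max_changed_lines


sample = (
    "diff --git a/pkg/core.py b/pkg/core.py\n"
    "--- a/pkg/core.py\n"
    "+++ b/pkg/core.py\n"
    "@@ -1,2 +1,4 @@\n"
    " def f(x):\n"
    "-    return x\n"
    "+    if x is None:\n"
    "+        return 0\n"
    "+    return x\n"
)
print(patch_stats(sample), looks_like_scope_creep(sample))
```

The header check is a simple heuristic: a changed line whose own content begins with `++` or `--` would be miscounted, which is acceptable for a warning signal.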
Repository-Specific Performance
Performance varies dramatically across repositories. Analysis reveals:
Higher-performing repos (better scores):
- requests - Clean, well-organized, simple bugs
- flask - Compact codebase, good documentation
- pytest - Well-defined behavior, excellent test suite
Lower-performing repos (worse scores):
- sympy - Complex mathematics requiring domain knowledge
- matplotlib - Visual output hard to verify programmatically
- django - Large codebase, complex ORM interactions, subtle behavioral dependencies
The lesson: agent performance is heavily influenced by codebase properties, not just task difficulty.
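A per-repository breakdown is straightforward to compute from per-instance results. The sketch assumes SWE-bench-style instance IDs of the form `owner__repo-number`; the IDs and outcomes below are made up for illustration, and the helper names are hypothetical.

```python
from collections import defaultdict


def repo_of(instance_id: str) -> str:
    """'django__django-11099' -> 'django/django' (assumes SWE-bench-style IDs)."""
    return instance_id.rsplit("-", 1)[0].replace("__", "/")


def per_repo_rates(results: dict[str, bool]) -> dict[str, float]:
    """Map each repo to the fraction of its instances that were resolved."""
    totals, wins = defaultdict(int), defaultdict(int)
    for instance_id, resolved in results.items():
        repo = repo_of(instance_id)
        totals[repo] += 1
        wins[repo] += int(resolved)
    return {repo: wins[repo] / totals[repo] for repo in sorted(totals)}


# Made-up outcomes, for illustration only.
results = {
    "django__django-0001": True,
    "django__django-0002": False,
    "psf__requests-0001": True,
    "scikit-learn__scikit-learn-0001": False,
}
print(per_repo_rates(results))
```

Because the repositories contribute very different numbers of tasks, the overall score is dominated by the largest ones; reporting the breakdown alongside the headline number shows where an agent is actually strong.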
Setting Up SWE-bench Locally
Running SWE-bench requires Docker and the evaluation harness. Here is how to set it up:
# Clone the SWE-bench repository
git clone https://github.com/princeton-nlp/SWE-bench.git
cd SWE-bench
# Install dependencies
pip install -e .
# The evaluation harness is the swebench package
# It will build Docker containers for each task you want to evaluate
Writing a Simple Agent Wrapper
To evaluate your agent on SWE-bench, you need to produce prediction files - JSON files mapping instance IDs to your agent's patch output.
"""
swebench_agent_wrapper.py - Wrap a coding agent for SWE-bench evaluation.
"""
import json
from pathlib import Path
from typing import Optional
import subprocess
import tempfile
import os
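def get_git_diff(repo_path: str) -> str:
    """Return all changes in the working tree as a unified patch."""
    # Stage everything first so files the agent created are included;
    # a plain `git diff` ignores untracked files.
    subprocess.run(["git", "add", "-A"], cwd=repo_path, check=True, capture_output=True)
    result = subprocess.run(
        ["git", "diff", "--cached", "--no-color"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout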
def run_agent_on_instance(
    instance_id: str,
    repo_path: str,
    issue_text: str,
    agent_fn,  # callable(repo_path, task) -> None (modifies repo in place)
) -> Optional[str]:
    """
    Run an agent on a single SWE-bench instance.
    Returns the git diff patch if the agent made changes, else None.
    """
    # Reset any existing changes: tracked, staged, and untracked
    subprocess.run(["git", "reset", "--hard"], cwd=repo_path, check=True, capture_output=True)
    subprocess.run(["git", "clean", "-fdq"], cwd=repo_path, check=True, capture_output=True)

    # Run the agent; a crash counts as "no patch" instead of aborting the whole run
    try:
        agent_fn(repo_path=repo_path, task=issue_text)
    except Exception as e:
        print(f"Agent failed on {instance_id}: {e}")
        return None

    # Collect the diff
    diff = get_git_diff(repo_path)
    if not diff.strip():
        print(f"Agent made no changes on {instance_id}")
        return None
    return diff
def evaluate_predictions(
    predictions_file: str,
    dataset_name: str = "princeton-nlp/SWE-bench_Verified",
    split: str = "test",
) -> dict:
    """
    Run the SWE-bench evaluation harness on a predictions file.
    predictions_file: JSON file of predictions, each carrying instance_id,
    model_patch, and model_name_or_path
    Returns: dict with the harness exit code and captured output. The harness
    writes its own logs and report to disk; this wrapper does not parse them.
    """
    result = subprocess.run(
        [
            "python", "-m", "swebench.harness.run_evaluation",
            "--dataset_name", dataset_name,
            "--split", split,
            "--predictions_path", predictions_file,
            "--max_workers", "4",
            "--run_id", "my_agent_eval",
        ],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    if result.returncode != 0:
        print("STDERR:", result.stderr)
    return {
        "returncode": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }
def build_predictions_file(
    instances: list[dict],
    agent_fn,
    output_path: str = "predictions.json",
) -> str:
    """
    Run an agent on a list of SWE-bench instances and write a predictions file.
    instances: list of dicts with keys: instance_id, repo, base_commit, problem_statement
    agent_fn: callable(repo_path, task) that modifies the repo in place
    """
    predictions = {}
    with tempfile.TemporaryDirectory() as tmpdir:
        for instance in instances:
            instance_id = instance["instance_id"]
            repo_url = f"https://github.com/{instance['repo']}.git"
            repo_path = os.path.join(tmpdir, instance_id)
            print(f"\nProcessing: {instance_id}")

            # Clone and checkout the correct base commit
            subprocess.run(
                ["git", "clone", repo_url, repo_path],
                check=True,
                capture_output=True,
            )
            subprocess.run(
                ["git", "checkout", instance["base_commit"]],
                cwd=repo_path,
                check=True,
                capture_output=True,
            )

            # Run the agent
            patch = run_agent_on_instance(
                instance_id=instance_id,
                repo_path=repo_path,
                issue_text=instance["problem_statement"],
                agent_fn=agent_fn,
            )
            predictions[instance_id] = {
                # The harness reads instance_id from inside each entry
                "instance_id": instance_id,
                "model_patch": patch or "",
                "model_name_or_path": "my-coding-agent",
            }

    # Write predictions
    with open(output_path, "w") as f:
        json.dump(predictions, f, indent=2)
    print(f"\nWrote {len(predictions)} predictions to {output_path}")
    return output_path
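Before spending hours of container time, it can help to sanity-check the predictions themselves. The sketch below assumes each entry needs the keys `instance_id`, `model_patch`, and `model_name_or_path`, which is my reading of the harness's documented prediction format; `check_predictions` is a hypothetical helper, not part of the `swebench` package.

```python
REQUIRED_KEYS = {"instance_id", "model_patch", "model_name_or_path"}


def check_predictions(predictions: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the entries look sane."""
    problems = []
    seen = set()
    for i, pred in enumerate(predictions):
        missing = REQUIRED_KEYS - pred.keys()
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
            continue
        if pred["instance_id"] in seen:
            problems.append(f"entry {i}: duplicate instance_id {pred['instance_id']}")
        seen.add(pred["instance_id"])
        patch = pred["model_patch"]
        # An empty patch is allowed: it simply scores as unresolved.
        if patch and not patch.lstrip().startswith(("diff --git", "--- ")):
            problems.append(f"entry {i}: model_patch does not look like a unified diff")
    return problems


entries = [
    {"instance_id": "a-1", "model_patch": "", "model_name_or_path": "my-coding-agent"},
    {"instance_id": "a-1", "model_patch": "oops", "model_name_or_path": "my-coding-agent"},
    {"model_patch": "diff --git a/x b/x\n"},
]
for problem in check_predictions(entries):
    print(problem)
```

If your predictions are stored as a dictionary keyed by instance ID, pass `list(predictions.values())`.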
Building Domain-Specific Evaluation Inspired by SWE-bench
SWE-bench evaluates on public Python repos. Your production use case probably involves private code in different languages. Here is how to build a SWE-bench-inspired evaluation for your own codebase:
"""
custom_eval_harness.py - Build a SWE-bench-style eval for your own codebase.
"""
import json
import subprocess
import tempfile
import os
from pathlib import Path
from dataclasses import dataclass, asdict
from typing import Optional
import datetime
@dataclass
class EvalInstance:
    """A single evaluation task."""
    instance_id: str
    repo_path: str                  # local path to repo
    base_commit: str                # git hash of the pre-fix state
    fixed_commit: str               # git hash of the fix (for reference)
    issue_description: str          # what the bug is
    test_command: str               # e.g., "pytest tests/test_users.py -v"
    expected_tests_pass: list[str]  # test IDs that must pass


@dataclass
class EvalResult:
    """Result of running an agent on an instance."""
    instance_id: str
    resolved: bool
    patch: str
    tests_passed: list[str]
    tests_failed: list[str]
    agent_steps: int
    error: Optional[str]
    timestamp: str
def collect_instances_from_git_history(
    repo_path: str,
    lookback_commits: int = 100,
) -> list[EvalInstance]:
    """
    Automatically collect evaluation instances from git history.
    Looks for commits that:
    - Reference an issue number (#NNN) or mention "fix" or "bug"
    - Added or modified test files
    """
    # Get recent commits, one "<hash> <subject>" line each
    result = subprocess.run(
        ["git", "log", f"-{lookback_commits}", "--oneline", "--no-decorate"],
        cwd=repo_path,
        capture_output=True,
        text=True,
    )
    instances = []
    for line in result.stdout.strip().split("\n"):
        parts = line.strip().split(" ", 1)
        if len(parts) < 2:
            continue
        commit_hash, message = parts

        # Keep only commits that look like fixes
        lowered = message.lower()
        if "#" not in message and "fix" not in lowered and "bug" not in lowered:
            continue

        # Check if this commit modified test files (simple path heuristic)
        files_result = subprocess.run(
            ["git", "diff-tree", "--no-commit-id", "-r", "--name-only", commit_hash],
            cwd=repo_path,
            capture_output=True,
            text=True,
        )
        changed_files = files_result.stdout.strip().split("\n")
        test_files = [f for f in changed_files if "test" in f.lower()]
        if not test_files:
            continue

        # Get the parent commit (pre-fix state); the root commit has none
        parent_result = subprocess.run(
            ["git", "rev-parse", f"{commit_hash}^"],
            cwd=repo_path,
            capture_output=True,
            text=True,
        )
        if parent_result.returncode != 0:
            continue
        base_commit = parent_result.stdout.strip()

        # Get commit message body as issue description
        body_result = subprocess.run(
            ["git", "log", "-1", "--format=%B", commit_hash],
            cwd=repo_path,
            capture_output=True,
            text=True,
        )
        instance = EvalInstance(
            instance_id=f"{os.path.basename(repo_path)}-{commit_hash[:8]}",
            repo_path=repo_path,
            base_commit=base_commit,
            fixed_commit=commit_hash,
            issue_description=body_result.stdout.strip(),
            # -v makes pytest print one PASSED/FAILED line per test,
            # which parse_pytest_output relies on
            test_command="pytest -v",
            expected_tests_pass=[],  # left empty here; fill in from a test run at fixed_commit
        )
        instances.append(instance)
    return instances
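def run_evaluation(
    instance: EvalInstance,
    agent_fn,
    verbose: bool = True,
) -> EvalResult:
    """Run an agent on a single eval instance and return the result."""
    timestamp = datetime.datetime.now().isoformat()
    with tempfile.TemporaryDirectory() as tmpdir:
        # Clone repo and checkout base commit
        repo_copy = os.path.join(tmpdir, "repo")
        subprocess.run(
            ["git", "clone", instance.repo_path, repo_copy],
            check=True,
            capture_output=True,
        )
        subprocess.run(
            ["git", "checkout", instance.base_commit],
            cwd=repo_copy,
            check=True,
            capture_output=True,
        )

        # Run the agent; it may return a step count (None is treated as 0)
        steps = 0
        error = None
        try:
            steps = agent_fn(
                repo_path=repo_copy,
                task=instance.issue_description,
            ) or 0
        except Exception as e:
            error = str(e)
            if verbose:
                print(f"Agent error: {e}")

        # Get the patch
        patch_result = subprocess.run(
            ["git", "diff", "--no-color"],
            cwd=repo_copy,
            capture_output=True,
            text=True,
        )
        patch = patch_result.stdout

        # Bring in the test files from the real fix, so the run checks the
        # behavior the fix introduced and not only the pre-fix tests
        names = subprocess.run(
            ["git", "diff", "--name-only", "--no-renames", "--diff-filter=AM",
             instance.base_commit, instance.fixed_commit],
            cwd=repo_copy,
            capture_output=True,
            text=True,
            check=True,
        ).stdout.split("\n")
        test_files = [f for f in names if f and "test" in f.lower()]
        if test_files:
            subprocess.run(
                ["git", "checkout", instance.fixed_commit, "--", *test_files],
                cwd=repo_copy,
                check=True,
                capture_output=True,
            )

        # Run the test suite; a timeout counts as a failed run
        try:
            test_result = subprocess.run(
                instance.test_command.split(),
                cwd=repo_copy,
                capture_output=True,
                text=True,
                timeout=300,
            )
            output = test_result.stdout + test_result.stderr
        except subprocess.TimeoutExpired:
            output = ""
            error = error or "test command timed out"

        # Parse test results
        passed, failed = parse_pytest_output(output)
        resolved = (
            len(failed) == 0
            and len(passed) > 0
            and all(t in passed for t in instance.expected_tests_pass)
        )
        if verbose:
            status = "RESOLVED" if resolved else "UNRESOLVED"
            print(f" {instance.instance_id}: {status} ({len(passed)} passed, {len(failed)} failed)")
        return EvalResult(
            instance_id=instance.instance_id,
            resolved=resolved,
            patch=patch,
            tests_passed=passed,
            tests_failed=failed,
            agent_steps=steps,
            error=error,
            timestamp=timestamp,
        )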
def parse_pytest_output(output: str) -> tuple[list[str], list[str]]:
    """Parse verbose (-v) pytest output to extract passed and failed test IDs."""
    passed = []
    failed = []
    for line in output.split("\n"):
        line = line.strip()
        if " PASSED" in line:
            test_id = line.split(" PASSED")[0].strip()
            passed.append(test_id)
        elif " FAILED" in line:
            test_id = line.split(" FAILED")[0].strip()
            failed.append(test_id)
        elif " ERROR" in line and "::" in line:
            test_id = line.split(" ERROR")[0].strip()
            failed.append(test_id)
    return passed, failed
def run_benchmark(
    instances: list[EvalInstance],
    agent_fn,
    output_file: str = "eval_results.json",
) -> dict:
    """Run a full benchmark evaluation."""
    results = []
    for i, instance in enumerate(instances):
        print(f"\n[{i+1}/{len(instances)}] {instance.instance_id}")
        result = run_evaluation(instance, agent_fn)
        results.append(asdict(result))

    # Compute summary stats
    total = len(results)
    resolved = sum(1 for r in results if r["resolved"])
    resolution_rate = resolved / total if total > 0 else 0
    summary = {
        "total": total,
        "resolved": resolved,
        "unresolved": total - resolved,
        "resolution_rate": resolution_rate,
        "results": results,
    }
    with open(output_file, "w") as f:
        json.dump(summary, f, indent=2)

    print(f"\n{'='*50}")
    print("Benchmark Results:")
    print(f" Total: {total}")
    print(f" Resolved: {resolved} ({resolution_rate:.1%})")
    print(f" Results written to: {output_file}")
    return summary
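A harvested instance is only useful if the reference fix actually changes a test outcome. The sketch below shows one way to validate that, assuming you have run the tests once at the base commit (with the fix's test files applied) and once at the fixed commit, and parsed both runs into dictionaries. The function names and status strings are hypothetical.

```python
def classify_tests(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Split tests by how the reference fix changed their outcome.

    before: test ID -> status at base_commit (with the fix's test files applied)
    after:  test ID -> status at fixed_commit
    """
    fail_to_pass, pass_to_pass = [], []
    for test_id, status in after.items():
        if status != "passed":
            continue
        if before.get(test_id) == "passed":
            pass_to_pass.append(test_id)
        else:
            fail_to_pass.append(test_id)  # failed, errored, or did not exist before
    return {"fail_to_pass": sorted(fail_to_pass), "pass_to_pass": sorted(pass_to_pass)}


def is_usable_instance(before: dict[str, str], after: dict[str, str]) -> bool:
    """An instance only measures something if the fix flips at least one test."""
    return bool(classify_tests(before, after)["fail_to_pass"])


before = {"test_login": "passed", "test_empty_list": "failed"}
after = {"test_login": "passed", "test_empty_list": "passed"}
print(classify_tests(before, after))
print(is_usable_instance(before, after))
```

Instances with no fail-to-pass test would be marked resolved even by an agent that changes nothing, so they are best dropped from the benchmark.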
Overfitting Concerns and Benchmark Contamination
The Contamination Problem
Training data contamination is a legitimate concern with SWE-bench. If the training data for a model includes the GitHub PRs that SWE-bench tests are drawn from, the model may have seen the solutions during training.
SWE-bench Verified does not remove this risk: its 500 tasks are a subset of the original benchmark, so they come from the same public pull requests. And because models are continuously retrained on newer data, contamination analysis is ongoing work.
Signs of potential contamination:
- Performance on specific repos significantly exceeds the broader benchmark
- The agent reproduces exact solutions that are not inferable from the issue description
- Performance degrades sharply on tasks from repositories that are less represented in training data
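The second sign can be screened for mechanically. The sketch below measures what fraction of the lines a model's patch adds also appear verbatim in the reference patch. It is a rough heuristic under my own assumptions: short or obvious fixes will overlap heavily for innocent reasons, so a high score is a prompt for manual review, not proof of contamination. The function names are hypothetical.

```python
def added_lines(patch: str) -> list[str]:
    """Non-blank lines that a unified diff adds, with whitespace stripped."""
    return [
        line[1:].strip()
        for line in patch.splitlines()
        if line.startswith("+") and not line.startswith("+++") and line[1:].strip()
    ]


def verbatim_overlap(model_patch: str, gold_patch: str) -> float:
    """Fraction of the model's added lines that also appear in the reference patch."""
    model = added_lines(model_patch)
    if not model:
        return 0.0
    gold = set(added_lines(gold_patch))
    return sum(line in gold for line in model) / len(model)


gold = "+++ b/x.py\n+    return value or 0\n"
model = "+++ b/x.py\n+    return value or 0\n+    # handle empty input\n"
print(verbatim_overlap(model, gold))  # 0.5
```

Comparing this score across repositories, or across tasks created before and after a model's training cutoff, gives a more telling signal than any single patch.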
Overfitting to the Benchmark
As SWE-bench becomes the standard, there is pressure to optimize specifically for it. This can manifest as:
- System prompt engineering - prompts tuned specifically for the types of issues in SWE-bench repos
- Repository-specific fine-tuning - fine-tuning on Django or scikit-learn issues
- Evaluation harness awareness - agents that detect they are being evaluated and behave differently
This is why multilingual benchmarks such as Multi-SWE-bench (multiple languages, different repos) are valuable - it is harder to overfit to a broader, more diverse benchmark.
Verification of Benchmark Claims
When you read a benchmark claim:
- Which variant? SWE-bench, Lite, or Verified are different benchmarks
- What evaluation conditions? Internet access, time limit, model version
- Human evaluation? Were results spot-checked by humans?
- Independently reproduced? Can another team reproduce the result?
- When? SOTA changes fast - a 6-month-old number may be outdated
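The checklist above can be kept as a small record type so that gaps in a claim are explicit. This is an illustrative sketch: `BenchmarkClaim`, its fields, and the two helper functions are hypothetical names, and the example claims are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BenchmarkClaim:
    """Metadata needed to interpret a published score (illustrative fields)."""
    system: str
    score: float                                     # fraction resolved, 0.0 to 1.0
    variant: Optional[str] = None                    # "full", "Lite", or "Verified"
    conditions: Optional[str] = None                 # internet access, time limit, model version
    spot_checked: Optional[bool] = None              # did humans review a sample of results?
    independently_reproduced: Optional[bool] = None
    reported: Optional[str] = None                   # e.g. "2025-03"


def open_questions(claim: BenchmarkClaim) -> list[str]:
    """List the checklist items the claim leaves unanswered."""
    checks = [
        (claim.variant, "Which variant?"),
        (claim.conditions, "What evaluation conditions?"),
        (claim.spot_checked, "Human evaluation?"),
        (claim.independently_reproduced, "Independently reproduced?"),
        (claim.reported, "When?"),
    ]
    return [question for value, question in checks if value is None]


def comparable(a: BenchmarkClaim, b: BenchmarkClaim) -> bool:
    """Two scores can only be compared when both name the same variant."""
    return a.variant is not None and a.variant == b.variant


# Invented claims, for illustration only.
claim_a = BenchmarkClaim("agent-a", 0.57, variant="Verified", reported="2025-01")
claim_b = BenchmarkClaim("agent-b", 0.40, variant="Lite")
print(open_questions(claim_a))
print(comparable(claim_a, claim_b))
```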
:::tip Build your own eval For production systems, SWE-bench is a proxy. Build your own evaluation using real tasks from your actual codebase. Collect 20-50 historical bug fixes, write the evaluation harness, and measure your agent on those. That number will be far more predictive of real-world utility than a public benchmark score. :::
:::danger Don't optimize for SWE-bench alone A 60% SWE-bench score does not mean 60% of your team's tasks will be automated. SWE-bench covers Python, bug fixes, and specific repository types. Your codebase is different. Use SWE-bench as a relative comparison tool, not an absolute predictor. :::
Interview Q&A
Q: What is SWE-bench and why is it considered the gold standard for evaluating coding agents?
A: SWE-bench is a benchmark of 2,294 real GitHub issues from 12 popular Python repositories. Each task provides the pre-fix repository state and an issue description; the agent must modify the code so that the held-out test suite passes. It is considered authoritative because: (1) it uses real software engineering tasks, not artificially constructed puzzles; (2) test-based evaluation is objective and reproducible; (3) it is widely adopted, enabling direct comparison across systems; (4) the tasks are genuinely difficult - requiring code navigation, root cause analysis, and correct edits.
Q: What is the difference between SWE-bench, SWE-bench Lite, and SWE-bench Verified?
A: SWE-bench is the full 2,294-task benchmark. SWE-bench Lite is a 300-task subset, filtered toward simpler, self-contained fixes, for faster evaluation. SWE-bench Verified is a 500-task human-verified subset where each task was manually reviewed to ensure it is well-specified, solvable from the given information, and fairly testable. Verified is generally preferred for current evaluation because it removes ambiguous or flawed tasks that made the original benchmark noisy.
Q: If Claude achieves 57% on SWE-bench Verified, what does that mean practically?
A: It means Claude autonomously resolves 57% of real GitHub bug reports - finding the relevant code, understanding the bug, making the correct fix, and passing the test suite - with no human assistance. For context: these are real issues that required human software engineers to fix when they were originally encountered. 57% autonomous resolution represents a dramatic productivity multiplier for tasks of this type. The 43% it cannot solve tends to involve complex architectural changes, ambiguous specifications, or domain-specific knowledge not present in the codebase.
Q: How would you build a SWE-bench-style evaluation for a private codebase?
A: The key steps: (1) Collect historical bug fixes from git history - commits that reference issues and modified test files; (2) For each fix, record the pre-fix repository state (base commit) and the issue description; (3) Build a test runner that checks out the base commit, runs the agent, then executes the test suite; (4) Score binary: resolved if all tests pass, unresolved otherwise. The most important aspect is having a reliable, automated test suite. Without tests, you cannot objectively measure whether the fix is correct.
Q: What are the main ways coding agents fail on SWE-bench tasks?
A: The top failure modes are: incorrect diagnosis (~25%) - fixing the symptom rather than the root cause; context exhaustion (~20%) - filling the context window with irrelevant files before understanding the real problem; wrong location (~15%) - correct fix logic but applied to the wrong function or file; test blindness (~10%) - the fix passes the targeted test but breaks unrelated tests; edit execution errors (~8%) - the old_str not matching exactly due to whitespace or encoding differences; and scope creep (~7%) - unnecessary refactoring that introduces regressions.
Q: How should engineers interpret published benchmark numbers when evaluating coding agent products?
A: With significant skepticism and several questions: Which benchmark variant? (Verified vs full vs Lite are not comparable.) What conditions? (Internet access, time limit, model version all matter.) Is it independently verified, and on which instances? (Cognition reported 13.86% for Devin, but on a random 25% subset of the benchmark, so it is not directly comparable with full-benchmark results.) How recent? (SOTA moves fast; six-month-old numbers may be significantly behind.) And most importantly: does the benchmark task distribution match your actual use case? A 57% SWE-bench score on Python bug fixes tells you little about how the agent will perform on your TypeScript monorepo.
