SWE-bench Verified is the gold standard for evaluating coding agents on real GitHub issues. Learn the evaluation methodology, Docker harness, failure mode taxonomy, and how to interpret benchmark scores.

SWE-bench Verified

The Benchmark That Doesn't Lie

Fix a real GitHub issue. Make the existing tests pass. No hints. No guided steps. No partial credit.

SWE-bench Verified is the gold standard for coding agent evaluation because it is ruthlessly honest. Either the tests pass or they do not. Either the patch applies cleanly or it does not. There is no "almost correct" and no "it looks right but we cannot verify" - the evaluation is automated, reproducible, and unambiguous.

When an AI lab reports "50% on SWE-bench Verified," they mean: given 500 real GitHub issues from popular Python repositories, their agent submitted code patches that, when applied, made the tests designated for each issue pass for 250 of them. That is a number you can trust.

That makes SWE-bench Verified uniquely valuable in a field full of benchmarks that are gamed, contaminated, or measuring something slightly different from what matters.

Why SWE-bench Was Created

The Problem With Code Generation Benchmarks

Before SWE-bench, coding benchmarks were largely function-completion tasks: "complete this function given a docstring." HumanEval, MBPP, CodeContests. These measured a real capability - local code generation - but missed everything that makes software engineering actually hard: understanding a codebase, reading bug reports, navigating multiple files, reasoning about state across a system.

SWE-bench (Jimenez et al., Princeton, 2023) was designed to fill this gap. The key insight: real software engineering involves:

Understanding a bug report or feature request in natural language
Navigating a large existing codebase
Identifying the relevant files and functions
Making targeted changes that fix the issue
Not breaking anything else

This is exactly what professional software engineers do. It is also exactly what capable coding agents need to do. A benchmark built from real GitHub issues tests real engineering competence.

SWE-bench Original vs SWE-bench Verified

The Problem With the Original

SWE-bench (original, 2023) had 2,294 issues from 12 GitHub repositories. It was groundbreaking, but it had a critical flaw: a large share of the "issues" were low-quality evaluation targets (OpenAI's later human review filtered out roughly two-thirds of the samples it screened). Some issues had:

Tests that could be passed trivially by adding an empty pass statement
Test suites that were already broken before the issue was filed
Issues that were resolved in a way that contradicted the actual intent
Ambiguous problem statements that made multiple contradictory solutions "correct"

This made scores unreliable: a weak test could credit an incorrect patch, while an over-specific test or an underspecified issue could reject a valid one.

The Verified Solution

SWE-bench Verified (Chowdhury et al., OpenAI, 2024) took the original dataset and worked with professional software developers to manually screen a random sample of 1,699 issues from it. The process for each screened issue:

Review the GitHub issue and understand the actual intent
Review the gold patch solution that resolves it
Review the test cases that validate the fix
Confirm that: the problem is well-defined, the gold patch correctly resolves it, the tests actually verify what they claim to verify

After screening, 500 issues confirmed as high-quality evaluation targets were released as SWE-bench Verified. The other 1,794 issues in the original set were left out - not necessarily broken, but not verified.

The result: SWE-bench Verified is more reliable and more representative of genuine coding agent capability. Scores on Verified are typically higher than on the original for the same system, because tasks that no correct patch could reliably pass are gone - and they are more honest.

Evaluation Methodology

The Docker Harness

SWE-bench Verified uses a fully isolated Docker-based evaluation harness. This is essential for:

Reproducibility: Every evaluation runs in the same environment
Safety: Agent-generated code runs in isolation, not on the host system
Fairness: All agents face identical execution environments
Correctness: Tests run against the exact repository state described in the issue

The evaluation flow for each issue:

1. Pull Docker image for the specific repo and issue
   └── Image contains: repo at the commit before the fix, all dependencies
2. Apply the agent's submitted patch
   └── Patch format: unified diff, applied with `git apply`
3. Run the test suite
   └── `pytest` or repo-specific test runner
4. Check test results
   └── Task-relevant tests pass? → SUCCESS
   └── Any task-relevant tests fail? → FAILURE
   └── Patch didn't apply? → FAILURE (unresolved)
5. Report result

The test check is targeted: the evaluation does not require every test in the repository to pass, only the tests tied to the issue (the FAIL_TO_PASS and PASS_TO_PASS lists in each task instance). This matters because a fix sometimes changes behavior that other tests depend on - and in a real fix, those tests would have been updated too.
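The pass/fail decision in step 4 can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: the function name `is_resolved`, the `"PASSED"` status string, and the test names are assumptions made for the example.

```python
from typing import Dict, Iterable


def is_resolved(
    test_status: Dict[str, str],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    patch_applied: bool = True,
) -> bool:
    """Return True only if the patch applied and every listed test passed."""
    if not patch_applied:
        return False
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test that is missing from the results counts as a failure.
    return all(test_status.get(name) == "PASSED" for name in required)


# Hypothetical test outcomes after applying an agent's patch.
status = {
    "test_memoryview": "PASSED",
    "test_existing_behavior": "PASSED",
    "test_unrelated_module": "FAILED",  # not listed for this task, so ignored
}

print(is_resolved(status, ["test_memoryview"], ["test_existing_behavior"]))  # True
```

Note that the unrelated failing test does not affect the outcome, while a single failing or missing test from either list does.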

Task Instance Format

Each SWE-bench task instance has this structure:

{
  "instance_id": "django__django-11099",
  "repo": "django/django",
  "base_commit": "a3e2a584b7...",
  "problem_statement": "HttpResponse doesn't handle memoryview objects\n\nWhen I...",
  "hints_text": "",
  "created_at": "2019-04-24T19:05:15Z",
  "patch": "diff --git a/django/http/response.py ...",
  "test_patch": "diff --git a/tests/httpwrappers/tests.py ...",
  "version": "2.2",
  "PASS_TO_PASS": ["tests.httpwrappers.tests.QueryDictTests.test_..."],
  "FAIL_TO_PASS": ["tests.httpwrappers.tests.HttpResponseTests.test_memoryview"]
}

The critical fields:

problem_statement: what the agent sees - the GitHub issue text
patch: the gold patch solution (NOT shown to the agent)
test_patch: tests added to verify the fix (NOT shown to the agent)
FAIL_TO_PASS: tests that must change from failing to passing for success
PASS_TO_PASS: existing tests that must continue to pass (regression check)
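To make the visible/hidden split concrete, here is a small sketch of how an evaluation script might strip the evaluation-only fields before handing an instance to an agent. The helper names `agent_view` and `as_test_list` are hypothetical, and treating the two test lists as hidden (and possibly stored as JSON-encoded strings) is an assumption for this example.

```python
import json
from typing import Any, Dict, List

# Assumption: the gold patch, the test patch, and both test lists are withheld.
HIDDEN_FIELDS = {"patch", "test_patch", "FAIL_TO_PASS", "PASS_TO_PASS"}


def agent_view(instance: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the instance without evaluation-only fields."""
    return {k: v for k, v in instance.items() if k not in HIDDEN_FIELDS}


def as_test_list(value: Any) -> List[str]:
    """Normalize a test list that may arrive as a list or a JSON string."""
    if isinstance(value, str):
        return json.loads(value)
    return list(value)


instance = {
    "instance_id": "django__django-11099",
    "repo": "django/django",
    "problem_statement": "HttpResponse doesn't handle memoryview objects",
    "patch": "diff --git a/django/http/response.py ...",
    "test_patch": "diff --git a/tests/httpwrappers/tests.py ...",
    "FAIL_TO_PASS": '["tests.httpwrappers.tests.HttpResponseTests.test_memoryview"]',
}

print(sorted(agent_view(instance)))
# ['instance_id', 'problem_statement', 'repo']
print(as_test_list(instance["FAIL_TO_PASS"]))
# ['tests.httpwrappers.tests.HttpResponseTests.test_memoryview']
```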

Task Difficulty Distribution

Not all SWE-bench Verified tasks are equally hard. Understanding the distribution helps interpret scores:

Difficulty Factor	Easier	Harder
Repository familiarity	Flask, Requests (simpler codebase)	Django, Sympy (complex, massive)
Issue clarity	Clear bug with reproduction steps	Vague feature request
Files to modify	1-2 files	5+ files across subsystems
Lines of code changed	1-20 lines	50+ lines
Test suite complexity	Simple unit tests	Complex integration tests
Domain knowledge required	General Python	Domain-specific (math, ORM, HTTP)

Repository Hardness Rankings (approximate)

Based on observed agent success rates:

Repository	Relative Difficulty	Reason
Requests	Easier	Small codebase, clear architecture
Flask	Easier	Well-documented, limited scope
Pylint	Medium	Large but modular
Scikit-learn	Medium	Complex but well-tested
Matplotlib	Hard	Large, complex rendering pipeline
Sympy	Hard	Deep math domain knowledge required
Django	Hardest	Massive, complex ORM/HTTP/admin system
Astropy	Hardest	Astronomy domain knowledge + complex code
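A per-repository breakdown like the one above can be computed from your own results. The sketch below assumes instance IDs follow the `owner__repo-number` pattern shown in the task instance example; the outcomes in `outcomes` are made up for illustration.

```python
from collections import defaultdict
from typing import Dict


def repo_of(instance_id: str) -> str:
    """Map 'django__django-11099' to 'django/django'."""
    owner_repo, _, _issue_number = instance_id.rpartition("-")
    return owner_repo.replace("__", "/", 1)


def resolve_rate_by_repo(resolved: Dict[str, bool]) -> Dict[str, float]:
    """Group instance outcomes by repository and return the resolve rate of each."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for instance_id, ok in resolved.items():
        repo = repo_of(instance_id)
        totals[repo] += 1
        wins[repo] += int(ok)
    return {repo: wins[repo] / totals[repo] for repo in totals}


# Hypothetical outcomes, not real evaluation results.
outcomes = {
    "django__django-11099": True,
    "django__django-11133": False,
    "psf__requests-2317": True,
}

print(resolve_rate_by_repo(outcomes))
# {'django/django': 0.5, 'psf/requests': 1.0}
```

Per-repository rates computed this way rest on small sample sizes for the smaller repositories, so treat differences between them with caution.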

SOTA Timeline: 2023–2025

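The progress on SWE-bench has been remarkable and instructive. Note that SWE-bench Verified was only released in August 2024, so the earlier entries were reported on the full SWE-bench or SWE-bench Lite and the rows are not strictly comparable: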

Date	System	Score	Key Innovation
Oct 2023	Claude 2 + BM25 retrieval (original paper baseline)	1.96%	Retrieval only, no agent loop
Mar 2024	Devin (Cognition)	13.86%	Long-horizon planning
Apr 2024	SWE-agent (improved)	18.0%	Better file navigation
Jun 2024	AgentLess	27.3%	Simplified localization
Sep 2024	OpenHands (OpenDevin)	37.4%	Better context management
Nov 2024	Claude agents (Anthropic)	49.0%	Extended thinking
Feb 2025	Best reported systems	54%+	Ensemble + verification
-	Human (baseline)	~86%	Domain expert with time

What drove the gains:

2023-2024: Basic scaffolding - file navigation, issue reading, patch generation. Going from 2% to 15% was mostly about building the basic agentic loop.
2024: Localization improvements - identifying which files to edit became much more precise. Going from 15% to 35%.
2024-2025: Context management and reasoning - handling large codebases, multi-file changes, and complex reasoning chains. Going from 35% to 50%+.
2025+: Verification and ensemble - agents that verify their patches, retry on failure, and use multiple strategies. Pushing past 50%.

Failure Mode Taxonomy

Understanding why agents fail on SWE-bench is as valuable as the score itself. Five primary failure categories:

1. Localization Failure (35% of failures)

The agent cannot identify which files or functions need to change. Symptoms:

Agent modifies the wrong file entirely
Agent makes correct-sounding changes in the wrong location
Agent exhausts context reading irrelevant files

Root cause: poor repository navigation strategy, weak codebase understanding.

2. Scope Misunderstanding (25% of failures)

The agent fixes the immediate symptom but not the underlying cause, or over-engineers beyond what the issue requires. Symptoms:

PASS_TO_PASS tests break (agent changed something it should not have)
Patch applies but tests still fail for the core issue
Agent adds features instead of fixing the bug

Root cause: insufficient issue comprehension, poor judgment about scope.

3. Logical Errors (20% of failures)

The agent correctly identifies what needs to change but implements the change incorrectly. Symptoms:

Patch applies, tests run, but test assertions fail
Edge cases handled incorrectly
Off-by-one errors, incorrect type handling

Root cause: reasoning errors in the implementation step.

4. Context Window Issues (12% of failures)

The agent runs out of effective context for large repositories. Symptoms:

Agent "forgets" earlier findings and repeats exploration
Agent makes changes inconsistent with earlier analysis
Agent truncates important context

Root cause: long trajectories, large file contents exceeding effective context.

5. Syntax/Application Errors (8% of failures)

The patch cannot be applied cleanly. Symptoms:

git apply fails with "patch does not apply"
Indentation errors, merge conflicts with base
Invalid Python syntax

Root cause: poor patch generation, not tracking exact line numbers and indentation.
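A first-pass triage of failed instances can be automated from signals the harness already gives you. The sketch below is a rough heuristic under stated assumptions, not an established classifier: `classify_failure` and `files_in_patch` are hypothetical helpers, "localization" is approximated as "the agent touched none of the files in the gold patch", and context window issues are not detected at all because they require inspecting the agent's trajectory.

```python
import re
from typing import Set


def files_in_patch(patch: str) -> Set[str]:
    """Collect the b/ paths from 'diff --git a/... b/...' header lines."""
    return set(re.findall(r"^diff --git a/\S+ b/(\S+)$", patch, flags=re.MULTILINE))


def classify_failure(
    patch_applied: bool,
    model_patch: str,
    gold_patch: str,
    fail_to_pass_ok: bool,
    pass_to_pass_ok: bool,
) -> str:
    """Assign one coarse label per instance, checking the cheapest signal first."""
    if not patch_applied:
        return "syntax/application"
    if not files_in_patch(model_patch) & files_in_patch(gold_patch):
        return "localization"
    if not pass_to_pass_ok:
        return "scope"  # right area, but something that worked is now broken
    if not fail_to_pass_ok:
        return "logic"  # right area, nothing broken, target tests still fail
    return "resolved"


gold = "diff --git a/django/http/response.py b/django/http/response.py\n"
wrong_file = "diff --git a/django/http/request.py b/django/http/request.py\n"

print(classify_failure(True, wrong_file, gold, False, True))  # localization
print(classify_failure(True, gold, gold, False, True))        # logic
```

Labels produced this way are a starting point for manual review; a patch can fix an issue correctly in a different file than the gold patch, and this heuristic would still flag it as a localization failure if the tests fail.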

Contamination Concerns

SWE-bench contamination is a legitimate concern: many of these GitHub issues and their solutions are public. They could appear in model training data.

How Contamination Is Detected

Date filtering: All SWE-bench issues have dates. Models trained before those dates cannot have seen the solutions.
Memorization probes: Ask the model to reproduce the gold patch without being given the repository. If it can, it may have memorized it.
Distribution analysis: Compare performance on issues created before and after the model's training cutoff. Contaminated models show higher scores on pre-cutoff issues.
Alternative benchmarks: SWE-bench Lite (300 issues) and Multi-SWE-bench (multilingual) provide cross-validation.
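The distribution analysis can be sketched as a comparison of resolve rates on either side of a training cutoff, using the `created_at` field from each task instance. The function name `rates_around_cutoff`, the instance IDs, and the outcomes below are all illustrative assumptions.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple


def parse_created_at(value: str) -> datetime:
    """Parse timestamps in the '2019-04-24T19:05:15Z' format."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def rates_around_cutoff(
    instances: List[Dict[str, str]],
    resolved: Dict[str, bool],
    cutoff: datetime,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (pre-cutoff rate, post-cutoff rate); None for an empty bucket."""
    pre: List[bool] = []
    post: List[bool] = []
    for inst in instances:
        bucket = pre if parse_created_at(inst["created_at"]) < cutoff else post
        bucket.append(resolved.get(inst["instance_id"], False))

    def rate(xs: List[bool]) -> Optional[float]:
        return sum(xs) / len(xs) if xs else None

    return rate(pre), rate(post)


# Hypothetical instances and outcomes, for illustration only.
instances = [
    {"instance_id": "example__repo-1", "created_at": "2019-04-24T19:05:15Z"},
    {"instance_id": "example__repo-2", "created_at": "2020-01-10T08:00:00Z"},
    {"instance_id": "example__repo-3", "created_at": "2023-03-01T12:30:00Z"},
]
resolved = {"example__repo-1": True, "example__repo-2": True, "example__repo-3": False}
cutoff = datetime(2021, 1, 1, tzinfo=timezone.utc)

print(rates_around_cutoff(instances, resolved, cutoff))  # (1.0, 0.0)
```

A large gap between the two rates is a warning sign rather than proof: older and newer issues can also differ in difficulty, so the comparison is most informative when both buckets are reasonably large.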

Mitigation

The SWE-bench team has released tools for contamination detection. When reporting results, labs should specify: model training cutoff, which benchmark variant and split was used, and any contamination analysis performed.

SWE-bench Variants

SWE-bench Lite

300 tasks, selected from SWE-bench to be representative but faster to evaluate. Use SWE-bench Lite for:

Rapid experimentation (full Verified takes 8+ hours to evaluate, Lite takes 3)
Compute-constrained settings
Initial iteration before committing to full Verified evaluation

Scores on Lite are not directly comparable to Verified - different task selection.

Multi-SWE-bench (2025)

A multilingual extension that goes beyond Python to cover Java, TypeScript, JavaScript, Go, Rust, C, and C++. Harder for two reasons:

Less training data for non-Python languages in most models
Tool ecosystems (compilers, package managers, test runners) are more diverse

Multi-SWE-bench scores are significantly lower than Python-only Verified scores for all current systems.

Running SWE-bench Locally

Setup

# Install the evaluation harness
pip install swebench

# Docker must be installed and running. The harness builds or pulls the
# per-instance images (several GB in total) when an evaluation first runs,
# so a separate manual pull is normally not needed.
docker info

Running Evaluation

"""
SWE-bench evaluation runner - integrates your agent with the evaluation harness.
"""

import json
import subprocess
import tempfile
import os
from dataclasses import dataclass
from typing import Optional

# In production: pip install swebench
# from swebench.harness.run_evaluation import main as run_evaluation


@dataclass
class PatchSubmission:
    """A patch submission for a SWE-bench task."""
    instance_id: str
    model_patch: str          # The unified diff patch your agent generated
    model_name_or_path: str   # Name of your system for leaderboard


@dataclass
class EvaluationResult:
    instance_id: str
    resolved: bool            # All FAIL_TO_PASS and PASS_TO_PASS tests pass
    pass_to_pass: float       # Fraction of PASS_TO_PASS tests still passing
    patch_applied: bool       # Whether the patch could be applied at all


class SWEBenchRunner:
    """
    Wrapper for running SWE-bench evaluation.
    Calls the official harness for correct, reproducible evaluation.
    """

    def __init__(self, predictions_path: str, model_name: str):
        self.predictions_path = predictions_path
        self.model_name = model_name

    def save_predictions(self, submissions: list[PatchSubmission]) -> str:
        """Save predictions in the format expected by the SWE-bench harness."""
        predictions = {}
        for sub in submissions:
            predictions[sub.instance_id] = {
                "model_patch": sub.model_patch,
                "model_name_or_path": sub.model_name_or_path,
                "instance_id": sub.instance_id,
            }

        # dirname is "" for a bare filename, and os.makedirs("") raises
        out_dir = os.path.dirname(self.predictions_path)
        if out_dir:
            os.makedirs(out_dir, exist_ok=True)
        with open(self.predictions_path, "w") as f:
            json.dump(predictions, f, indent=2)

        print(f"Saved {len(submissions)} predictions to {self.predictions_path}")
        return self.predictions_path

    def run_evaluation(
        self,
        dataset_name: str = "princeton-nlp/SWE-bench_Verified",
        split: str = "test",
        max_workers: int = 4,
    ) -> dict:
        """
        Run the official SWE-bench evaluation harness.
        Returns a dict mapping instance_id -> resolved (bool).
        """
        # The actual harness command:
        cmd = [
            "python", "-m", "swebench.harness.run_evaluation",
            "--dataset_name", dataset_name,
            "--split", split,
            "--predictions_path", self.predictions_path,
            "--max_workers", str(max_workers),
            "--run_id", self.model_name,
        ]

        print(f"Running: {' '.join(cmd)}")
        result = subprocess.run(cmd, capture_output=True, text=True)

        if result.returncode != 0:
            print(f"Evaluation failed:\n{result.stderr}")
            return {}

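        # Load the report written by the harness.
        # NOTE: this path and the per-instance {"resolved": bool} schema are
        # assumptions; the report location and format differ between harness
        # versions, so check the output of your installed version.
        results_path = f"logs/run_evaluation/{self.model_name}/results.json"
        try:
            with open(results_path) as f:
                return json.load(f)
        except FileNotFoundError:
            print(f"Results file not found at {results_path} - check logs for errors")
            return {}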

    def compute_score(self, results: dict) -> float:
        """Compute the overall SWE-bench score from results dict."""
        if not results:
            return 0.0
        resolved = sum(1 for r in results.values() if r.get("resolved", False))
        return resolved / len(results)


# ── Agent patch generator ──────────────────────────────────────────────────────

import anthropic
client = anthropic.Anthropic()


class CodingAgent:
    """
    A coding agent that generates patches for SWE-bench issues.
    This implements the core logic; the actual agent would have
    more sophisticated file navigation and editing.
    """

    def __init__(self, max_steps: int = 40):
        self.max_steps = max_steps
        self.tools = self._build_tools()

    def _build_tools(self) -> list[dict]:
        return [
            {
                "name": "read_file",
                "description": "Read a file from the repository.",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "path": {"type": "string", "description": "Relative file path"},
                    },
                    "required": ["path"],
                },
            },
            {
                "name": "list_directory",
                "description": "List files in a directory.",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "path": {"type": "string", "default": "."},
                    },
                    "required": [],
                },
            },
            {
                "name": "search_code",
                "description": "Search for a pattern in the repository code.",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "pattern": {"type": "string"},
                        "file_pattern": {"type": "string", "description": "e.g., '*.py'"},
                    },
                    "required": ["pattern"],
                },
            },
            {
                "name": "edit_file",
                "description": "Make a targeted edit to a file. Specify old content and new content.",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "path": {"type": "string"},
                        "old_content": {"type": "string", "description": "Exact content to replace"},
                        "new_content": {"type": "string", "description": "Replacement content"},
                    },
                    "required": ["path", "old_content", "new_content"],
                },
            },
            {
                "name": "run_tests",
                "description": "Run specific tests to check if fixes are working.",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "test_paths": {
                            "type": "array",
                            "items": {"type": "string"},
                        },
                    },
                    "required": ["test_paths"],
                },
            },
            {
                "name": "generate_patch",
                "description": "Generate a unified diff patch from the current changes.",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "files_changed": {
                            "type": "array",
                            "items": {"type": "string"},
                            "description": "List of files that were modified",
                        },
                    },
                    "required": ["files_changed"],
                },
            },
        ]

    def solve_issue(
        self,
        instance_id: str,
        problem_statement: str,
        repo_path: str,
    ) -> Optional[str]:
        """
        Solve a SWE-bench issue.
        Returns a unified diff patch string, or None if failed.
        """
        system = """You are an expert software engineer solving GitHub issues.

Your workflow:
1. Read and understand the problem statement carefully
2. Navigate the codebase to find the relevant code
3. Make targeted, minimal changes to fix the issue
4. Run the tests to verify your fix
5. Generate a clean patch

Key principles:
- Make the minimal change that fixes the issue
- Do not change tests unless the issue explicitly asks you to add/modify tests
- Do not add unnecessary features or refactoring
- Ensure your changes don't break existing functionality
- When ready, call generate_patch with the list of modified files
"""

        messages = [{
            "role": "user",
            "content": f"Fix this GitHub issue in the repository at {repo_path}:\n\n{problem_statement}"
        }]

        for step in range(self.max_steps):
            response = client.messages.create(
                model="claude-opus-4-6",
                max_tokens=8096,
                system=system,
                tools=self.tools,
                messages=messages,
            )

            if response.stop_reason != "tool_use":
                # end_turn, max_tokens, etc.: the model stopped without calling
                # generate_patch, so retrying with the same messages cannot help
                return None

            if response.stop_reason == "tool_use":
                messages.append({"role": "assistant", "content": response.content})
                tool_results = []

                for block in response.content:
                    if block.type == "tool_use":
                        result = self._execute_tool(
                            block.name, block.input, repo_path
                        )

                        if block.name == "generate_patch" and result.startswith("diff"):
                            # Agent is done - return the patch
                            return result

                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": result,
                        })

                messages.append({"role": "user", "content": tool_results})

        return None

    def _execute_tool(
        self, name: str, tool_input: dict, repo_path: str
    ) -> str:
        """Execute a tool against the actual repository."""

        if name == "read_file":
            path = os.path.join(repo_path, tool_input["path"])
            try:
                with open(path, "r", encoding="utf-8", errors="replace") as f:
                    content = f.read()
                # Truncate very large files
                if len(content) > 50_000:
                    content = content[:50_000] + "\n... (truncated)"
                return content
            except FileNotFoundError:
                return f"File not found: {tool_input['path']}"

        elif name == "list_directory":
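            # Keep the path relative: find runs with cwd=repo_path below, so
            # joining repo_path here would double it for relative repo paths
            path = tool_input.get("path", ".")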
            try:
                result = subprocess.run(
                    ["find", path, "-maxdepth", "2", "-type", "f", "-name", "*.py"],
                    capture_output=True, text=True, cwd=repo_path
                )
                return result.stdout[:5000]
            except Exception as e:
                return f"Error: {e}"

        elif name == "search_code":
            pattern = tool_input["pattern"]
            file_pattern = tool_input.get("file_pattern", "*.py")
            try:
                result = subprocess.run(
                    # -e keeps patterns that start with "-" from being read as options
                    ["grep", "-r", "--include", file_pattern, "-n", "-e", pattern, "."],
                    capture_output=True, text=True, cwd=repo_path
                )
                return result.stdout[:5000] or "No matches found"
            except Exception as e:
                return f"Error: {e}"

        elif name == "edit_file":
            path = os.path.join(repo_path, tool_input["path"])
            old_content = tool_input["old_content"]
            new_content = tool_input["new_content"]
            try:
                with open(path, "r", encoding="utf-8") as f:
                    file_content = f.read()
                if old_content not in file_content:
                    return f"Error: exact content not found in {tool_input['path']}"
                file_content = file_content.replace(old_content, new_content, 1)
                with open(path, "w", encoding="utf-8") as f:
                    f.write(file_content)
                return f"Successfully edited {tool_input['path']}"
            except Exception as e:
                return f"Error: {e}"

        elif name == "run_tests":
            test_paths = tool_input.get("test_paths", [])
            try:
                result = subprocess.run(
                    ["python", "-m", "pytest", "--tb=short", "-x"] + test_paths,
                    capture_output=True, text=True, cwd=repo_path, timeout=120
                )
                output = result.stdout + result.stderr
                return output[-3000:] if len(output) > 3000 else output
            except subprocess.TimeoutExpired:
                return "Tests timed out after 120 seconds"
            except Exception as e:
                return f"Error running tests: {e}"

        elif name == "generate_patch":
            files_changed = tool_input.get("files_changed", [])
            try:
                result = subprocess.run(
                    # "--" separates paths from revisions/options in git diff
                    ["git", "diff", "--"] + files_changed,
                    capture_output=True, text=True, cwd=repo_path
                )
                return result.stdout or "No changes detected"
            except Exception as e:
                return f"Error generating patch: {e}"

        return f"Unknown tool: {name}"


# ── Interpreting SWE-bench Scores ──────────────────────────────────────────────

def interpret_swebench_score(score: float) -> str:
    """
    Interpretation guide for SWE-bench Verified scores.
    These are approximate thresholds as of 2025.
    """
    if score < 0.05:
        return ("Score < 5%: Minimal capability. Agent can occasionally fix "
                "trivial issues but lacks reliable codebase navigation.")
    elif score < 0.20:
        return ("Score 5-20%: Emerging capability. Can fix simple, well-localized "
                "issues but struggles with multi-file changes.")
    elif score < 0.35:
        return ("Score 20-35%: Moderate capability. Competitive with early "
                "production coding agents. Useful for simple bug fixes.")
    elif score < 0.50:
        return ("Score 35-50%: Strong capability. Competitive with state-of-the-art "
                "as of 2024. Useful for real-world coding tasks with oversight.")
    elif score < 0.65:
        return ("Score 50-65%: Leading capability as of 2025. Handles complex "
                "multi-file changes, good error recovery.")
    else:
        return ("Score 65%+: Human-competitive on many issue types. "
                "Approaches professional software engineer performance.")


def benchmark_to_production_gap(score: float) -> dict:
    """
    SWE-bench to production performance conversion.
    Benchmark performance does not directly translate to production.
    """
    return {
        "benchmark_score": score,
        "production_estimate": score * 0.6,  # Production is typically ~40% harder
        "reasons_for_gap": [
            "Production issues often have ambiguous requirements",
            "Production codebases are larger and less clean",
            "Production issues may require understanding business context",
            "No gold test suite exists - agent must identify what to test",
            "Multi-language codebases add complexity",
            "Real issues often require discussion with stakeholders",
        ],
        "interpretation": (
            f"A {score:.0%} SWE-bench score suggests roughly "
            f"{score * 0.6:.0%} real-world issue resolution capability "
            f"without human oversight."
        ),
    }


def demo():
    score = 0.50
    print(interpret_swebench_score(score))
    print()
    gap = benchmark_to_production_gap(score)
    print(gap["interpretation"])
    print("\nReasons for benchmark-to-production gap:")
    for reason in gap["reasons_for_gap"]:
        print(f"  - {reason}")


if __name__ == "__main__":
    demo()

The Benchmark-to-Production Gap

A 50% SWE-bench score does not mean your agent can fix 50% of real-world issues autonomously. The gap is real and significant:

In production, there are no tests to run. SWE-bench provides a test suite. In production, the agent must decide what to test. This is often harder than the fix itself.

Production issues are ambiguous. SWE-bench issue reports are clear enough to be verified. Real user-filed issues are often vague, poorly reproduced, or request unclear changes.

Production codebases are larger. SWE-bench repositories are large but bounded. Production codebases can be millions of lines across dozens of services.

Production issues require business context. "Fix this behavior" sometimes means "fix it in a way that doesn't break our API contract with third-party integrators, and doesn't change the UI behavior users depend on." That context does not appear in the code.

A rough rule of thumb: SWE-bench score × 0.6 approximates real-world unassisted issue resolution rate. A 50% SWE-bench agent resolves roughly 30% of production issues autonomously, with the rest requiring human guidance, review, or escalation.

:::danger Do Not Conflate Benchmark Score With Production Quality A 50% SWE-bench Verified score is impressive and represents real capability. It does not mean you can deploy this agent autonomously to fix production issues without human review. Plan for: human review of all generated patches before merging, a testing pipeline that runs the agent's patches through your full CI suite, and escalation paths for issues the agent cannot solve. SWE-bench measures capability; production use requires operational discipline on top. :::

:::warning Contamination Is a Real Concern for Pre-2024 Issues SWE-bench issues were all created before the benchmark's release in late 2023, so the issues and their fixes may appear in the training data of many models. When a model reports a very high score, check how its training cutoff compares with the dates of the issues it was evaluated on. The most convincing scores are on issues created after the model's training cutoff. :::

Interview Q&A

Q: What is the difference between SWE-bench and SWE-bench Verified?

A: SWE-bench (2023) contained 2,294 issues from GitHub, but a large share had quality problems - underspecified problem statements, or tests that could accept wrong patches or reject valid ones. For SWE-bench Verified (2024), professional software engineers manually reviewed issues from the original dataset, confirming that the problem statement is clear, the gold patch correctly resolves it, and the tests actually validate the fix. The released set contains 500 verified issues. The result is a smaller but far more reliable benchmark where scores genuinely reflect coding agent capability. Scores on Verified are typically higher for the same system, because tasks that could not be fairly solved are gone, and they are more honest.

Q: Walk me through exactly how the SWE-bench evaluation harness works.

A: For each task instance, the harness spins up a Docker container with the repository at the pre-fix commit and all dependencies installed. The agent's submitted patch is applied using git apply. If the patch fails to apply, the task is marked as unresolved. If it applies successfully, the instance's tests are run with pytest or the repository's own test runner. Two sets of tests are checked: FAIL_TO_PASS (tests that were failing before the fix - must now pass), and PASS_TO_PASS (tests that were passing before - must still pass). If all FAIL_TO_PASS tests now pass AND all PASS_TO_PASS tests still pass, the task is resolved. This is done in complete isolation for reproducibility and safety.

Q: What is the primary failure mode for coding agents on SWE-bench?

A: The most common failure mode (approximately 35% of failures) is localization failure - the agent cannot correctly identify which files and functions need to change. The agent reads broadly but fails to pinpoint the relevant code. This is followed by scope misunderstanding (25%) where the agent understands the symptom but not the root cause, making changes that address the surface issue but break PASS_TO_PASS tests or miss the deeper problem. Interestingly, pure implementation error (the agent finds the right code but implements the fix wrong) accounts for only 20% of failures - suggesting that navigation and comprehension are the harder challenges for current systems.

Q: If a system scores 50% on SWE-bench Verified, what should you expect in production?

A: Not 50% autonomous issue resolution. The benchmark-to-production gap is significant, for three reasons. SWE-bench provides a clear test suite to validate fixes - in production, the agent must work out what to test. SWE-bench issues are well-written and reproducible - real user issues are often ambiguous. SWE-bench repositories are bounded and known - production codebases are often larger and more complex. As a rough rule of thumb (a heuristic, not a measured constant), multiply the benchmark score by 0.6 to approximate the autonomous resolution rate in production: a 50% SWE-bench agent resolves roughly 30% of real issues without human guidance. The agent remains very useful: as a first pass, with human review of generated patches, it can substantially accelerate engineering work.
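
The arithmetic behind that estimate, as a short sketch. The 0.6 factor is the rule of thumb from the answer, not a measured constant, and `estimate_production_rate` is a hypothetical helper name.

```python
def estimate_production_rate(benchmark_score: float, discount: float = 0.6) -> float:
    """Scale a benchmark resolve rate (a fraction in 0.0-1.0) by a rule-of-thumb discount."""
    if not 0.0 <= benchmark_score <= 1.0:
        raise ValueError("benchmark_score must be a fraction between 0 and 1")
    return benchmark_score * discount


print(f"{estimate_production_rate(0.50):.0%}")  # 30%
```

Once an internal benchmark exists, the discount is better replaced with the ratio actually observed between internal and SWE-bench scores.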

Q: How would you use SWE-bench in an iterative development workflow for a coding agent?

A: I would use a three-tier approach. For daily iteration (running on every significant prompt or system change), I would use SWE-bench Lite - 300 tasks, evaluates in roughly 3 hours, catches obvious regressions. For weekly evaluation (before feature merges), I would use the full SWE-bench Verified - 500 tasks, evaluates in roughly 8 hours, more reliable signal. For pre-release validation, I would additionally run an internal benchmark of production issues we have encountered, which tests the real-world distribution. I would track all three scores against baselines, alert on any 5%+ drop, and use the failure analysis (localization failures vs. scope errors vs. implementation errors) to guide where to invest improvement effort.
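
The alerting rule can be sketched as a comparison of each tier's latest score against its stored baseline. Everything named here (`find_regressions`, the tier labels, the sample scores) is hypothetical, and the sketch treats the alert threshold as five percentage points; use a relative threshold instead if that is what your team intends.

```python
def find_regressions(
    baselines: dict[str, float],
    current: dict[str, float],
    max_drop_points: float = 5.0,
) -> dict[str, float]:
    """Return {tier: drop} for tiers whose score fell by max_drop_points or more.

    Scores are resolve rates in percent (0-100). Tiers missing from
    `current` are skipped rather than treated as a drop to zero.
    """
    regressions = {}
    for tier, baseline in baselines.items():
        if tier not in current:
            continue
        drop = baseline - current[tier]
        if drop >= max_drop_points:
            regressions[tier] = drop
    return regressions


# Illustrative numbers only, not real results.
baselines = {"lite": 42.0, "verified": 48.0, "internal": 31.0}
current = {"lite": 41.0, "verified": 42.5, "internal": 30.0}
print(find_regressions(baselines, current))  # {'verified': 5.5}
```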

What SWE-bench Cannot Measure

SWE-bench Verified is the best coding agent benchmark available, but understanding its limits is as important as understanding what it measures.

What Is Not Tested

Code review quality. SWE-bench tests whether an agent can produce a fix. It does not test whether an agent can review someone else's code, identify subtle logic errors, or provide constructive feedback on a pull request.

Greenfield development. All SWE-bench tasks involve modifying an existing codebase to resolve an issue, most often a bug. Building new projects from scratch, designing APIs, or architecting new systems is not represented.

Multi-file refactoring. The typical SWE-bench patch touches 1-3 files and makes surgical changes. Large-scale refactoring across many files - a common real engineering task - is not evaluated.

Documentation and communication. Writing clear commit messages, updating documentation, explaining changes in plain language - all important software engineering skills that SWE-bench ignores.

Performance optimization. Making code faster, reducing memory usage, improving algorithmic complexity - none of this is in SWE-bench.

Security vulnerability identification. Finding and fixing security bugs requires different reasoning than fixing functional bugs.

The Implications

A coding agent that scores 50% on SWE-bench Verified has demonstrated genuine capability at one important thing: fixing bugs in Python codebases when given a well-described issue. If you are evaluating coding agents for your team, supplement that score with domain-specific evaluation targeting the actual tasks the agent will perform in your engineering workflow.

Building a SWE-bench-Style Internal Benchmark

The most valuable long-term investment for teams building coding agents: construct an internal benchmark from your own production issues.

Collection Protocol

from dataclasses import dataclass
from typing import Optional
import json


@dataclass
class InternalBenchmarkIssue:
    """
    Schema for an internal SWE-bench-style issue.
    Collected from your team's actual bug fixes.
    """
    issue_id: str
    repo_name: str
    repo_commit_before_fix: str  # git commit hash before the fix
    repo_commit_after_fix: str   # git commit hash with the fix applied
    issue_title: str
    issue_description: str       # As written by the original reporter
    gold_patch: str              # git diff of the fix
    test_files_modified: list[str]  # Tests that verify the fix
    difficulty: str              # "easy" | "medium" | "hard"
    fix_category: str            # "bug" | "edge_case" | "performance" | "security"
    lines_changed: int
    files_changed: int
    date_added: str
    added_reason: str            # Why this issue was added to the benchmark


def collect_internal_issues(
    repo_path: str,
    start_date: str,
    end_date: str,
    min_test_coverage: bool = True,
) -> list[InternalBenchmarkIssue]:
    """
    Collect benchmark candidates from your repository's commit history.

    Criteria for inclusion:
    1. PR has associated issue description
    2. PR includes at least one new test (min_test_coverage)
    3. New tests fail before fix and pass after
    4. Fix is in a defined date range
    """
    # Implementation: parse git log for PRs in date range,
    # filter for those with issue descriptions and new tests,
    # verify FAIL_TO_PASS property of the new tests.
    #
    # In practice, use your GitHub/GitLab API to pull PRs,
    # then git show to get diffs and test changes.
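    # Left unimplemented on purpose: raising makes a missing implementation
    # obvious, whereas `pass` would silently return None instead of a list.
    raise NotImplementedError("Connect this to your Git hosting API")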


def verify_issue_quality(
    issue: InternalBenchmarkIssue,
    repo_path: str,
) -> dict:
    """
    Verify that an issue is suitable for the benchmark.
    Returns quality assessment with pass/fail and reasons.
    """
    checks = {
        "has_gold_patch": bool(issue.gold_patch),
        "has_test_files": len(issue.test_files_modified) > 0,
        "description_length_ok": len(issue.issue_description) > 50,
        "patch_not_trivial": issue.lines_changed > 2,
    }

    # In production: also verify FAIL_TO_PASS by running tests at both commits
    all_pass = all(checks.values())

    return {
        "approved": all_pass,
        "checks": checks,
        "reason": "All checks passed" if all_pass else (
            f"Failed checks: {[k for k, v in checks.items() if not v]}"
        ),
    }
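
The FAIL_TO_PASS verification mentioned in the comment above can be sketched as follows. This is one possible shape under stated assumptions, not a definitive implementation: `check_fail_to_pass` and `run_tests_at` are hypothetical helpers, the repository is assumed to use pytest, and the checkout is assumed to be a disposable clone because the code force-switches commits in place. The check is also coarser than SWE-bench's per-test bookkeeping: it only asks whether the listed test files fail as a whole before the fix and pass as a whole after it.

```python
import subprocess
import sys
from typing import Callable

# (code_commit, tests_commit, test_files) -> True if the tests passed
Runner = Callable[[str, str, list[str]], bool]


def run_tests_at(
    repo_path: str, code_commit: str, tests_commit: str, test_files: list[str]
) -> bool:
    """Run test_files (taken from tests_commit) against the code at code_commit."""
    # Switch the working tree to the code under test, discarding local changes.
    subprocess.run(
        ["git", "checkout", "--quiet", "--force", code_commit],
        cwd=repo_path, check=True,
    )
    # Overlay the test files from the fixed commit; new tests do not exist earlier.
    subprocess.run(
        ["git", "checkout", "--quiet", tests_commit, "--", *test_files],
        cwd=repo_path, check=True,
    )
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "--quiet", *test_files], cwd=repo_path
    )
    return result.returncode == 0  # pytest exits with 0 only when all tests pass


def check_fail_to_pass(
    before_commit: str, after_commit: str, test_files: list[str], run_tests: Runner
) -> dict:
    """Confirm the new tests fail on the old code and pass on the fixed code."""
    fails_before = not run_tests(before_commit, after_commit, test_files)
    passes_after = run_tests(after_commit, after_commit, test_files)
    return {
        "fails_before_fix": fails_before,
        "passes_after_fix": passes_after,
        "fail_to_pass": fails_before and passes_after,
    }
```

To run it against a real clone, bind the repository path first, for example `runner = functools.partial(run_tests_at, "/path/to/disposable/clone")`, then call `check_fail_to_pass(issue.repo_commit_before_fix, issue.repo_commit_after_fix, issue.test_files_modified, runner)`. Passing the runner in as an argument keeps the decision logic testable without touching git.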

Versioning and Maintenance

class InternalBenchmarkRegistry:
    """
    Registry for internal benchmark issues.
    Supports versioning so score comparisons are always against
    the same issue set.
    """

    def __init__(self, registry_path: str):
        self.registry_path = registry_path
        self._load()

    def add_issue(self, issue: InternalBenchmarkIssue):
        """Add a new issue to the benchmark."""
        self._issues[issue.issue_id] = issue
        self._save()

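    def get_version_snapshot(self, version: str) -> list[InternalBenchmarkIssue]:
        """
        Get the set of issues that were in the benchmark at a given version.
        Enables apples-to-apples score comparison across time.

        A version is identified by its cutoff date. Both `version` and
        `date_added` must be ISO 8601 dates (YYYY-MM-DD) so that plain
        string comparison orders them chronologically.
        """
        return [
            issue for issue in self._issues.values()
            if issue.date_added <= version
        ]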

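    def stratified_sample(
        self,
        n: int = 50,
        by: str = "difficulty",
        seed: Optional[int] = None,
    ) -> list[InternalBenchmarkIssue]:
        """
        Sample up to n issues with roughly balanced representation
        across the values of the `by` field (difficulty by default).
        Use this for fast CI evaluation; pass a fixed seed so that
        successive runs score the same subset.
        """
        import random
        from collections import defaultdict

        if not self._issues:
            return []  # Empty registry: avoids dividing by zero groups below

        rng = random.Random(seed)  # Local RNG; global random state is untouched

        groups = defaultdict(list)
        for issue in self._issues.values():
            key = getattr(issue, by, "unknown")
            groups[key].append(issue)

        # Equal quota per group; small groups contribute everything they have
        n_per_group = max(1, n // len(groups))
        sample = []
        for group_issues in groups.values():
            sample.extend(rng.sample(group_issues, min(n_per_group, len(group_issues))))

        rng.shuffle(sample)
        return sample[:n]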

    def _load(self):
        try:
            with open(self.registry_path) as f:
                data = json.load(f)
            self._issues = {k: InternalBenchmarkIssue(**v) for k, v in data.items()}
        except FileNotFoundError:
            self._issues = {}

    def _save(self):
        with open(self.registry_path, "w") as f:
            json.dump(
                {k: v.__dict__ for k, v in self._issues.items()},
                f, indent=2
            )

The SWE-bench Evaluation Stack in Practice

The three-tier stack provides complementary coverage: SWE-bench Lite gives fast signal for rapid iteration, SWE-bench Verified provides the reliable cross-system comparable score, and the internal benchmark ensures your specific domain is covered. Together, they close most of the gap between benchmark performance and production quality.
