SWE-bench and Evaluation

Imagine you have built a coding agent. It seems to work well on the examples you tried. Your colleague's agent also seems to work well on different examples. How do you know which one is better? How do you know if either one is actually good?

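This is not a rhetorical question. Until 2023, there was no rigorous answer for agents working on real repositories. Function-level benchmarks such as HumanEval measured whether a model could write a short, self-contained function, not whether it could fix a bug in a large codebase. Coding agents were evaluated informally - "here's a demo where it fixed this bug" - and claims were almost impossible to compare across systems.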

SWE-bench changed that. Published in October 2023 by researchers at Princeton University and the University of Chicago, SWE-bench created a standardized, reproducible evaluation framework for software engineering agents. For the first time, you could compare Devin to Claude to GPT-4 on the same 2,294 real-world tasks with a single number.

Understanding SWE-bench is not just academic. It tells you what current agents can and cannot do, where they fail, and how to think about building better evaluation for your own use case.

Why Evaluation Matters More Than You Think

When Cognition AI announced Devin in March 2024, the headline was "first AI software engineer." The number was 13.86% on SWE-bench.

That number drew immediate scrutiny from the research community. It was measured on a random 25% subset of the benchmark rather than all 2,294 tasks, and it was set against earlier baselines (1.96% unassisted, 4.80% when the model was handed the exact files to edit) that had been produced under different conditions. Most of the debate came down to evaluation methodology: which instances were selected, whether the agent had internet access during evaluation, how Docker environments were configured.

This is not a story about one company exaggerating results. It is a story about evaluation methodology being as important as the agent itself. When evaluation is ambiguous, everyone measures something different and the numbers are not comparable.

SWE-bench's value is that it pins down the key variables: the exact instances, the exact test suites, the exact evaluation conditions. When Claude scores 57% and another system scores 40% on the same variant under the same conditions, you know those numbers are measuring the same thing.


The SWE-bench Paper: Origins and Design

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan published SWE-bench in October 2023.

Their core insight was elegant: GitHub pull requests are already labeled with the problem they solve. The PR description says "this fixes the bug where X happens when Y." The PR adds or modifies test cases that verify the fix. And the original failing tests define what "fixed" means.

This means you can automatically collect thousands of verified software engineering tasks from public repositories, without having to write any evaluation criteria manually.

Dataset Construction

The team scraped GitHub for merged pull requests from popular Python repositories that:

Added or modified test files
Had a clear issue reference or bug description
Could be replicated in a Docker container
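The three criteria above can be sketched as a simple filter. This is an illustration only, not the authors' actual pipeline: the `MergedPR` record and its field names are hypothetical, and the test-file heuristic is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class MergedPR:
    """Hypothetical record for a merged pull request (field names are assumptions)."""
    number: int
    changed_files: list[str]
    linked_issues: list[int] = field(default_factory=list)
    builds_in_container: bool = False  # would be set by a separate environment-build step


def touches_tests(pr: MergedPR) -> bool:
    """True if any changed path sits in a test directory or is named test_*."""
    for path in pr.changed_files:
        parts = path.lower().split("/")
        in_test_dir = any(p in ("test", "tests") for p in parts[:-1])
        if in_test_dir or parts[-1].startswith("test_"):
            return True
    return False


def is_candidate(pr: MergedPR) -> bool:
    """Apply all three criteria: tests changed, issue linked, environment reproducible."""
    return touches_tests(pr) and bool(pr.linked_issues) and pr.builds_in_container


prs = [
    MergedPR(1, ["src/a.py", "tests/test_a.py"], [10], True),
    MergedPR(2, ["src/a.py"], [11], True),        # no test changes
    MergedPR(3, ["tests/test_b.py"], [], True),   # no linked issue
]
candidates = [pr.number for pr in prs if is_candidate(pr)]
```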

From this process, they identified 12 repositories and 2,294 tasks:

Repository	Domain	Tasks
django/django	Web framework	850
scikit-learn/scikit-learn	Machine learning	229
matplotlib/matplotlib	Visualization	184
sympy/sympy	Symbolic math	386
astropy/astropy	Astronomy	95
pytest-dev/pytest	Testing framework	119
pylint-dev/pylint	Code analysis	57
pydata/xarray	Labeled arrays	110
sphinx-doc/sphinx	Documentation	187
psf/requests	HTTP library	44
pallets/flask	Web framework	11
mwaskom/seaborn	Statistical visualization	22

The Evaluation Protocol

For each task, the agent receives:

The repository state before the fix
The issue description (what the bug is)
(In some variants) The failing test

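The agent must modify the repository so that the held-out tests pass: the tests that failed before the fix must now pass, and the tests that already passed must keep passing. In the standard setting the agent does not see these tests - it must figure out what to change based on the issue description alone.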

This is exactly what a human developer does: read a bug report, find the relevant code, fix it, and verify it works.

SWE-bench Architecture

Binary Scoring

SWE-bench uses binary scoring: each task is either resolved (all tests pass) or unresolved (any test fails). There is no partial credit.

This is intentional. Real software engineering also has binary outcomes: the code works or it doesn't. A fix that almost passes tests is not a fix.
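As a minimal sketch, the binary rule reduces to a set check. The SWE-bench dataset lists the two groups of tests per instance as `FAIL_TO_PASS` and `PASS_TO_PASS`; the function name and the test IDs below are illustrative assumptions, not harness code.

```python
def is_resolved(
    fail_to_pass: list[str],
    pass_to_pass: list[str],
    passed_after_patch: list[str],
) -> bool:
    """Resolved only if every target test now passes and nothing regressed."""
    passed = set(passed_after_patch)
    return set(fail_to_pass) <= passed and set(pass_to_pass) <= passed


# The fix makes the target test pass but breaks a previously passing test,
# so the instance counts as unresolved: there is no partial credit.
outcome = is_resolved(
    fail_to_pass=["tests/test_api.py::test_empty_list"],
    pass_to_pass=["tests/test_api.py::test_basic", "tests/test_api.py::test_unicode"],
    passed_after_patch=["tests/test_api.py::test_empty_list", "tests/test_api.py::test_basic"],
)
```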

SWE-bench Variants

SWE-bench Lite

A 300-task subset of the full 2,294, filtered toward self-contained functional bug fixes rather than sampled purely at random. Used for faster evaluation during development - running the full benchmark requires building 2,294 Docker containers and can take hours even on fast hardware.

SWE-bench Verified

The most important variant for current evaluation. Released in August 2024 by OpenAI in collaboration with the SWE-bench authors.

The problem with the original SWE-bench: many of the 2,294 tasks have issues that make them unsuitable for evaluation. Some tasks are underspecified. Some require information not provided in the issue. Some have environment-specific failures that have nothing to do with the agent's code changes.

SWE-bench Verified addresses this with a set of 500 tasks that human software engineers manually reviewed. Each task was checked against questions such as:

Is the issue description complete and unambiguous?
Can this be solved from the information given?
Are the tests actually testing what the issue describes?

Only tasks that passed human review are included. The result is a cleaner, fairer subset that better reflects real-world software engineering difficulty.

When you see benchmark claims, always check: is this SWE-bench, SWE-bench Lite, or SWE-bench Verified? The numbers are not comparable across variants.

SWE-bench Multilingual (MultiSWE-bench)

Published in 2025 to address the Python-only limitation. Extends the benchmark to:

Java (from popular Spring and Apache projects)
TypeScript (from major Node.js libraries)
Go (from Go standard library tooling)
Rust (from core Rust crates)
C++ (from LLVM and other projects)

This matters because Python-specialized agents perform differently on statically typed languages. MultiSWE-bench reveals whether agents have learned general software engineering skills or just Python-specific patterns.

Current SOTA: What the Numbers Mean

System	Benchmark	Score	Date
Claude Sonnet (Anthropic)	SWE-bench Verified	~57%	2025
Claude Opus (Anthropic)	SWE-bench Verified	~49%	2024
GPT-4o (OpenAI)	SWE-bench Verified	~38%	2024
OpenHands (Community)	SWE-bench Verified	~35%	2025
SWE-agent (Princeton)	SWE-bench (full)	~12%	2024
Devin (Cognition)	SWE-bench (25% random subset)	13.86%	2024

What does 57% actually mean?

It means that given a real GitHub bug report, with no human assistance, Claude autonomously fixed the bug such that the original test suite passes - more than half the time.

To put this in human terms: imagine hiring a junior developer and giving them 100 bug reports. If they resolved 57 of them independently in under an hour each, you would be pleased with their performance. That is what this number represents.
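A score on 500 tasks is also an estimate with sampling error. The sketch below computes a Wilson score interval for a resolve rate; treating each task as an independent trial is a simplifying assumption, and 285 of 500 is simply 57% expressed as counts.

```python
import math


def wilson_interval(resolved: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = resolved / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(resolved=285, total=500)
print(f"57% on 500 tasks: 95% interval {low:.1%} to {high:.1%}")
```

For 285 of 500 the interval runs from roughly 53% to 61%, so a two-point gap between two systems on SWE-bench Verified is well within noise.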

The 43% it cannot solve is equally informative. That represents:

Tasks requiring complex multi-file architectural changes
Tasks where the issue description is ambiguous
Tasks in unfamiliar domains or codebases
Tasks requiring knowledge that is not in the code (external APIs, deployment environments)
Tasks where the fix requires non-obvious domain expertise

Failure Mode Analysis

Understanding where agents fail is as important as knowing their score. Analysis of SWE-bench trajectories reveals consistent patterns:

Failure Type 1: Incorrect Diagnosis

The agent reads the issue, finds the symptom location, and fixes the symptom rather than the cause. The test still fails because the root cause is elsewhere.

Example: Issue says "function returns None when called with empty list." Agent adds a null check at the call site. But the actual bug is in a utility function three calls deep that corrupts the return value.

Frequency: ~25% of failures.

Failure Type 2: Correct Fix, Wrong Location

The agent understands what needs to change but modifies the wrong file or wrong function. The logic change is correct; the location is wrong.

Example: Issue involves a Django model. Agent fixes the view function (correct logic) but the bug is actually in the model's save() method.

Frequency: ~15% of failures.

Failure Type 3: Context Exhaustion

The agent runs out of context window before completing the task. It fills context with irrelevant files, repeated tool calls, or verbose reasoning, and then cannot proceed.

Frequency: ~20% of failures on complex, multi-file tasks.

Failure Type 4: Test Suite Blindness

The agent makes a change that fixes the described behavior but breaks other tests it did not run. The final evaluation fails because unrelated tests regressed.

Frequency: ~10% of failures.

Failure Type 5: Edit Execution Errors

The agent's edit command fails - old_str not found due to whitespace differences, encoding issues, or the file being different than expected. The agent gives up or produces a corrupted file.

Frequency: ~8% of failures.
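One way to make an edit tool more forgiving is to fall back to a whitespace-tolerant match when the exact string is not found. The sketch below is a hypothetical `replace_once` helper, not the edit tool of any particular agent; it tolerates only trailing-whitespace differences and refuses ambiguous matches.

```python
def replace_once(source: str, old_str: str, new_str: str) -> str:
    """Replace exactly one occurrence of old_str, tolerating trailing whitespace."""
    count = source.count(old_str)
    if count == 1:
        return source.replace(old_str, new_str, 1)
    if count > 1:
        raise ValueError(f"old_str matches {count} locations; add more context")

    # Fallback: compare line by line, ignoring trailing whitespace on each line.
    src_lines = source.split("\n")
    old_lines = [line.rstrip() for line in old_str.split("\n")]
    n = len(old_lines)
    hits = [
        i
        for i in range(len(src_lines) - n + 1)
        if [line.rstrip() for line in src_lines[i : i + n]] == old_lines
    ]
    if len(hits) != 1:
        raise ValueError(f"old_str not found uniquely ({len(hits)} fuzzy matches)")
    i = hits[0]
    return "\n".join(src_lines[:i] + new_str.split("\n") + src_lines[i + n :])


# The file has trailing spaces after the colon, so an exact match would fail.
original = "def f(x):   \n    return x\n"
patched = replace_once(original, "def f(x):\n    return x", "def f(x):\n    return x + 1")
```

Raising on zero or multiple matches is deliberate: a clear error lets the agent retry with more context instead of silently editing the wrong place.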

Failure Type 6: Scope Creep

The agent decides to refactor, improve, or "clean up" code beyond what the task requires. These changes introduce unintended side effects.

Frequency: ~7% of failures.
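A cheap guard against scope creep is to measure the patch before submitting it. The sketch below counts files and changed lines in a unified diff; the thresholds are arbitrary assumptions to tune per codebase, and the line counting is a rough heuristic rather than a full diff parser.

```python
def files_touched(patch: str) -> list[str]:
    """Paths named in 'diff --git a/<path> b/<path>' headers."""
    return [
        line.split(" b/", 1)[-1]
        for line in patch.splitlines()
        if line.startswith("diff --git ")
    ]


def changed_line_count(patch: str) -> int:
    """Added plus removed lines, skipping the '+++' and '---' file headers."""
    return sum(
        1
        for line in patch.splitlines()
        if (line.startswith("+") and not line.startswith("+++"))
        or (line.startswith("-") and not line.startswith("---"))
    )


def exceeds_scope(patch: str, max_files: int = 3, max_lines: int = 80) -> bool:
    """Flag patches that look larger than a focused bug fix (thresholds are assumptions)."""
    return len(files_touched(patch)) > max_files or changed_line_count(patch) > max_lines


sample_patch = """diff --git a/app/models.py b/app/models.py
--- a/app/models.py
+++ b/app/models.py
@@ -1,2 +1,2 @@
-x = 1
+x = 2
 y = 3
"""
```

A flagged patch does not have to be rejected; it can simply trigger a second pass in which the agent is asked to justify or trim each change.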

Repository-Specific Performance

Performance varies dramatically across repositories. Analysis reveals:

Higher-performing repos (better scores):

requests - Clean, well-organized, simple bugs
flask - Compact codebase, good documentation
pytest - Well-defined behavior, excellent test suite

Lower-performing repos (worse scores):

sympy - Complex mathematics requiring domain knowledge
matplotlib - Visual output hard to verify programmatically
django - Large codebase, complex ORM interactions, subtle behavioral dependencies

The lesson: agent performance is heavily influenced by codebase properties, not just task difficulty.
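To see this breakdown for your own runs, group outcomes by repository. The sketch assumes SWE-bench-style instance IDs of the form `owner__repo-number`; the IDs and outcomes below are made up for illustration.

```python
from collections import defaultdict


def resolve_rate_by_repo(results: dict[str, bool]) -> dict[str, float]:
    """Group per-instance outcomes by repository and return each repo's resolve rate."""
    totals: dict[str, int] = defaultdict(int)
    solved: dict[str, int] = defaultdict(int)
    for instance_id, resolved in results.items():
        # "pytest-dev__pytest-5221" -> "pytest-dev__pytest"
        repo = instance_id.rsplit("-", 1)[0]
        totals[repo] += 1
        solved[repo] += int(resolved)
    return {repo: solved[repo] / totals[repo] for repo in totals}


rates = resolve_rate_by_repo(
    {
        "django__django-11099": True,
        "django__django-11133": False,
        "psf__requests-2317": True,
    }
)
```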

Setting Up SWE-bench Locally

Running SWE-bench requires Docker and the evaluation harness. Here is how to set it up:

# Clone the SWE-bench repository
git clone https://github.com/princeton-nlp/SWE-bench.git
cd SWE-bench

# Install dependencies
pip install -e .

# The evaluation harness is the swebench package
# It will build Docker containers for each task you want to evaluate

Writing a Simple Agent Wrapper

To evaluate your agent on SWE-bench, you need to produce prediction files - JSON files mapping instance IDs to your agent's patch output.

"""
swebench_agent_wrapper.py - Wrap a coding agent for SWE-bench evaluation.
"""

import json
from pathlib import Path
from typing import Optional
import subprocess
import tempfile
import os


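def get_git_diff(repo_path: str) -> str:
    """Get the agent's changes as a unified patch, including newly created files."""
    # Stage everything first: a plain `git diff` ignores untracked files,
    # so a fix that adds a new module would otherwise be lost.
    subprocess.run(["git", "add", "-A"], cwd=repo_path, check=True, capture_output=True)
    result = subprocess.run(
        ["git", "diff", "--cached", "--no-color"],
        cwd=repo_path,
        capture_output=True,
        text=True,
    )
    return result.stdout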


def run_agent_on_instance(
    instance_id: str,
    repo_path: str,
    issue_text: str,
    agent_fn,  # callable(repo_path, task) -> None (modifies repo in place)
) -> Optional[str]:
    """
    Run an agent on a single SWE-bench instance.

    Returns the git diff patch if the agent made changes, else None.
    """
    # Reset any existing changes
    subprocess.run(["git", "checkout", "."], cwd=repo_path, check=True)

    # Run the agent
    try:
        agent_fn(repo_path=repo_path, task=issue_text)
    except Exception as e:
        print(f"Agent failed on {instance_id}: {e}")
        return None

    # Collect the diff
    diff = get_git_diff(repo_path)
    if not diff.strip():
        print(f"Agent made no changes on {instance_id}")
        return None

    return diff


def evaluate_predictions(
    predictions_file: str,
    dataset_name: str = "princeton-nlp/SWE-bench_Verified",  # Hugging Face dataset ID
    split: str = "test",
) -> dict:
    """
    Run the SWE-bench evaluation harness on a predictions file.

    predictions_file: JSON file mapping instance_id -> {"model_patch": "..."}
    Returns: dict with the harness exit code and captured output. The harness
    writes its own logs and report to disk; this wrapper does not parse them.
    """
    result = subprocess.run(
        [
            "python", "-m", "swebench.harness.run_evaluation",
            "--dataset_name", dataset_name,
            "--split", split,
            "--predictions_path", predictions_file,
            "--max_workers", "4",
            "--run_id", "my_agent_eval",
        ],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    if result.returncode != 0:
        print("STDERR:", result.stderr)
    return {
        "returncode": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }


def build_predictions_file(
    instances: list[dict],
    agent_fn,
    output_path: str = "predictions.json",
) -> str:
    """
    Run an agent on a list of SWE-bench instances and write a predictions file.

    instances: list of dicts with keys: instance_id, repo, base_commit, problem_statement
    agent_fn: callable(repo_path, task) that modifies the repo in place
    """
    predictions = {}

    with tempfile.TemporaryDirectory() as tmpdir:
        for instance in instances:
            instance_id = instance["instance_id"]
            repo_url = f"https://github.com/{instance['repo']}.git"
            repo_path = os.path.join(tmpdir, instance_id)

            print(f"\nProcessing: {instance_id}")

            # Clone and checkout the correct base commit
            subprocess.run(
                ["git", "clone", repo_url, repo_path],
                check=True,
                capture_output=True,
            )
            subprocess.run(
                ["git", "checkout", instance["base_commit"]],
                cwd=repo_path,
                check=True,
                capture_output=True,
            )

            # Run the agent
            patch = run_agent_on_instance(
                instance_id=instance_id,
                repo_path=repo_path,
                issue_text=instance["problem_statement"],
                agent_fn=agent_fn,
            )

            predictions[instance_id] = {
                "instance_id": instance_id,  # the harness looks predictions up by this key
                "model_patch": patch or "",
                "model_name_or_path": "my-coding-agent",
            }

    # Write predictions
    with open(output_path, "w") as f:
        json.dump(predictions, f, indent=2)

    print(f"\nWrote {len(predictions)} predictions to {output_path}")
    return output_path
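Before spending hours on an evaluation run, it is worth sanity-checking the predictions file. The sketch below validates the mapping produced by the wrapper above; `validate_predictions` is a hypothetical helper, and the required keys are the ones the wrapper writes. Check the harness documentation for the authoritative schema.

```python
REQUIRED_KEYS = {"model_patch", "model_name_or_path"}


def validate_predictions(predictions: dict[str, dict]) -> list[str]:
    """Return human-readable problems; an empty list means the file looks sane."""
    problems = []
    for instance_id, entry in predictions.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"{instance_id}: missing keys {sorted(missing)}")
            continue
        # If the entry carries its own instance_id, it should agree with the mapping key.
        if entry.get("instance_id", instance_id) != instance_id:
            problems.append(f"{instance_id}: instance_id field does not match key")
        if not entry["model_patch"].strip():
            problems.append(f"{instance_id}: empty patch (will count as unresolved)")
    return problems


problems = validate_predictions(
    {
        "demo__repo-1": {"model_patch": "diff --git a/x b/x\n", "model_name_or_path": "my-coding-agent"},
        "demo__repo-2": {"model_patch": "", "model_name_or_path": "my-coding-agent"},
        "demo__repo-3": {"model_patch": "diff --git a/y b/y\n"},
    }
)
```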

Building Domain-Specific Evaluation Inspired by SWE-bench

SWE-bench evaluates on public Python repos. Your production use case probably involves private code in different languages. Here is how to build a SWE-bench-inspired evaluation for your own codebase:

"""
custom_eval_harness.py - Build a SWE-bench-style eval for your own codebase.
"""

import json
import subprocess
import tempfile
import os
from pathlib import Path
from dataclasses import dataclass, asdict
from typing import Optional
import datetime


@dataclass
class EvalInstance:
    """A single evaluation task."""
    instance_id: str
    repo_path: str             # local path to repo
    base_commit: str           # git hash of the pre-fix state
    fixed_commit: str          # git hash of the fix (for reference)
    issue_description: str     # what the bug is
    test_command: str          # e.g., "pytest tests/test_users.py -v"
    expected_tests_pass: list[str]  # test IDs that must pass


@dataclass
class EvalResult:
    """Result of running an agent on an instance."""
    instance_id: str
    resolved: bool
    patch: str
    tests_passed: list[str]
    tests_failed: list[str]
    agent_steps: int
    error: Optional[str]
    timestamp: str


def collect_instances_from_git_history(
    repo_path: str,
    lookback_commits: int = 100,
) -> list[EvalInstance]:
    """
    Automatically collect evaluation instances from git history.
    Looks for commits that:
    - Reference an issue number (#NNN)
    - Added or modified test files
    """
    # Get recent commits
    result = subprocess.run(
        ["git", "log", f"-{lookback_commits}", "--oneline", "--no-decorate"],
        cwd=repo_path,
        capture_output=True,
        text=True,
    )

    instances = []
    lines = result.stdout.strip().split("\n")

    for line in lines:
        if not line.strip():
            continue
        parts = line.split(" ", 1)
        if len(parts) < 2:
            continue
        commit_hash, message = parts

        # Keep commits that reference an issue (#NNN) or mention a fix or bug
        if "#" not in message and "fix" not in message.lower() and "bug" not in message.lower():
            continue

        # Check if this commit modified test files
        files_result = subprocess.run(
            ["git", "diff-tree", "--no-commit-id", "-r", "--name-only", commit_hash],
            cwd=repo_path,
            capture_output=True,
            text=True,
        )
        changed_files = files_result.stdout.strip().split("\n")
        test_files = [f for f in changed_files if "test" in f.lower()]

        if not test_files:
            continue

        # Get the parent commit (pre-fix state)
        parent_result = subprocess.run(
            ["git", "rev-parse", f"{commit_hash}^"],
            cwd=repo_path,
            capture_output=True,
            text=True,
        )
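        if parent_result.returncode != 0:
            continue  # root commit has no parent, so there is no pre-fix state
        base_commit = parent_result.stdout.strip()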

        # Get commit message body as issue description
        body_result = subprocess.run(
            ["git", "log", "-1", "--format=%B", commit_hash],
            cwd=repo_path,
            capture_output=True,
            text=True,
        )

        instance = EvalInstance(
            instance_id=f"{os.path.basename(repo_path)}-{commit_hash[:8]}",
            repo_path=repo_path,
            base_commit=base_commit,
            fixed_commit=commit_hash,
            issue_description=body_result.stdout.strip(),
            test_command="pytest -v",  # -v prints the per-test PASSED/FAILED lines that parse_pytest_output reads
            expected_tests_pass=[],  # will be populated by examining the fixed commit
        )
        instances.append(instance)

    return instances


def run_evaluation(
    instance: EvalInstance,
    agent_fn,
    verbose: bool = True,
) -> EvalResult:
    """Run an agent on a single eval instance and return the result."""
    timestamp = datetime.datetime.now().isoformat()

    with tempfile.TemporaryDirectory() as tmpdir:
        # Clone repo and checkout base commit
        repo_copy = os.path.join(tmpdir, "repo")
        subprocess.run(
            ["git", "clone", instance.repo_path, repo_copy],
            check=True,
            capture_output=True,
        )
        subprocess.run(
            ["git", "checkout", instance.base_commit],
            cwd=repo_copy,
            check=True,
            capture_output=True,
        )

        # Run the agent
        steps = 0
        error = None
        try:
            # agent_fn may return a step count or None; normalize None to 0
            steps = agent_fn(
                repo_path=repo_copy,
                task=instance.issue_description,
            ) or 0
        except Exception as e:
            error = str(e)
            if verbose:
                print(f"Agent error: {e}")

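        # Get the patch. Stage everything first so files the agent created
        # are included; a plain `git diff` only covers tracked files.
        subprocess.run(["git", "add", "-A"], cwd=repo_copy, check=True, capture_output=True)
        patch_result = subprocess.run(
            ["git", "diff", "--cached", "--no-color"],
            cwd=repo_copy,
            capture_output=True,
            text=True,
        )
        patch = patch_result.stdout

        # Bring in the held-out tests from the fixed commit. Without this step
        # only the pre-fix tests would run and the fix would never be checked.
        tests_result = subprocess.run(
            ["git", "diff", "--name-only", "--diff-filter=AM",
             instance.base_commit, instance.fixed_commit],
            cwd=repo_copy,
            capture_output=True,
            text=True,
        )
        test_files = [f for f in tests_result.stdout.splitlines() if "test" in f.lower()]
        if test_files:
            subprocess.run(
                ["git", "checkout", instance.fixed_commit, "--", *test_files],
                cwd=repo_copy,
                check=True,
                capture_output=True,
            )

        # Run the test suite; a run that hangs counts as unresolved
        try:
            test_result = subprocess.run(
                instance.test_command.split(),
                cwd=repo_copy,
                capture_output=True,
                text=True,
                timeout=300,
            )
            test_output = test_result.stdout + test_result.stderr
        except subprocess.TimeoutExpired:
            test_output = ""
            error = error or "test command timed out"

        # Parse test results
        passed, failed = parse_pytest_output(test_output)
        resolved = len(failed) == 0 and len(passed) > 0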

        if verbose:
            status = "RESOLVED" if resolved else "UNRESOLVED"
            print(f"  {instance.instance_id}: {status} ({len(passed)} passed, {len(failed)} failed)")

        return EvalResult(
            instance_id=instance.instance_id,
            resolved=resolved,
            patch=patch,
            tests_passed=passed,
            tests_failed=failed,
            agent_steps=steps,
            error=error,
            timestamp=timestamp,
        )


def parse_pytest_output(output: str) -> tuple[list[str], list[str]]:
    """Parse pytest output to extract passed and failed test IDs."""
    passed = []
    failed = []

    for line in output.split("\n"):
        line = line.strip()
        if " PASSED" in line:
            test_id = line.split(" PASSED")[0].strip()
            passed.append(test_id)
        elif " FAILED" in line:
            test_id = line.split(" FAILED")[0].strip()
            failed.append(test_id)
        elif " ERROR" in line and "::" in line:
            test_id = line.split(" ERROR")[0].strip()
            failed.append(test_id)

    return passed, failed


def run_benchmark(
    instances: list[EvalInstance],
    agent_fn,
    output_file: str = "eval_results.json",
) -> dict:
    """Run a full benchmark evaluation."""
    results = []

    for i, instance in enumerate(instances):
        print(f"\n[{i+1}/{len(instances)}] {instance.instance_id}")
        result = run_evaluation(instance, agent_fn)
        results.append(asdict(result))

    # Compute summary stats
    total = len(results)
    resolved = sum(1 for r in results if r["resolved"])
    resolution_rate = resolved / total if total > 0 else 0

    summary = {
        "total": total,
        "resolved": resolved,
        "unresolved": total - resolved,
        "resolution_rate": resolution_rate,
        "results": results,
    }

    with open(output_file, "w") as f:
        json.dump(summary, f, indent=2)

    print(f"\n{'='*50}")
    print(f"Benchmark Results:")
    print(f"  Total: {total}")
    print(f"  Resolved: {resolved} ({resolution_rate:.1%})")
    print(f"  Results written to: {output_file}")

    return summary
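With two summaries in hand, the opening question (which agent is better?) becomes a per-instance comparison rather than a comparison of two headline rates. The sketch below assumes the summary structure written by `run_benchmark` above, a `results` list whose items carry `instance_id` and `resolved`; the sample data is invented.

```python
def compare_runs(summary_a: dict, summary_b: dict) -> dict:
    """Compare two benchmark summaries on the instances they have in common."""
    a = {r["instance_id"]: r["resolved"] for r in summary_a["results"]}
    b = {r["instance_id"]: r["resolved"] for r in summary_b["results"]}
    shared = sorted(a.keys() & b.keys())
    return {
        "shared": len(shared),
        "both": [i for i in shared if a[i] and b[i]],
        "only_a": [i for i in shared if a[i] and not b[i]],
        "only_b": [i for i in shared if b[i] and not a[i]],
        "neither": [i for i in shared if not a[i] and not b[i]],
    }


run_a = {"results": [
    {"instance_id": "repo-001", "resolved": True},
    {"instance_id": "repo-002", "resolved": True},
    {"instance_id": "repo-003", "resolved": False},
]}
run_b = {"results": [
    {"instance_id": "repo-001", "resolved": True},
    {"instance_id": "repo-002", "resolved": False},
    {"instance_id": "repo-003", "resolved": False},
]}
comparison = compare_runs(run_a, run_b)
```

The `only_a` and `only_b` lists are usually more informative than the difference in rates: they show which kinds of task each agent handles that the other does not.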

Overfitting Concerns and Benchmark Contamination

The Contamination Problem

Training data contamination is a legitimate concern with SWE-bench. If the training data for a model includes the GitHub PRs that SWE-bench tests are drawn from, the model may have seen the solutions during training.

SWE-bench Verified does not remove this risk: it is a human-filtered subset of the original tasks, so the underlying issues and PRs are just as public. And as models are continuously updated, contamination analysis becomes ongoing work.

Signs of potential contamination:

Performance on specific repos significantly exceeds the broader benchmark
The agent reproduces exact solutions that are not inferable from the issue description
Performance degrades sharply on tasks from repositories that are less represented in training data
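One rough probe is to split results by when each issue was created relative to the model's training cutoff and compare resolve rates. The sketch below assumes you know a creation date for each task and a cutoff date for the model; both dates and the outcomes in the example are hypothetical.

```python
from datetime import date
from typing import Optional


def rates_around_cutoff(
    records: list[tuple[date, bool]],
    cutoff: date,
) -> tuple[Optional[float], Optional[float]]:
    """Resolve rate for tasks created before vs. on or after the cutoff."""

    def rate(outcomes: list[bool]) -> Optional[float]:
        return sum(outcomes) / len(outcomes) if outcomes else None

    before = [resolved for created, resolved in records if created < cutoff]
    after = [resolved for created, resolved in records if created >= cutoff]
    return rate(before), rate(after)


before_rate, after_rate = rates_around_cutoff(
    [
        (date(2022, 3, 1), True),
        (date(2022, 9, 15), True),
        (date(2023, 6, 1), False),
        (date(2023, 8, 20), True),
    ],
    cutoff=date(2023, 1, 1),
)
```

A large drop after the cutoff is a hint, not proof: newer issues may also be harder or come from different repositories.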

Overfitting to the Benchmark

As SWE-bench becomes the standard, there is pressure to optimize specifically for it. This can manifest as:

System prompt engineering - prompts tuned specifically for the types of issues in SWE-bench repos
Repository-specific fine-tuning - fine-tuning on Django or scikit-learn issues
Evaluation harness awareness - agents that detect they are being evaluated and behave differently

This is why MultiSWE-bench (multiple languages, different repos) is valuable - it is harder to overfit to a broader, more diverse benchmark.

Verification of Benchmark Claims

When you read a benchmark claim:

Which variant? SWE-bench, Lite, or Verified are different benchmarks
What evaluation conditions? Internet access, time limit, model version
Human evaluation? Were results spot-checked by humans?
Independently reproduced? Can another team reproduce the result?
When? SOTA changes fast - a 6-month-old number may be outdated
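The checklist can be encoded so that two published numbers are never put side by side without checking their conditions. This is a sketch: `BenchmarkClaim` and its fields are hypothetical names, and the systems and scores in the example are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkClaim:
    system: str
    variant: str  # "full", "lite", or "verified"
    score: float  # fraction resolved, 0.0 to 1.0
    internet_access: bool
    independently_reproduced: bool


def comparability_issues(a: BenchmarkClaim, b: BenchmarkClaim) -> list[str]:
    """List reasons two claims should not be compared at face value."""
    issues = []
    if a.variant != b.variant:
        issues.append(f"different variants: {a.variant} vs {b.variant}")
    if a.internet_access != b.internet_access:
        issues.append("internet access differs")
    for claim in (a, b):
        if not claim.independently_reproduced:
            issues.append(f"{claim.system}: not independently reproduced")
    return issues


issues = comparability_issues(
    BenchmarkClaim("agent-a", "verified", 0.50, internet_access=False, independently_reproduced=True),
    BenchmarkClaim("agent-b", "full", 0.14, internet_access=True, independently_reproduced=False),
)
```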

:::tip Build your own eval For production systems, SWE-bench is a proxy. Build your own evaluation using real tasks from your actual codebase. Collect 20-50 historical bug fixes, write the evaluation harness, and measure your agent on those. That number will be far more predictive of real-world utility than a public benchmark score. :::

:::danger Don't optimize for SWE-bench alone A 60% SWE-bench score does not mean 60% of your team's tasks will be automated. SWE-bench covers Python, bug fixes, and specific repository types. Your codebase is different. Use SWE-bench as a relative comparison tool, not an absolute predictor. :::

Interview Q&A

Q: What is SWE-bench and why is it considered the gold standard for evaluating coding agents?

A: SWE-bench is a benchmark of 2,294 real GitHub issues from 12 popular Python repositories. Each task provides the pre-fix repository state and an issue description; the agent must modify the code so that the held-out test suite passes. It is considered authoritative because: (1) it uses real software engineering tasks, not artificially constructed puzzles; (2) test-based evaluation is objective and reproducible; (3) it is widely adopted, enabling direct comparison across systems; (4) the tasks are genuinely difficult - requiring code navigation, root cause analysis, and correct edits.

Q: What is the difference between SWE-bench, SWE-bench Lite, and SWE-bench Verified?

A: SWE-bench is the full 2,294-task benchmark. SWE-bench Lite is a 300-task subset, filtered toward self-contained bug fixes, for faster evaluation. SWE-bench Verified is a 500-task human-verified subset where each task was manually reviewed to ensure it is well-specified, solvable from the given information, and fairly testable. Verified is generally preferred for current evaluation because it removes ambiguous or flawed tasks that made the original benchmark noisy.

Q: If Claude achieves 57% on SWE-bench Verified, what does that mean practically?

A: It means Claude autonomously resolves 57% of real GitHub bug reports - finding the relevant code, understanding the bug, making the correct fix, and passing the test suite - with no human assistance. For context: these are real issues that required human software engineers to fix when they were originally encountered. 57% autonomous resolution represents a dramatic productivity multiplier for tasks of this type. The 43% it cannot solve tends to involve complex architectural changes, ambiguous specifications, or domain-specific knowledge not present in the codebase.

Q: How would you build a SWE-bench-style evaluation for a private codebase?

A: The key steps: (1) Collect historical bug fixes from git history - commits that reference issues and modified test files; (2) For each fix, record the pre-fix repository state (base commit) and the issue description; (3) Build a test runner that checks out the base commit, runs the agent, then executes the test suite; (4) Score binary: resolved if all tests pass, unresolved otherwise. The most important aspect is having a reliable, automated test suite. Without tests, you cannot objectively measure whether the fix is correct.

Q: What are the main ways coding agents fail on SWE-bench tasks?

A: The top failure modes are: incorrect diagnosis (~25%) - fixing the symptom rather than the root cause; context exhaustion (~20%) - filling the context window with irrelevant files before understanding the real problem; wrong location (~15%) - correct fix logic but applied to the wrong function or file; test blindness (~10%) - the fix passes the targeted test but breaks unrelated tests; edit execution errors (~8%) - the old_str not matching exactly due to whitespace or encoding differences; and scope creep (~7%) - unnecessary refactoring that introduces regressions.

Q: How should engineers interpret published benchmark numbers when evaluating coding agent products?

A: With significant skepticism and several questions: Which benchmark variant? (Verified vs full vs Lite are not comparable.) What conditions? (Internet access, time limit, model version all matter.) Is it independently verified, and on which instances? (Cognition's Devin figure of 13.86% was measured on a random 25% subset of SWE-bench, which makes it hard to set beside full-benchmark results.) How recent? (SOTA moves fast; six-month-old numbers may be significantly behind.) And most importantly: does the benchmark task distribution match your actual use case? A 57% SWE-bench score on Python bug fixes tells you little about how the agent will perform on your TypeScript monorepo.
