Test-Driven Agent Loops
There is a single technique that separates mediocre coding agents from excellent ones.
Not the model. Not the context size. Not the number of tools.
It is this: run tests after every edit and feed the output back into the next LLM call.
Tests are the ground truth for software. They are unambiguous. They do not have opinions. They do not care that the code looks elegant or that the variable name is descriptive. They pass or they fail, and when they fail, they tell you exactly what went wrong.
For an LLM, this is invaluable. An LLM cannot know whether its code is correct from first principles. It can reason about whether the code looks correct. But tests eliminate that ambiguity. The agent stops guessing and starts converging.
This is a large part of why agentic Claude setups have reported scores around 57% on SWE-bench while naive single-shot code generation lands in the single digits. The difference is the feedback loop.
Why Tests Are the Perfect Agent Feedback Signal
Consider the alternatives to tests as feedback:
LLM self-evaluation - "Does this code look right to me?" This is just the same LLM reasoning about its own output. It has the same blind spots. It will confidently evaluate wrong code as correct.
Static analysis - Linters and type checkers catch some errors. They cannot catch logic errors. "The function returns the wrong value" is invisible to mypy.
Human review - Accurate but defeats the purpose. A human reviewing every edit is not an autonomous agent.
Tests - Written independently (ideally before the code). Run deterministically. Catch logic errors, type errors, edge cases, and integration issues. Return structured output the LLM can parse and reason about.
Tests are not perfect - they miss things too. But they are the best automated feedback signal available, and they are already present in any well-maintained codebase.
The History: TDD + Agents
Test-Driven Development (TDD) was articulated by Kent Beck in his 2002 book "Test Driven Development: By Example," building on work from the Extreme Programming community. The cycle: write a failing test → write minimal code to pass it → refactor.
The insight that TDD maps perfectly onto agent loops came gradually as coding agents matured in 2023–2024. The SWE-bench benchmark was designed around it - each task is evaluated by whether tests pass. Agents that learned to run tests, parse output, and iterate converged to correct solutions far more reliably than agents that tried to reason their way to correct code in a single shot.
By 2025, nearly every serious coding agent - Devin, Claude Code, SWE-agent - was built around the test feedback loop as its core iteration mechanism.
The TDD Agent Loop in Detail
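Stripped of parsing and tooling, the loop is small. The sketch below is a minimal illustration, not the implementation used later in this lesson: `propose_edit`, `apply_edit`, and `run_tests` are hypothetical callables standing in for the LLM call, the file edit, and the test runner.

```python
from typing import Callable


def run_feedback_loop(
    propose_edit: Callable[[str], str],
    apply_edit: Callable[[str], None],
    run_tests: Callable[[], tuple[bool, str]],
    max_iterations: int = 5,
) -> tuple[bool, int]:
    """Edit, test, and feed the test output into the next proposal.

    Returns (tests_passed, edits_made).
    """
    passed, feedback = run_tests()  # baseline: see what fails before editing
    for edits_made in range(max_iterations):
        if passed:
            return True, edits_made
        edit = propose_edit(feedback)   # stand-in for the LLM call
        apply_edit(edit)
        passed, feedback = run_tests()  # ground truth for the next round
    return passed, max_iterations


# Toy stand-ins (assumptions for illustration): the "code" is a counter
# and the "test" passes once it reaches 3.
state = {"value": 0}


def toy_run_tests() -> tuple[bool, str]:
    ok = state["value"] == 3
    return ok, ("ok" if ok else f"expected 3, got {state['value']}")


def toy_propose_edit(feedback: str) -> str:
    return "increment"


def toy_apply_edit(edit: str) -> None:
    state["value"] += 1


outcome = run_feedback_loop(toy_propose_edit, toy_apply_edit, toy_run_tests)
```

The only state carried between rounds is the latest test output, which is exactly what the next proposal sees.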
Parsing Pytest Output
The agent must extract structured information from test output. Raw pytest output is noisy - progress bars, dots, header information, and verbose tracebacks all make it harder to extract the signal.
"""
test_output_parser.py - Parse pytest output into structured data.
"""
import re
from dataclasses import dataclass, field
from typing import Optional
from enum import Enum
class TestStatus(str, Enum):
    PASSED = "PASSED"
    FAILED = "FAILED"
    ERROR = "ERROR"
    SKIPPED = "SKIPPED"
    XFAILED = "XFAILED"  # expected failure
    XPASSED = "XPASSED"  # unexpected pass


@dataclass
class TestResult:
    test_id: str  # e.g., "tests/test_users.py::TestUsers::test_create_user"
    status: TestStatus
    duration_ms: Optional[float] = None
    failure_message: Optional[str] = None
    traceback: Optional[str] = None
    captured_stdout: Optional[str] = None


@dataclass
class ParsedTestOutput:
    raw_output: str
    tests: list[TestResult] = field(default_factory=list)
    passed: int = 0
    failed: int = 0
    errors: int = 0
    skipped: int = 0
    total: int = 0
    duration_seconds: Optional[float] = None
    summary_line: Optional[str] = None
    collection_errors: list[str] = field(default_factory=list)

    @property
    def success(self) -> bool:
        return self.failed == 0 and self.errors == 0

    @property
    def failed_tests(self) -> list[TestResult]:
        return [t for t in self.tests if t.status in (TestStatus.FAILED, TestStatus.ERROR)]

    def format_for_agent(self) -> str:
        """Format the results as a concise string for the LLM to reason about."""
        parts = []
        if self.collection_errors:
            parts.append("COLLECTION ERRORS (code cannot be imported):")
            for err in self.collection_errors[:3]:
                parts.append(f"  {err}")
        if self.summary_line:
            parts.append(f"SUMMARY: {self.summary_line}")
        if self.failed_tests:
            parts.append(f"\nFAILED TESTS ({len(self.failed_tests)}):")
            for test in self.failed_tests[:5]:  # Show up to 5 failures
                parts.append(f"\n--- {test.test_id} ---")
                if test.failure_message:
                    parts.append(test.failure_message[:500])
                if test.traceback:
                    parts.append("Traceback (last few lines):")
                    tb_lines = test.traceback.split("\n")
                    # Show last 10 lines of traceback
                    parts.append("\n".join(tb_lines[-10:]))
        if self.success and self.passed > 0:
            parts.append(f"\nAll {self.passed} tests passed.")
        return "\n".join(parts)
def parse_pytest_output(output: str) -> ParsedTestOutput:
    """
    Parse pytest output into structured data.
    Handles: -v (verbose), -q (quiet), --tb=short, --tb=long, --tb=no
    """
    result = ParsedTestOutput(raw_output=output)
    # pytest prints XFAIL / XPASS per test; map them onto the enum member names
    status_names = {"XFAIL": "XFAILED", "XPASS": "XPASSED"}
    # State for the failure-section parser
    current_test_id = None
    current_failure_lines = []
    in_failure_section = False

    def flush_failure_section():
        """Store the lines collected so far as the current test's traceback."""
        nonlocal current_test_id, current_failure_lines
        if current_test_id and current_failure_lines:
            _update_test_traceback(result.tests, current_test_id, "\n".join(current_failure_lines))
        current_test_id = None
        current_failure_lines = []

    for line in output.split("\n"):
        stripped = line.strip()
        # Collection errors - import failures
        if "ERROR collecting" in line or "ImportError" in line:
            result.collection_errors.append(stripped)
        # Verbose mode: "tests/test_x.py::test_name PASSED"
        verbose_match = re.match(
            r'^(.+?::.+?)\s+(PASSED|FAILED|ERROR|SKIPPED|XFAIL|XPASS)\b(\s+\[[\d.]+s\])?',
            stripped,
        )
        if verbose_match:
            status_word = verbose_match.group(2)
            status = TestStatus[status_names.get(status_word, status_word)]
            # Duration, if the line carries a "[0.12s]" suffix
            duration = None
            if verbose_match.group(3):
                dur_match = re.search(r'([\d.]+)s', verbose_match.group(3))
                if dur_match:
                    duration = float(dur_match.group(1)) * 1000
            result.tests.append(TestResult(
                test_id=verbose_match.group(1),
                status=status,
                duration_ms=duration,
            ))
            if status == TestStatus.PASSED:
                result.passed += 1
            elif status == TestStatus.FAILED:
                result.failed += 1
            elif status == TestStatus.ERROR:
                result.errors += 1
            elif status == TestStatus.SKIPPED:
                result.skipped += 1
            continue
        # Short summary entry: "FAILED tests/test_x.py::test_name - AssertionError: ..."
        failed_header = re.match(r'^(?:FAILED|ERROR) (.+?) - (.+)$', stripped)
        if failed_header:
            for test in result.tests:
                if test.test_id == failed_header.group(1):
                    test.failure_message = failed_header.group(2)
                    break
            continue
        # Any "=====" banner closes the failure section that is currently open
        if stripped.startswith("="):
            flush_failure_section()
            in_failure_section = False
        # Failure section separator like "_________ test_name _________"
        name_match = re.match(r'^_{3,} (.+?) _{3,}$', stripped)
        if name_match:
            flush_failure_section()
            # Separators use "Class.test"; node IDs use "Class::test".
            # Leave any "[param]" suffix untouched.
            name, bracket, params = name_match.group(1).partition("[")
            current_test_id = name.replace(".", "::") + bracket + params
            in_failure_section = True
            continue
        # Summary line: "3 passed, 1 failed in 0.25s"
        summary_match = re.search(
            r'(\d+\s+(?:passed|failed|errors?|skipped|deselected|xfailed|xpassed)|no tests ran)'
            r'.*\bin\s+([\d.]+)s',
            stripped,
            re.IGNORECASE,
        )
        if summary_match:
            flush_failure_section()
            in_failure_section = False
            result.summary_line = stripped
            # The summary counts are authoritative and also cover -q runs
            for word, attr in (("passed", "passed"), ("failed", "failed"),
                               ("error", "errors"), ("skipped", "skipped")):
                count = re.search(rf'(\d+)\s+{word}', stripped)
                if count:
                    setattr(result, attr, int(count.group(1)))
            result.duration_seconds = float(summary_match.group(2))
            continue
        if in_failure_section and current_test_id:
            current_failure_lines.append(line)
            # pytest prefixes error lines with "E"; keep the first one as the
            # failure message of the test whose section we are in (not of every test)
            if stripped.startswith("E ") or "AssertionError:" in line:
                for test in result.tests:
                    if test.test_id.endswith("::" + current_test_id):
                        if test.failure_message is None:
                            test.failure_message = stripped[:200]
                        break
    # The last failure section may run to the end of the output
    flush_failure_section()
    result.total = result.passed + result.failed + result.errors + result.skipped
    return result
def _update_test_traceback(tests: list[TestResult], test_id: str, traceback: str):
    """Attach a traceback to the first test whose node ID is, or ends with, test_id."""
    for test in tests:
        # endswith avoids attaching "test_create" output to "test_create_user"
        if test.test_id == test_id or test.test_id.endswith("::" + test_id):
            test.traceback = traceback
            return
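Scraping terminal output is inherently brittle. As an alternative sketch (not the approach this lesson's parser takes), pytest can write a JUnit-style XML report via its `--junitxml=PATH` option, which can then be read with the standard library. The function name `failures_from_junit_xml` and the `SAMPLE_REPORT` string are hypothetical; the sample is a hand-written, abbreviated report, not captured output.

```python
import xml.etree.ElementTree as ET


def failures_from_junit_xml(xml_text: str) -> list[dict]:
    """Return one dict per failed or errored test case in a JUnit XML report."""
    root = ET.fromstring(xml_text)
    failures = []
    for case in root.iter("testcase"):
        for kind in ("failure", "error"):
            node = case.find(kind)
            if node is not None:
                failures.append({
                    "test": f"{case.get('classname')}::{case.get('name')}",
                    "kind": kind,
                    "message": node.get("message", ""),
                    "details": (node.text or "").strip(),
                })
    return failures


# Hand-written sample for illustration only.
SAMPLE_REPORT = """<testsuites>
  <testsuite name="pytest" errors="0" failures="1" skipped="0" tests="2" time="0.05">
    <testcase classname="tests.test_users" name="test_create_user" time="0.001" />
    <testcase classname="tests.test_users" name="test_delete_user" time="0.002">
      <failure message="assert 1 == 2">tests/test_users.py:12: AssertionError</failure>
    </testcase>
  </testsuite>
</testsuites>"""

sample_failures = failures_from_junit_xml(SAMPLE_REPORT)
```

The trade-off is an extra file on disk per run in exchange for not depending on the exact layout of the console output.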
Targeted Test Running
Running the full test suite after every edit is expensive. For a codebase with 1000 tests, a full run can take minutes. The agent should run only the tests that are relevant to the changed code, and save the full suite for a final regression check.
"""
targeted_tests.py - Run only tests relevant to changed files.
"""
import ast
import subprocess
import re
from pathlib import Path
from typing import Optional
def find_tests_for_file(
    changed_file: str,
    repo_root: str,
    test_dir: str = "tests",
) -> list[str]:
    """
    Find test files that test the given source file.
    Strategies:
    1. Convention: src/users.py → tests/test_users.py
    2. Import analysis: test files that import the changed module
    (Symbol matching, i.e. test files that reference changed functions,
    is a natural third strategy but is not implemented here.)
    """
    module_name = Path(changed_file).stem  # e.g., "users" from "users.py"
    # Resolve to absolute paths so they stay valid when pytest runs with cwd=repo_root
    test_root = (Path(repo_root) / test_dir).resolve()
    relevant_tests: set[str] = set()
    # Strategy 1: Naming convention
    for pattern in (f"test_{module_name}*.py", f"{module_name}_test.py"):
        relevant_tests.update(str(m) for m in test_root.rglob(pattern))
    # Strategy 2: Import analysis
    for test_file in test_root.rglob("test_*.py"):
        if str(test_file) in relevant_tests:
            continue
        try:
            tree = ast.parse(test_file.read_text(encoding="utf-8", errors="replace"))
        except (SyntaxError, ValueError):
            continue
        if _imports_module(tree, module_name):
            relevant_tests.add(str(test_file))
    return sorted(relevant_tests)


def _imports_module(tree: ast.AST, module_name: str) -> bool:
    """True if the tree imports module_name as a whole dotted-path component."""
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            # "from pkg.users import x" or "from pkg import users"
            if module_name in (node.module or "").split("."):
                return True
            if any(alias.name == module_name for alias in node.names):
                return True
        elif isinstance(node, ast.Import):
            # "import pkg.users" (whole components only, so "superusers" does not match)
            if any(module_name in alias.name.split(".") for alias in node.names):
                return True
    return False
def run_targeted_tests(
    changed_files: list[str],
    repo_root: str,
    test_dir: str = "tests",
    fallback_to_all: bool = True,
) -> tuple[bool, str]:
    """
    Run tests relevant to the changed files.
    Returns (passed: bool, output: str).
    """
    all_relevant: set[str] = set()
    for f in changed_files:
        all_relevant.update(find_tests_for_file(f, repo_root, test_dir))
    if not all_relevant:
        if fallback_to_all:
            print("No targeted tests found, running full suite...")
            return run_full_suite(repo_root)
        return True, "No relevant tests found."
    # Run only the relevant test files (sorted for a stable command line)
    cmd = ["python", "-m", "pytest", "--tb=short", "-v"] + sorted(all_relevant)
    try:
        result = subprocess.run(
            cmd,
            cwd=repo_root,
            capture_output=True,
            text=True,
            timeout=120,
        )
    except subprocess.TimeoutExpired:
        # Report a hung run as a failure instead of crashing the agent loop
        return False, "Targeted test run timed out after 120 seconds."
    output = result.stdout + result.stderr
    passed = result.returncode == 0
    return passed, output
def run_full_suite(repo_root: str, timeout: int = 300) -> tuple[bool, str]:
    """Run the complete test suite."""
    result = subprocess.run(
        ["python", "-m", "pytest", "--tb=short", "-q"],
        cwd=repo_root,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    output = result.stdout + result.stderr
    passed = result.returncode == 0
    return passed, output
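`run_targeted_tests` needs a list of changed files. One plausible source, assumed here rather than taken from the lesson, is the output of `git diff --name-only`. The helper below (a hypothetical name) only classifies that text, so it can be tried without a repository.

```python
from pathlib import PurePosixPath


def split_changed_python_files(
    name_only_output: str,
    test_dir: str = "tests",
) -> tuple[list[str], list[str]]:
    """Split `git diff --name-only` output into (source_files, test_files).

    Source files can be mapped to tests by naming or imports;
    changed test files should simply be run directly.
    """
    sources, tests = [], []
    for raw in name_only_output.splitlines():
        path = PurePosixPath(raw.strip())
        if path.suffix != ".py":
            continue  # skip docs, configs, and blank lines
        is_test = (
            test_dir in path.parts
            or path.name.startswith("test_")
            or path.name.endswith("_test.py")
        )
        (tests if is_test else sources).append(str(path))
    return sources, tests


sample_diff = "src/users.py\ntests/test_users.py\nREADME.md\n\nsrc/billing/invoice.py\n"
changed_sources, changed_tests = split_changed_python_files(sample_diff)
```

Feeding `changed_sources` to `run_targeted_tests` and appending `changed_tests` to the pytest command keeps the run small while still covering tests the agent itself edited.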
Writing Tests as Planning
One of the most powerful agent strategies is writing tests before implementation - the classic TDD approach. When the agent writes a test first, it is forced to articulate exactly what behavior it expects. This serves as planning.
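To make "tests as planning" concrete, here is a small illustration with a hypothetical `slugify` task. The tests come first and settle questions the task description leaves open (case, punctuation, empty input). The implementation is then the least code that satisfies them. Function names and behavior are assumptions chosen for the example.

```python
import re


# Phase 1: the tests, written first. Each one pins down a decision.
def test_lowercases_and_joins_words_with_hyphens():
    assert slugify("Hello World") == "hello-world"


def test_collapses_punctuation_and_extra_spaces():
    assert slugify("  TDD: agents,  loops! ") == "tdd-agents-loops"


def test_empty_input_gives_empty_slug():
    assert slugify("") == ""


# Phase 2: the minimal implementation that makes the tests above pass.
def slugify(title: str) -> str:
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)
```

Run before Phase 2 exists, the tests fail with a `NameError`, which is the "confirm they FAIL" step of the workflow.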
Here is how to implement a TDD-first agent:
"""
tdd_agent.py - Agent that writes tests first, then implements.
"""
import anthropic
from dataclasses import dataclass
from typing import Optional
@dataclass
class TDDSession:
    """Tracks the state of a TDD agent session."""
    task: str
    test_file: Optional[str] = None
    implementation_file: Optional[str] = None
    test_iterations: int = 0
    impl_iterations: int = 0
    max_iterations: int = 5
TDD_SYSTEM_PROMPT = """You are an expert software engineer using Test-Driven Development.
Your workflow has two phases:
PHASE 1 - Write Tests First:
1. Read the task description carefully
2. Write a test file that defines the expected behavior
3. Tests should be concrete: use specific inputs and expected outputs
4. Include edge cases: empty inputs, boundary values, error cases
5. Run the tests to confirm they FAIL (proving the implementation doesn't exist yet)
PHASE 2 - Implement to Make Tests Pass:
1. Read your failing tests to understand exactly what is needed
2. Implement the minimum code to make ALL tests pass
3. Run tests after every implementation step
4. Do NOT add functionality not tested - keep scope tight
5. When all tests pass, you are done
Rules:
- Never skip Phase 1 - writing tests first improves your implementation
- Minimal implementation - just enough to pass tests, nothing more
- If tests keep failing after 3 attempts, re-read them and reconsider your approach
- Prefer simple solutions over clever ones"""
def run_tdd_agent(
    task: str,
    repo_path: str,
    tools: list[dict],
    execute_tool_fn,
    max_iterations: int = 20,
) -> str:
    """
    Run a TDD-first coding agent.
    The agent writes tests first, then implements.
    """
    client = anthropic.Anthropic()
    session = TDDSession(task=task)
    messages = [
        {
            "role": "user",
            "content": (
                f"Task: {task}\n\n"
                f"Repository: {repo_path}\n\n"
                "Start by writing tests that define the expected behavior, "
                "then implement the code to make them pass.\n\n"
                "Phase 1: Write tests first.\n"
                "Phase 2: Implement to make tests pass."
            ),
        }
    ]
    for step in range(max_iterations):
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=4096,
            system=TDD_SYSTEM_PROMPT,
            tools=tools,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return "TDD session completed."
        if response.stop_reason != "tool_use":
            # e.g. "max_tokens": stop rather than loop on a truncated turn
            return f"TDD agent stopped early (stop_reason={response.stop_reason})."
        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            tool_name = block.name
            result = execute_tool_fn(tool_name, block.input)
            # Track TDD phases: the first non-test file written marks the start of Phase 2
            if tool_name in ("write_file", "edit_file"):
                path = block.input.get("path", "")
                if "test" in path.lower():
                    session.test_file = path
                else:
                    session.implementation_file = path
            elif tool_name == "run_tests":
                if session.implementation_file is None:
                    session.test_iterations += 1
                else:
                    session.impl_iterations += 1
                # Nudge the agent once it has used up its implementation attempts
                if session.impl_iterations >= session.max_iterations:
                    backtrack_msg = (
                        "\n\nNOTE: You have made many implementation attempts. "
                        "Consider a completely different approach. "
                        "Re-read the failing tests carefully and think about "
                        "what they are actually testing."
                    )
                    result = str(result) + backtrack_msg
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result,
            })
        messages.append({"role": "user", "content": tool_results})
    return f"TDD agent stopped after {max_iterations} steps."
Handling Test Flakiness
Flaky tests are tests that sometimes pass and sometimes fail without any code change. They are a significant problem for agent loops because:
- The agent might think its edit broke something when the test was already flaky
- The agent might think its edit fixed something when the test happens to pass this run
"""
flakiness_handler.py - Detect and handle flaky tests in agent loops.
"""
import subprocess
import re
from dataclasses import dataclass, field
@dataclass
class TestRun:
    test_id: str
    passed: bool
    output: str


@dataclass
class FlakinessAnalysis:
    test_id: str
    runs: list[TestRun] = field(default_factory=list)

    @property
    def pass_rate(self) -> float:
        if not self.runs:
            return 0.0
        return sum(1 for r in self.runs if r.passed) / len(self.runs)

    @property
    def is_flaky(self) -> bool:
        """A test is flaky if it sometimes passes and sometimes fails."""
        if len(self.runs) < 2:
            return False
        passes = [r.passed for r in self.runs]
        return any(passes) and not all(passes)
def run_test_n_times(
    test_id: str,
    n: int = 3,
    cwd: str = ".",
) -> FlakinessAnalysis:
    """
    Run a specific test n times to detect flakiness.
    A test that fails consistently is a real failure.
    A test that passes on some runs and fails on others is flaky.
    """
    analysis = FlakinessAnalysis(test_id=test_id)
    for _ in range(n):
        try:
            result = subprocess.run(
                ["python", "-m", "pytest", test_id, "-v", "--tb=short", "--no-header"],
                cwd=cwd,
                capture_output=True,
                text=True,
                timeout=60,
            )
            passed = result.returncode == 0
            output = result.stdout + result.stderr
        except subprocess.TimeoutExpired:
            # Count a hung run as a failure instead of crashing the agent loop
            passed, output = False, "Test run timed out after 60 seconds."
        analysis.runs.append(TestRun(
            test_id=test_id,
            passed=passed,
            output=output,
        ))
    return analysis
def classify_test_failure(
    test_id: str,
    output: str,
    cwd: str = ".",
) -> dict:
    """
    Classify a test failure as real, flaky, or environment issue.
    Returns a dict with classification and recommendation.
    """
    # Heuristic patterns for common environment failures. They are kept
    # specific on purpose: a bare "timeout" also matches test names and log lines.
    env_patterns = [
        r"socket\.(timeout|error)",
        r"ConnectionRefused",
        r"TimeoutError",
        r"No such file or directory",
        r"PermissionError",
        r"ResourceExhausted",
        r"\bdocker\b",
        r"\bnetwork\b",
    ]
    for pattern in env_patterns:
        if re.search(pattern, output, re.IGNORECASE):
            return {
                "type": "environment",
                "confidence": "medium",  # pattern matching alone cannot be certain
                "recommendation": "This looks like an environment issue, not a code bug. Check if external dependencies are available.",
            }
    # Check for assertion errors (real failures)
    if "AssertionError" in output or "assert " in output:
        return {
            "type": "assertion",
            "confidence": "high",
            "recommendation": "Real failure - the code does not match expected behavior. Fix the implementation.",
        }
    # Run 3 times to detect flakiness
    analysis = run_test_n_times(test_id, n=3, cwd=cwd)
    if analysis.is_flaky:
        return {
            "type": "flaky",
            "pass_rate": analysis.pass_rate,
            "confidence": "high",
            "recommendation": (
                f"Test is flaky (passed {analysis.pass_rate:.0%} of the time). "
                "Do not treat this as a real failure. "
                "The test has an intermittent issue unrelated to your changes."
            ),
        }
    return {
        "type": "real_failure",
        "confidence": "medium",
        "recommendation": "Consistent failure. Read the traceback carefully and fix the root cause.",
    }
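One way to act on a "flaky" classification is to keep a list of known-flaky test IDs and leave them out when deciding whether a run blocks progress. The sketch below assumes such a list is maintained by hand; both function names and the sample test IDs are hypothetical.

```python
def split_failures(
    failed_test_ids: list[str],
    known_flaky: set[str],
) -> tuple[list[str], list[str]]:
    """Split failures into (blocking, ignored_as_flaky), preserving order."""
    blocking = [t for t in failed_test_ids if t not in known_flaky]
    ignored = [t for t in failed_test_ids if t in known_flaky]
    return blocking, ignored


def summarize_failures_for_agent(failed_test_ids: list[str], known_flaky: set[str]) -> str:
    """Build a short message that tells the agent which failures to act on."""
    blocking, ignored = split_failures(failed_test_ids, known_flaky)
    lines = []
    if blocking:
        lines.append(f"{len(blocking)} failure(s) to fix:")
        lines.extend(f"  {t}" for t in blocking)
    else:
        lines.append("No blocking failures.")
    if ignored:
        lines.append(f"{len(ignored)} known-flaky failure(s) ignored:")
        lines.extend(f"  {t}" for t in ignored)
    return "\n".join(lines)


flaky_list = {"tests/test_sync.py::test_retry_backoff"}
failed_now = [
    "tests/test_users.py::test_create_user",
    "tests/test_sync.py::test_retry_backoff",
]
agent_message = summarize_failures_for_agent(failed_now, flaky_list)
```

Ignored failures are still listed in the message so they remain visible to whoever reviews the run.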
Coverage-Guided Iteration
After the basic TDD loop passes, the agent can use coverage data to find untested paths:
"""
coverage_guided.py - Use pytest-cov to find untested code paths.
"""
import subprocess
import json
import re
from pathlib import Path
def run_with_coverage(
    test_path: str,
    source_path: str,
    cwd: str = ".",
) -> dict:
    """
    Run tests with coverage and return coverage data.
    Requires: pip install pytest-cov
    """
    coverage_json = Path(cwd) / "coverage.json"
    # Remove any stale report so a crashed run cannot return old data
    coverage_json.unlink(missing_ok=True)
    result = subprocess.run(
        [
            "python", "-m", "pytest",
            test_path,
            f"--cov={source_path}",
            "--cov-report=json",
            "--cov-report=term-missing",
            "--tb=short",
            "-q",
        ],
        cwd=cwd,
        capture_output=True,
        text=True,
        timeout=120,
    )
    coverage_data = {}
    if coverage_json.exists():
        with open(coverage_json) as f:
            coverage_data = json.load(f)
    return {
        "output": result.stdout + result.stderr,
        "passed": result.returncode == 0,
        "coverage": coverage_data,
    }
def find_uncovered_lines(coverage_data: dict, source_file: str) -> list[int]:
    """Extract uncovered line numbers for a specific source file."""
    files = coverage_data.get("files", {})
    # Try to find the file in coverage data
    for filepath, data in files.items():
        if source_file in filepath or Path(filepath).name == Path(source_file).name:
            return data.get("missing_lines", [])
    return []


def describe_uncovered_code(source_file: str, uncovered_lines: list[int]) -> str:
    """
    Describe uncovered code sections to help the agent write targeted tests.
    """
    if not uncovered_lines:
        return "All lines covered."
    content = Path(source_file).read_text(encoding="utf-8")
    lines = content.splitlines()
    # Group consecutive uncovered lines into sections
    sections = []
    current_section = []
    for line_num in sorted(uncovered_lines):
        if not current_section or line_num == current_section[-1] + 1:
            current_section.append(line_num)
        else:
            sections.append(current_section)
            current_section = [line_num]
    if current_section:
        sections.append(current_section)
    descriptions = []
    for section in sections:
        start = section[0]
        end = section[-1]
        # Get the code with surrounding context (two lines on each side)
        context_start = max(0, start - 3)
        context_end = min(len(lines), end + 2)
        code_section = "\n".join(
            f"{i+1:4d} {'>>>' if i+1 in section else '   '} {lines[i]}"
            for i in range(context_start, context_end)
        )
        descriptions.append(f"Lines {start}–{end}:\n{code_section}")
    return (
        f"Uncovered lines in {source_file}:\n\n" +
        "\n\n".join(descriptions) +
        f"\n\nTotal: {len(uncovered_lines)} uncovered lines in {len(sections)} sections."
    )
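Before describing uncovered lines file by file, the agent needs to decide which files deserve attention. The sketch below ranks files from the coverage JSON report by how little of them is covered. It assumes the per-file `summary.percent_covered` and `missing_lines` fields of coverage.py's JSON report; the function name and the `sample_coverage` data are hand-written for illustration.

```python
def least_covered_files(coverage_data: dict, limit: int = 3) -> list[tuple[str, float, int]]:
    """Return up to `limit` (path, percent_covered, missing_count) rows, worst first."""
    rows = []
    for path, data in coverage_data.get("files", {}).items():
        missing = data.get("missing_lines", [])
        if not missing:
            continue  # fully covered files need no new tests
        percent = data.get("summary", {}).get("percent_covered", 0.0)
        rows.append((path, percent, len(missing)))
    # Lowest coverage first; break ties by the larger number of missing lines
    rows.sort(key=lambda row: (row[1], -row[2]))
    return rows[:limit]


# Hand-written sample in the shape of a coverage JSON report (abbreviated).
sample_coverage = {
    "files": {
        "src/users.py": {"summary": {"percent_covered": 92.0}, "missing_lines": [40, 41]},
        "src/billing.py": {"summary": {"percent_covered": 55.0}, "missing_lines": [10, 11, 12, 30]},
        "src/utils.py": {"summary": {"percent_covered": 100.0}, "missing_lines": []},
    }
}
worst_files = least_covered_files(sample_coverage, limit=2)
```

The top entries can then be passed one at a time to `describe_uncovered_code`, which keeps each prompt focused on a single file.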
The Complete TDD Agent Loop: Full Implementation
Here is the complete, self-contained implementation of a TDD agent loop with all features integrated:
"""
complete_tdd_loop.py - Complete TDD agent with test output parsing,
targeted test running, flakiness detection, and backtracking.
Usage:
export ANTHROPIC_API_KEY=your-key
python complete_tdd_loop.py --repo /path/to/repo --task "Add a function..."
"""
import argparse
import subprocess
from pathlib import Path
from typing import Any
import anthropic
from coding_agent_tools import ( # from Lesson 04
read_file, write_file, edit_file,
list_directory, search_files, bash,
git_status, git_diff,
)
from test_output_parser import parse_pytest_output
# ── Tool schemas (abbreviated - use full schemas from Lesson 04) ──────────────
TOOLS = [
{"name": "read_file", "description": "Read a file", "input_schema": {"type": "object", "properties": {"path": {"type": "string"}}, "required": ["path"]}},
{"name": "write_file", "description": "Write a file", "input_schema": {"type": "object", "properties": {"path": {"type": "string"}, "content": {"type": "string"}}, "required": ["path", "content"]}},
{"name": "edit_file", "description": "Edit a file", "input_schema": {"type": "object", "properties": {"path": {"type": "string"}, "old_str": {"type": "string"}, "new_str": {"type": "string"}}, "required": ["path", "old_str", "new_str"]}},
{"name": "bash", "description": "Run shell command", "input_schema": {"type": "object", "properties": {"command": {"type": "string"}}, "required": ["command"]}},
{"name": "list_directory", "description": "List directory", "input_schema": {"type": "object", "properties": {"path": {"type": "string"}}, "required": ["path"]}},
{"name": "search_files", "description": "Search files", "input_schema": {"type": "object", "properties": {"pattern": {"type": "string"}, "directory": {"type": "string"}}, "required": ["pattern"]}},
{"name": "run_tests", "description": "Run tests and return parsed output", "input_schema": {"type": "object", "properties": {"test_path": {"type": "string"}, "test_filter": {"type": "string"}}, "required": []}},
]
SYSTEM_PROMPT = """You are an expert software engineer. You use tests as your primary feedback mechanism.
Core workflow:
1. Read and understand the task
2. Find existing tests related to the task (search_files for test files)
3. Run the relevant tests to see current state
4. Make an edit
5. Run tests again immediately
6. If tests fail, read the output carefully and try a different approach
7. If tests pass, run the full suite to check for regressions
8. Done when all tests pass
Key rules:
- ALWAYS run tests after every edit. Never mark a task complete without running tests.
- If the same approach fails 3 times, try a completely different approach.
- Read the FULL error message - the relevant line is usually at the bottom.
- When tests fail, read both the assertion error AND the traceback.
- Prefer targeted test runs (specific file or -k filter) over the full suite.
- Only run the full suite when targeted tests pass."""
class IterationTracker:
    """Track agent iterations and detect loops."""

    def __init__(self, max_same_approach: int = 3):
        self.approach_history: list[str] = []
        self.max_same = max_same_approach
        self.total_test_runs = 0
        self.consecutive_failures = 0
        self.last_failure_message = ""

    def record_test_run(self, output: str, passed: bool):
        self.total_test_runs += 1
        if passed:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    def should_backtrack(self) -> bool:
        return self.consecutive_failures >= self.max_same

    def get_backtrack_message(self) -> str:
        return (
            f"\n\nAGENT NOTE: You have had {self.consecutive_failures} consecutive test failures. "
            "This suggests your current approach is not working. "
            "Consider:\n"
            "1. Re-reading the failing test to understand what it ACTUALLY checks\n"
            "2. Reading the implementation file from scratch\n"
            "3. Reverting your changes and trying a completely different approach\n"
            "4. Checking if there are other related tests or files you haven't read"
        )
def execute_tool(name: str, inputs: dict[str, Any], repo_path: str, tracker: IterationTracker) -> str:
    """Execute a tool and return the result."""
    if name == "read_file":
        return read_file(inputs["path"])
    elif name == "write_file":
        return write_file(inputs["path"], inputs["content"])
    elif name == "edit_file":
        return edit_file(inputs["path"], inputs["old_str"], inputs["new_str"])
    elif name == "bash":
        return bash(inputs["command"])
    elif name == "list_directory":
        return list_directory(inputs.get("path", repo_path))
    elif name == "search_files":
        return search_files(inputs["pattern"], inputs.get("directory", repo_path))
    elif name == "run_tests":
        # Run tests and parse output; an empty test_path means "the whole repo"
        test_path = inputs.get("test_path") or "."
        test_filter = inputs.get("test_filter", "")
        cmd = ["python", "-m", "pytest", test_path, "--tb=short", "-v"]
        if test_filter:
            cmd.extend(["-k", test_filter])
        try:
            result = subprocess.run(
                cmd,
                cwd=repo_path,
                capture_output=True,
                text=True,
                timeout=120,
            )
        except subprocess.TimeoutExpired:
            # A hung test run counts as a failure and is reported to the agent
            tracker.record_test_run("", False)
            return "ERROR: test run timed out after 120 seconds."
        output = result.stdout + result.stderr
        passed = result.returncode == 0
        tracker.record_test_run(output, passed)
        # Parse for structured output; fall back to the raw tail if nothing was recognized
        parsed = parse_pytest_output(output)
        agent_view = parsed.format_for_agent() or output[-2000:]
        # Add backtrack message if stuck
        if tracker.should_backtrack():
            agent_view += tracker.get_backtrack_message()
        return agent_view
    else:
        return f"ERROR: Unknown tool: {name}"
def run_tdd_loop(task: str, repo_path: str, max_steps: int = 40, verbose: bool = True) -> str:
    """Run the complete TDD agent loop."""
    client = anthropic.Anthropic()
    tracker = IterationTracker()
    # Get initial context
    structure = list_directory(repo_path, max_depth=3)
    messages = [
        {
            "role": "user",
            "content": (
                f"Task: {task}\n\n"
                f"Repository root: {repo_path}\n\n"
                f"Structure:\n{structure}\n\n"
                "Use test-driven development. "
                "Find or write tests first, then implement. "
                "Run tests after every change."
            ),
        }
    ]
    for step in range(max_steps):
        if verbose:
            print(f"\n[Step {step + 1}/{max_steps}]", end=" ", flush=True)
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=4096,
            system=SYSTEM_PROMPT,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text") and block.text:
                    if verbose:
                        print(f"\nFinal: {block.text[:200]}")
                    return block.text
            return "Task completed."
        if response.stop_reason != "tool_use":
            # e.g. "max_tokens": stop rather than loop on a truncated turn
            return f"Stopped early (stop_reason={response.stop_reason})."
        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            if verbose:
                print(f"{block.name}", end=" ", flush=True)
            result = execute_tool(block.name, block.input, repo_path, tracker)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result,
            })
        messages.append({"role": "user", "content": tool_results})
    return f"Stopped after {max_steps} steps. Ran {tracker.total_test_runs} test runs."
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--repo", required=True)
    parser.add_argument("--task", required=True)
    parser.add_argument("--max-steps", type=int, default=40)
    args = parser.parse_args()
    result = run_tdd_loop(args.task, args.repo, args.max_steps)
    print("\n" + "="*60)
    print("RESULT:", result)
:::tip Always run tests after edits
Make it structurally impossible for the agent to complete a task without running tests. The system prompt should state this clearly. If the agent stops without having run the tests and seen them pass, inject a reminder. Some agent implementations track the last time run_tests was called and refuse to accept end_turn until tests have been run.
:::
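A gate of this kind can be a few lines of bookkeeping. The sketch below is one possible shape, not taken from any particular agent: `CompletionGate` is a hypothetical name, and the tool names match the `TOOLS` list used in this lesson. Edits made through the `bash` tool are not detected, which is a limitation of this simple version.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompletionGate:
    """Tracks whether the latest edit has been followed by a passing test run."""
    ran_tests: bool = False
    edited_since_last_run: bool = False
    last_run_passed: bool = False

    def record_tool_call(self, tool_name: str, tests_passed: bool = False) -> None:
        if tool_name in ("write_file", "edit_file"):
            self.edited_since_last_run = True
        elif tool_name == "run_tests":
            self.ran_tests = True
            self.edited_since_last_run = False
            self.last_run_passed = tests_passed

    def reminder(self) -> Optional[str]:
        """Return a message to send instead of accepting end_turn, or None to accept."""
        if not self.ran_tests:
            return "You have not run the tests yet. Run them before finishing."
        if self.edited_since_last_run:
            return "You edited files after the last test run. Run the tests again."
        if not self.last_run_passed:
            return "The last test run failed. Fix the failures or explain why they are unrelated."
        return None


gate = CompletionGate()
gate.record_tool_call("edit_file")
first_reminder = gate.reminder()
gate.record_tool_call("run_tests", tests_passed=True)
final_reminder = gate.reminder()
```

In the agent loop, a non-`None` reminder on `end_turn` would be appended as a user message and the loop continued, with a cap on how many reminders are sent so the agent cannot stall forever.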
:::warning Flaky tests can derail agents
If your codebase has flaky tests, the agent loop can get stuck: the test fails, the agent makes a change, the test passes (flakily), the agent thinks it fixed the problem. Later the test fails again. Identify and mark flaky tests with @pytest.mark.flaky (using the pytest-rerunfailures plugin) so agents can skip them or retry automatically.
:::
:::danger Don't use test output as the only feedback signal
Tests pass when the code does what the tests say. Tests don't check: code security, performance, readability, backwards compatibility, documentation accuracy, or behavior in production environments. Use tests as the primary signal for correctness, but always have a human review the actual diff before merging agent-written code.
:::
Interview Q&A
Q: Why is running tests the most powerful feedback technique for coding agents?
A: Tests provide unambiguous, automated, deterministic feedback about whether code is correct. An LLM cannot know if its code is right from reasoning alone - it has no ground truth. Tests provide that ground truth. The test either passes or fails. If it fails, the traceback and assertion error tell the agent exactly what went wrong and where. This transforms the task from one-shot generation ("generate correct code") into iterative convergence ("make edits until tests pass"), which is dramatically more reliable.
Q: What information should the agent extract from pytest output to guide its next action?
A: The most critical pieces are: (1) the summary line (N passed, M failed) to understand overall status; (2) for each failure, the test ID (so the agent knows which behavior is wrong); (3) the assertion error message (what the actual vs expected values were); (4) the last few lines of the traceback (which file and line number caused the failure). The agent should not be given the full verbose output - that's noise. The structured summary and failure details are what matter.
Q: How should a coding agent handle a situation where tests keep failing after multiple attempts?
A: Three strategies: (1) Backtracking - revert all changes and start from scratch with a different approach; (2) Diagnosis escalation - read more files, gather more context, re-read the failing test from scratch to make sure the understanding is correct; (3) Scope reduction - if a complex approach fails repeatedly, try the simplest implementation that could possibly work and iterate from there. The danger is the agent getting stuck in a local minimum - making small variations on the same wrong approach. A counter of consecutive failures guards against this: after N failures it either injects a strong prompt to change approach or forces a full revert.
Q: What is test-first agent development and why is it more reliable than implementation-first?
A: Test-first means the agent writes a test that defines expected behavior before writing any implementation. This is the TDD (Test-Driven Development) approach applied to agents. It is more reliable because: (1) writing the test forces the agent to articulate exactly what it needs to implement, which is planning; (2) the failing test provides a precise target for the implementation; (3) the agent can immediately verify whether its implementation is correct by running the test; (4) the test serves as documentation of the intended behavior. Agents that implement first often produce code that "looks right" but fails tests because the specification was ambiguous in their reasoning.
Q: How do you detect and handle flaky tests in an agent loop?
A: Run the failing test multiple times (2–3 runs) before treating it as a real failure. If the test passes on some runs and fails on others, it is flaky. The agent should log this, not treat it as a code failure, and report it to the human operator. For prevention: use pytest-rerunfailures to automatically retry flaky tests during the agent loop; maintain a known-flaky test list that the agent ignores when evaluating success; use the pytest-randomly plugin, which shuffles test order and accepts --randomly-seed to replay a given order, to detect ordering-dependent flakiness. The key is that the agent must distinguish between "my code is wrong" (consistent failure) and "this test is unreliable" (inconsistent failure).
