
How Coding Agents Work

In March 2024, Cognition AI published a number that made the software engineering world pay attention: 13.86% on SWE-bench. On the portion of the benchmark it was evaluated on, Devin had autonomously resolved nearly 1 in 7 real GitHub issues - tasks that require reading existing code, understanding intent, finding the relevant files, making the right changes, and passing the tests that accompany the fix.

By late 2024, Claude 3.5 Sonnet hit 49% on SWE-bench Verified. By 2025, Claude models were resolving more than half of the tasks in that benchmark without human intervention.

These systems are not autocomplete. They are not doing fancy text prediction and hoping it works. They are running a structured agent loop: understanding the task, navigating the codebase, planning a sequence of edits, executing those edits using real tools, running actual tests, reading the output, and iterating.

This lesson explains exactly how that works - architecturally, mechanically, and in code.


Why Coding Agents Were Inevitable

Before coding agents, the state of the art was code completion: you type, the model suggests the next few tokens. GitHub Copilot (June 2021) made this fast and accurate enough to be genuinely useful. But completion has a hard ceiling.

Completion works when you know what to write. It fails when you need to:

  • Find the right file to modify in a 50,000-line codebase
  • Understand how three modules interact before touching any of them
  • Diagnose a bug that manifests in one file but originates in another
  • Make changes, run tests, see what broke, and fix that too

These tasks require agency - the ability to take sequential actions, observe results, and adapt. The shift from completion to agents is not incremental. It is a different paradigm.

The key insight that unlocked coding agents was combining three things that individually existed but had never been put together:

  1. LLMs that can reason about code at a high level
  2. Tool use (function calling) that lets the LLM actually execute operations
  3. A structured loop that runs until the task is complete or a step or time budget runs out


Brief History: From Completion to Agents

June 2021 - GitHub Copilot launches. Code completion. Generates next-line suggestions using Codex.

February 2022 - AlphaCode (DeepMind) achieves competitive programming scores. Generates whole solutions, no iteration.

March 2023 - GPT-4 releases. Code quality dramatically improves. ChatGPT Code Interpreter is announced (broad rollout follows in July 2023) - the first mass-market tool that runs code in a sandbox.

October 2023 - SWE-bench paper published by Princeton and University of Chicago. For the first time, there is a rigorous way to evaluate agents on real software engineering tasks.

March 2024 - Devin (Cognition AI) announces 13.86% on SWE-bench. First agent to meaningfully close real GitHub issues autonomously.

August 2024 - SWE-bench Verified launches. Human-verified subset of 500 tasks.

October 2024 - The upgraded Claude 3.5 Sonnet achieves 49% on SWE-bench Verified.

February 2025 - Claude Code launches as a research preview. Runs in a terminal. Uses extended thinking. Integrates deeply with the developer's actual environment.

2025 - The SOTA continues to climb. The question is no longer "can agents code?" but "how do we integrate them into engineering workflows?"


The Coding Agent Architecture

At the highest level, a coding agent is an LLM in a loop with a set of tools and a mechanism to feed tool outputs back into the conversation.

This loop has five distinct phases. Let's walk through each one in detail.


Phase 1: Context Assembly

Before the LLM can reason about the task, it needs to understand the codebase. The core problem: most real codebases are far too large to fit in a context window.

A typical production codebase might have:

  • 100,000+ lines of Python
  • 500+ files
  • Complex interdependencies across modules
  • Test suites, configuration, documentation

Even at 200,000 tokens of context, you cannot fit all of this. The agent must be selective.
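A back-of-the-envelope estimate makes the gap concrete. The sketch below assumes roughly 10 tokens per line of Python. That ratio is an illustrative rule of thumb, not a measured figure; the real value depends on the tokenizer and the code style. The function names are hypothetical.

```python
# Rough context-budget arithmetic.
# ASSUMPTION: ~10 tokens per line of code (illustrative, not measured).
TOKENS_PER_LINE = 10
CONTEXT_WINDOW = 200_000


def estimate_tokens(lines_of_code: int, tokens_per_line: int = TOKENS_PER_LINE) -> int:
    """Estimate how many tokens a codebase of the given size would occupy."""
    return lines_of_code * tokens_per_line


def fraction_that_fits(lines_of_code: int, context_window: int = CONTEXT_WINDOW) -> float:
    """Fraction of the codebase that fits in the window under the estimate."""
    needed = estimate_tokens(lines_of_code)
    return min(1.0, context_window / needed) if needed else 1.0


print(estimate_tokens(100_000))                # 1000000
print(f"{fraction_that_fits(100_000):.0%}")    # 20%
```

Under this assumption, a 100,000-line codebase needs about five full context windows, before leaving any room for the task, tool results, or the model's own output.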

The Repo Map

The most effective technique is a repo map: a compact index of the codebase that captures the structure without the full content.

A repo map typically includes:

  • Every file path in the repository
  • For each file: class names, function names, method signatures
  • Import relationships between files

Here is what a repo map entry looks like:

src/api/users.py
  class UserService:
    + __init__(self, db: Database)
    + get_user(self, user_id: int) -> User
    + create_user(self, email: str, name: str) -> User
    + delete_user(self, user_id: int) -> bool
  def authenticate(token: str) -> User | None

This gives the LLM a navigational map. It can look at 500 entries like this and understand "I need to modify UserService.get_user in src/api/users.py" without reading every line of every file.

Building a Repo Map with Python's AST

import ast
import os
from pathlib import Path
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FunctionInfo:
    name: str
    args: list[str]
    return_annotation: Optional[str]
    is_method: bool = False
    is_classmethod: bool = False
    is_property: bool = False

@dataclass
class ClassInfo:
    name: str
    bases: list[str]
    methods: list[FunctionInfo] = field(default_factory=list)
    class_vars: list[str] = field(default_factory=list)

@dataclass
class FileInfo:
    path: str
    imports: list[str] = field(default_factory=list)
    classes: list[ClassInfo] = field(default_factory=list)
    functions: list[FunctionInfo] = field(default_factory=list)
    top_level_vars: list[str] = field(default_factory=list)

class RepoMapBuilder(ast.NodeVisitor):
    """Parse a Python file and extract structural information."""

    def __init__(self):
        self.classes: list[ClassInfo] = []
        self.functions: list[FunctionInfo] = []
        self.imports: list[str] = []
        self.top_level_vars: list[str] = []
        self._current_class: Optional[ClassInfo] = None

    def visit_Import(self, node: ast.Import):
        for alias in node.names:
            # Record the real module name, not the local alias, so the
            # import list can be matched against file paths later.
            self.imports.append(alias.name)

    def visit_ImportFrom(self, node: ast.ImportFrom):
        module = node.module or ""
        for alias in node.names:
            self.imports.append(f"{module}.{alias.name}")

    def visit_ClassDef(self, node: ast.ClassDef):
        # ast.unparse handles plain names, dotted names (a.b.C) and
        # subscripted bases such as Generic[T].
        bases = [ast.unparse(base) for base in node.bases]

        cls = ClassInfo(name=node.name, bases=bases)
        # Save and restore the enclosing class so that a nested class does
        # not cause later methods of the outer class to be filed as functions.
        outer_class = self._current_class
        self._current_class = cls
        self.generic_visit(node)  # visit methods
        self._current_class = outer_class
        self.classes.append(cls)

    def visit_FunctionDef(self, node: ast.FunctionDef):
        # Keep annotations ("user_id: int") so the map shows real signatures.
        args = []
        for arg in node.args.args:
            if arg.arg != "self" and arg.arg != "cls":
                args.append(ast.unparse(arg))

        # Get return annotation
        return_ann = None
        if node.returns:
            return_ann = ast.unparse(node.returns)

        # Detect decorators
        is_classmethod = any(
            (isinstance(d, ast.Name) and d.id == "classmethod")
            for d in node.decorator_list
        )
        is_property = any(
            (isinstance(d, ast.Name) and d.id == "property")
            for d in node.decorator_list
        )

        func = FunctionInfo(
            name=node.name,
            args=args,
            return_annotation=return_ann,
            is_method=self._current_class is not None,
            is_classmethod=is_classmethod,
            is_property=is_property,
        )

        if self._current_class is not None:
            self._current_class.methods.append(func)
        else:
            self.functions.append(func)
        # No generic_visit here: function bodies (nested defs, local
        # assignments) are deliberately left out of the map.

    visit_AsyncFunctionDef = visit_FunctionDef  # handle async too

    def visit_Assign(self, node: ast.Assign):
        for target in node.targets:
            if isinstance(target, ast.Name):
                if self._current_class is None:
                    self.top_level_vars.append(target.id)
                else:
                    self._current_class.class_vars.append(target.id)


def build_file_info(filepath: str) -> Optional[FileInfo]:
    """Parse a single Python file and return its structural info."""
    try:
        source = Path(filepath).read_text(encoding="utf-8", errors="replace")
        tree = ast.parse(source)
        visitor = RepoMapBuilder()
        visitor.visit(tree)
        return FileInfo(
            path=filepath,
            imports=visitor.imports,
            classes=visitor.classes,
            functions=visitor.functions,
            top_level_vars=visitor.top_level_vars,
        )
    except (SyntaxError, ValueError, OSError):
        # Unparseable or unreadable files are skipped rather than aborting the map.
        # (ValueError covers sources containing null bytes on older Pythons.)
        return None


def render_file_info(info: FileInfo, relative_to: str = "") -> str:
    """Render a FileInfo into a compact, readable repo map entry."""
    lines = []
    path = info.path
    if relative_to:
        path = os.path.relpath(info.path, relative_to)
    lines.append(path)

    for cls in info.classes:
        bases_str = f"({', '.join(cls.bases)})" if cls.bases else ""
        lines.append(f"  class {cls.name}{bases_str}:")
        for method in cls.methods:
            if method.is_classmethod:
                lines.append("    @classmethod")
            elif method.is_property:
                lines.append("    @property")
            shown = method.args[:4]  # truncate long arg lists
            if len(method.args) > 4:
                shown.append("...")
            # Put the receiver first; joining a list avoids "(self, )" for no-arg methods
            receiver = "cls" if method.is_classmethod else "self"
            params = ", ".join([receiver, *shown])
            ret = f" -> {method.return_annotation}" if method.return_annotation else ""
            lines.append(f"    + {method.name}({params}){ret}")

    for func in info.functions:
        shown = func.args[:4]
        if len(func.args) > 4:
            shown.append("...")
        ret = f" -> {func.return_annotation}" if func.return_annotation else ""
        lines.append(f"  def {func.name}({', '.join(shown)}){ret}")

    return "\n".join(lines)


def build_repo_map(root: str, max_files: int = 200) -> str:
    """Build a full repo map for the given directory."""
    skip_dirs = {"__pycache__", "node_modules", "venv"}
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden dirs (.git, .venv, ...) and common noise in place so
        # os.walk does not descend into them; sort for a deterministic map.
        dirnames[:] = sorted(
            d for d in dirnames
            if not d.startswith(".") and d not in skip_dirs
        )
        for fname in sorted(filenames):
            if not fname.endswith(".py"):
                continue
            if len(entries) >= max_files:
                # Stop the whole walk, not just the current directory
                return "\n\n".join(entries)
            filepath = os.path.join(dirpath, fname)
            info = build_file_info(filepath)
            if info:
                rendered = render_file_info(info, relative_to=root)
                if rendered.strip():
                    entries.append(rendered)

    return "\n\n".join(entries)

Relevance Filtering

A 200-file repo map can still run to several thousand tokens (roughly 5,000 to 10,000, depending on how many symbols each file defines). The agent should include only the files most relevant to the current task. Common strategies:

  1. Keyword matching - Find files whose function/class names match words in the task description
  2. Import graph traversal - If the task mentions UserService, include files that import it
  3. Embedding search - Embed the task description and retrieve the most similar file summaries
  4. History-based - Include files that were modified in recent commits related to similar issues
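The first strategy, keyword matching, can be sketched in a few lines. This is a minimal illustration, not how any particular agent ranks files: `tokenize` and `rank_entries` are hypothetical names, the scoring is a plain word-overlap count, and the `entries` dictionary stands in for rendered repo map entries keyed by path.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase words, splitting snake_case and CamelCase identifiers."""
    # Insert a space at lower-to-upper boundaries: "UserService" -> "User Service".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # Split on anything that is not a letter or digit; drop very short words.
    return {w for w in re.split(r"[^A-Za-z0-9]+", spaced.lower()) if len(w) > 2}


def rank_entries(task: str, entries: dict[str, str], top_k: int = 3) -> list[str]:
    """Return up to top_k paths whose map entry shares the most words with the task."""
    task_words = tokenize(task)
    scored = []
    for path, entry in entries.items():
        overlap = task_words & tokenize(path + " " + entry)
        if overlap:
            scored.append((len(overlap), path))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [path for _, path in scored[:top_k]]


# Hypothetical repo map entries, keyed by path.
entries = {
    "src/api/users.py": "class UserService: get_user create_user delete_user authenticate",
    "src/pricing/discounts.py": "def calculate_discount(order_total, is_vip)",
    "src/util/dates.py": "def parse_date(value)",
}
task = "The calculate_discount function returns 0 for VIP orders"
print(rank_entries(task, entries))  # ['src/pricing/discounts.py']
```

Plain overlap is crude: common words such as "the" count as much as a function name. A practical version would likely add a stopword list or weight identifier matches more heavily.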

Phase 2: The Coding Loop

Once the agent has context, it enters the reasoning-action loop. This is the ReAct pattern applied to software engineering.

The loop looks like this in pseudocode:

while not done and steps < max_steps:
    steps += 1
    response = llm(messages, tools)
    messages.append(response)

    if response.has_tool_calls:
        for tool_call in response.tool_calls:
            result = execute_tool(tool_call)
            messages.append(tool_result(result))
    else:
        done = True
        final_response = response.text

The LLM decides what to do. The tool execution is deterministic. The results come back into context. The LLM sees them and decides what to do next.
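To see the control flow run without an API key, the loop can be exercised with a scripted stand-in for the model. Everything here is an assumption made for illustration: `Response`, `ToolCall`, `run_loop`, and `scripted_llm` are hypothetical names, and the message format is simplified compared to a real provider API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Response:
    text: str = ""
    tool_calls: list[ToolCall] = field(default_factory=list)


def run_loop(
    llm: Callable[[list[dict]], Response],
    tools: dict[str, Callable[..., str]],
    task: str,
    max_steps: int = 10,
) -> tuple[str, list[dict]]:
    """Call the model, execute any requested tools, feed results back, repeat."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        response = llm(messages)
        messages.append({"role": "assistant", "content": response.text})
        if not response.tool_calls:
            return response.text, messages  # no tool requested: the model is done
        for call in response.tool_calls:
            result = tools[call.name](**call.args)
            messages.append({"role": "tool", "name": call.name, "content": result})
    return "stopped: step budget exhausted", messages


def scripted_llm(messages: list[dict]) -> Response:
    """Stand-in for a real model: read one file, then finish."""
    if not any(m["role"] == "tool" for m in messages):
        return Response(
            text="I should read the file first.",
            tool_calls=[ToolCall("read_file", {"path": "discounts.py"})],
        )
    return Response(text="Done: the file has been inspected.")


FAKE_FILES = {"discounts.py": "def calculate_discount(): ..."}
tools = {"read_file": lambda path: FAKE_FILES.get(path, "ERROR: not found")}

final, history = run_loop(scripted_llm, tools, "Inspect discounts.py")
print(final)         # Done: the file has been inspected.
print(len(history))  # 4
```

The history ends up as user, assistant, tool, assistant: each tool result is appended before the model is called again, which is what lets it react to what it observed.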


Phase 3: Code Understanding Tools

Before making any edit, the agent needs to understand the code. The core understanding tools:

| Tool | Purpose |
|------|---------|
| read_file(path) | Read a specific file |
| search_files(pattern) | Find where a symbol, string, or pattern appears |
| list_directory(path) | See the directory structure |
| get_file_info(path) | Get file size, type, line count without full read |

The agent uses these to navigate to the relevant code before touching anything. A well-designed agent will always read before writing.
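Two of these tools, `search_files` and `get_file_info`, are small enough to sketch directly. The signatures below are assumptions for illustration (the extra `root`, `glob`, and `max_hits` parameters are not part of the table above). Note that `get_file_info` still reads the file locally; the point is that only the metadata, not the content, is returned to the model.

```python
import re
import tempfile
from pathlib import Path


def search_files(pattern: str, root: str = ".", glob: str = "*.py", max_hits: int = 50) -> list[str]:
    """Return 'path:line: text' for each line under root that matches the regex."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob(glob)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append(f"{path.as_posix()}:{lineno}: {line.strip()}")
                if len(hits) >= max_hits:  # cap the output to protect the context window
                    return hits
    return hits


def get_file_info(path: str) -> dict:
    """Return size, line count, and suffix without returning the content itself."""
    p = Path(path)
    data = p.read_bytes()
    lines = data.count(b"\n")
    if data and not data.endswith(b"\n"):
        lines += 1  # count a final line that lacks a trailing newline
    return {"path": p.as_posix(), "bytes": len(data), "lines": lines, "suffix": p.suffix}


# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "discounts.py"
    sample.write_text("def calculate_discount(total):\n    return 0.0\n", encoding="utf-8")
    hits = search_files(r"calculate_discount", tmp)
    info = get_file_info(str(sample))

print(len(hits), info["lines"])  # 1 2
```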


Phase 4: Edit Generation and Execution

Once the agent knows what needs to change, it generates an edit. There are several strategies (covered in depth in Lesson 03), but the most common for production agents is search-and-replace:

old_string: "the exact current code"
new_string: "the replacement code"

This is surgical. It changes exactly what needs to change and leaves everything else alone.


Phase 5: Verification

After every edit, the agent runs tests to verify:

result = run_tests("pytest tests/test_users.py -v")
# → "PASSED 12, FAILED 1"
# → Traceback: AssertionError at line 45

The test output becomes the next input to the LLM. If tests pass, the task is done. If they fail, the LLM reasons about the failure and plans the next edit.

This is what separates coding agents from code generation. The feedback loop turns a one-shot guess into an iterative engineering process.
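A verification step is easier to act on when the raw test output is reduced to a few structured fields. The sketch below assumes pytest's usual final summary line (for example `1 failed, 12 passed in 0.34s`) and its convention that exit code 0 means every test passed. `VerifyResult`, `parse_pytest_output`, and `run_tests` are hypothetical names, not part of pytest.

```python
import re
import subprocess
from dataclasses import dataclass


@dataclass
class VerifyResult:
    passed: int
    failed: int
    errors: int
    ok: bool    # True only when the test process exited with code 0
    tail: str   # end of the raw output, which is what the LLM would be shown


_COUNT = re.compile(r"(\d+) (passed|failed|errors?)\b")


def parse_pytest_output(output: str, returncode: int, tail_chars: int = 2000) -> VerifyResult:
    """Extract pass/fail counts from the last line of output that contains them."""
    counts = {"passed": 0, "failed": 0, "error": 0}
    for line in reversed(output.splitlines()):
        found = _COUNT.findall(line)
        if found:
            for number, label in found:
                counts[label.rstrip("s")] += int(number)  # "errors" -> "error"
            break
    return VerifyResult(
        passed=counts["passed"],
        failed=counts["failed"],
        errors=counts["error"],
        ok=(returncode == 0),
        tail=output[-tail_chars:],
    )


def run_tests(command: list[str], timeout: int = 120) -> VerifyResult:
    """Run a test command such as ["pytest", "tests/", "-v"] and parse its output."""
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return parse_pytest_output(proc.stdout + proc.stderr, proc.returncode)


# Parsing a captured sample rather than invoking pytest.
sample_output = (
    "tests/test_users.py::test_get_user PASSED\n"
    "tests/test_users.py::test_delete_user FAILED\n"
    "=========== 1 failed, 12 passed in 0.34s ===========\n"
)
result = parse_pytest_output(sample_output, returncode=1)
print(result.passed, result.failed, result.ok)  # 12 1 False
```

Relying on the exit code for `ok`, rather than on the parsed counts, keeps the success check robust even if the summary format differs from what the regex expects.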


A Real Trajectory: Fixing a Bug

Let me walk through what a real coding agent trajectory looks like on a concrete bug.

Task: "The calculate_discount function returns 0 for orders over $500 when the VIP flag is set. It should return 20%."

Step 1 - Agent reads the repo map, sees src/pricing/discounts.py has a calculate_discount function.

Step 2 - Agent calls read_file("src/pricing/discounts.py").

# Agent reads this:
def calculate_discount(order_total: float, is_vip: bool = False) -> float:
    if order_total > 500:
        return 0.0 # BUG: should be 0.20 for VIP
    if is_vip:
        return 0.15
    return 0.0

Step 3 - Agent calls search_files("calculate_discount", "tests/") to find the relevant test file.

Step 4 - Agent reads the test file, finds the failing assertion:

def test_vip_large_order():
    assert calculate_discount(600, is_vip=True) == 0.20

Step 5 - Agent reasons: "The function checks order_total > 500 first and returns 0.0 without checking is_vip. The condition order needs to be fixed."

Step 6 - Agent calls edit_file:

old_str: "    if order_total > 500:\n        return 0.0 # BUG: should be 0.20 for VIP\n    if is_vip:\n        return 0.15"
new_str: "    if is_vip and order_total > 500:\n        return 0.20\n    if is_vip:\n        return 0.15"

Step 7 - Agent calls run_tests("pytest tests/test_discounts.py -v").

Output: PASSED 8, FAILED 0.

Step 8 - Task complete.

Total steps: 8 (seven actions plus the final report). Total time: under a minute. A human engineer might take 15 minutes to navigate the same codebase and find the same bug.


Minimal Coding Agent: Complete Implementation

Here is a complete, runnable coding agent using the Anthropic API. This implements all the phases described above.

"""
minimal_coding_agent.py - A complete coding agent using the Anthropic API.

Usage:
    export ANTHROPIC_API_KEY=your-key
    python minimal_coding_agent.py --repo /path/to/repo --task "Fix the bug in calculate_discount"
"""

import os
import subprocess
import argparse
from pathlib import Path
from typing import Any

import anthropic
# ── Tool definitions ──────────────────────────────────────────────────────────

TOOLS = [
    {
        "name": "read_file",
        "description": "Read the contents of a file. Returns the file content with line numbers.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Absolute or relative path to the file."}
            },
            "required": ["path"],
        },
    },
    {
        "name": "write_file",
        "description": "Write content to a file, creating it if it does not exist.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Path to the file to write."},
                "content": {"type": "string", "description": "The full content to write."},
            },
            "required": ["path", "content"],
        },
    },
    {
        "name": "edit_file",
        "description": (
            "Make a surgical edit to a file by replacing an exact string. "
            "old_str must match EXACTLY (including whitespace and indentation). "
            "Use read_file first to see the exact current content."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "old_str": {"type": "string", "description": "The exact string to find and replace."},
                "new_str": {"type": "string", "description": "The string to replace it with."},
            },
            "required": ["path", "old_str", "new_str"],
        },
    },
    {
        "name": "bash",
        "description": "Run a shell command. Use this to run tests, grep, find files, install packages, etc.",
        "input_schema": {
            "type": "object",
            "properties": {
                "command": {"type": "string", "description": "The shell command to run."},
                "timeout": {"type": "integer", "description": "Timeout in seconds (default 30).", "default": 30},
            },
            "required": ["command"],
        },
    },
    {
        "name": "list_directory",
        "description": "List files and directories in a given path.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Directory path to list."},
                "max_depth": {"type": "integer", "description": "Max recursion depth (default 2).", "default": 2},
            },
            "required": ["path"],
        },
    },
]

# ── Tool implementations ──────────────────────────────────────────────────────

def tool_read_file(path: str) -> str:
    """Read a file and return its content with line numbers."""
    p = Path(path)
    if not p.exists():
        return f"ERROR: File not found: {path}"
    if not p.is_file():
        return f"ERROR: Not a file: {path}"
    try:
        content = p.read_text(encoding="utf-8", errors="replace")
        lines = content.splitlines()
        numbered = "\n".join(f"{i+1:4d} | {line}" for i, line in enumerate(lines))
        return f"File: {path}\nLines: {len(lines)}\n\n{numbered}"
    except Exception as e:
        return f"ERROR reading {path}: {e}"


def tool_write_file(path: str, content: str) -> str:
    """Write content to a file."""
    p = Path(path)
    try:
        p.parent.mkdir(parents=True, exist_ok=True)
        p.write_text(content, encoding="utf-8")
        lines = len(content.splitlines())
        return f"Written {lines} lines to {path}"
    except Exception as e:
        return f"ERROR writing {path}: {e}"


def tool_edit_file(path: str, old_str: str, new_str: str) -> str:
    """Make a surgical search-and-replace edit to a file."""
    p = Path(path)
    if not p.exists():
        return f"ERROR: File not found: {path}"
    if not old_str:
        return "ERROR: old_str must not be empty."
    try:
        content = p.read_text(encoding="utf-8", errors="replace")
        if old_str not in content:
            # Helpful error: show lines resembling the first non-blank line of old_str.
            # The probe is stripped so an indentation mismatch still produces a hint.
            lines = content.splitlines()
            probe = next((l.strip() for l in old_str.splitlines() if l.strip()), "")[:20]
            similar = [f" {i+1}: {l}" for i, l in enumerate(lines) if probe and probe in l]
            hint = "\nSimilar lines found:\n" + "\n".join(similar[:5]) if similar else ""
            return f"ERROR: old_str not found in {path}.{hint}\nMake sure to copy the exact text including whitespace."
        count = content.count(old_str)
        if count > 1:
            return f"ERROR: old_str appears {count} times in {path}. Provide more context to make it unique."
        new_content = content.replace(old_str, new_str, 1)
        p.write_text(new_content, encoding="utf-8")
        return f"Edit applied to {path}. Replaced {len(old_str)} chars with {len(new_str)} chars."
    except Exception as e:
        return f"ERROR editing {path}: {e}"


def tool_bash(command: str, timeout: int = 30) -> str:
    """Run a shell command and return stdout + stderr plus the exit code."""
    try:
        result = subprocess.run(
            command,
            shell=True,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        output = ""
        if result.stdout:
            output += result.stdout
        if result.stderr:
            output += "\nSTDERR:\n" + result.stderr
        if not output.strip():
            output = "(no output)"
        # Truncate very long output, keeping the head and the tail
        if len(output) > 8000:
            output = output[:4000] + "\n\n[... truncated ...]\n\n" + output[-2000:]
        # Report the exit code so the model can tell a failing run from a passing one
        return f"{output}\n[exit code: {result.returncode}]"
    except subprocess.TimeoutExpired:
        return f"ERROR: Command timed out after {timeout}s: {command}"
    except Exception as e:
        return f"ERROR running command: {e}"


def tool_list_directory(path: str, max_depth: int = 2) -> str:
    """List directory contents as a tree."""
    p = Path(path)
    if not p.exists():
        return f"ERROR: Path not found: {path}"
    if not p.is_dir():
        return f"ERROR: Not a directory: {path}"

    lines = [str(p)]
    IGNORE = {".git", "__pycache__", "node_modules", ".venv", "venv", ".idea", ".mypy_cache"}

    def _walk(directory: Path, prefix: str, depth: int):
        if depth > max_depth:
            return
        try:
            # Filter before sorting so the "last item" connector is computed
            # over the entries that are actually printed.
            items = sorted(
                (
                    item for item in directory.iterdir()
                    if item.name not in IGNORE and not item.name.startswith(".")
                ),
                key=lambda x: (x.is_file(), x.name),
            )
        except PermissionError:
            return
        for i, item in enumerate(items):
            is_last = i == len(items) - 1
            connector = "└── " if is_last else "├── "
            lines.append(f"{prefix}{connector}{item.name}")
            if item.is_dir() and depth < max_depth:
                extension = "    " if is_last else "│   "
                _walk(item, prefix + extension, depth + 1)

    _walk(p, "", 1)
    return "\n".join(lines)


def execute_tool(tool_name: str, tool_input: dict[str, Any]) -> str:
    """Dispatch a tool call to the appropriate implementation."""
    try:
        if tool_name == "read_file":
            return tool_read_file(tool_input["path"])
        elif tool_name == "write_file":
            return tool_write_file(tool_input["path"], tool_input["content"])
        elif tool_name == "edit_file":
            return tool_edit_file(tool_input["path"], tool_input["old_str"], tool_input["new_str"])
        elif tool_name == "bash":
            return tool_bash(tool_input["command"], tool_input.get("timeout", 30))
        elif tool_name == "list_directory":
            return tool_list_directory(tool_input["path"], tool_input.get("max_depth", 2))
        else:
            return f"ERROR: Unknown tool: {tool_name}"
    except KeyError as e:
        # Report a missing argument to the model instead of crashing the loop
        return f"ERROR: Missing required argument {e} for tool {tool_name}"

# ── System prompt ─────────────────────────────────────────────────────────────

SYSTEM_PROMPT = """You are an expert software engineer. You have been given a task to complete on a codebase.

Your approach:
1. Understand the task completely before making any changes
2. Explore the codebase using list_directory and read_file to understand the structure
3. Find the relevant files using bash (grep, find) before reading them
4. Make minimal, surgical changes using edit_file (not whole-file rewrites)
5. Always run tests after making changes to verify correctness
6. If tests fail, read the error carefully and fix the root cause

Rules:
- Read before you write. Never modify a file you haven't read.
- Prefer edit_file over write_file for existing files - it's safer.
- Run tests frequently - after every meaningful change.
- If you're unsure which file to modify, grep for the relevant function name first.
- Keep your changes minimal. Don't refactor code unless asked.

When complete, summarize what you changed and why."""

# ── Main agent loop ───────────────────────────────────────────────────────────

def run_coding_agent(
    task: str,
    repo_path: str,
    max_steps: int = 30,
    verbose: bool = True,
) -> str:
    """Run the coding agent on a task."""
    client = anthropic.Anthropic()

    # Work from the repository root so that relative paths and shell commands
    # (for example "pytest tests/") resolve against the repo, not the caller's cwd.
    repo_path = os.path.abspath(repo_path)
    os.chdir(repo_path)

    # Get initial repo structure for context
    repo_structure = tool_list_directory(repo_path, max_depth=3)

    # Build initial message
    initial_message = f"""Task: {task}

Repository root: {repo_path}

Repository structure:
{repo_structure}

Please complete the task. Start by exploring the codebase to understand the relevant code, then make the necessary changes."""

    messages = [{"role": "user", "content": initial_message}]

    step = 0
    while step < max_steps:
        step += 1
        if verbose:
            print(f"\n{'='*60}")
            print(f"Step {step}")
            print('='*60)

        # Call the LLM
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=4096,
            system=SYSTEM_PROMPT,
            tools=TOOLS,
            messages=messages,
        )

        if verbose:
            # Print any text the agent produces
            for block in response.content:
                if hasattr(block, "text") and block.text:
                    print(f"\nAgent: {block.text}")

        # Append assistant response to history
        messages.append({"role": "assistant", "content": response.content})

        # Execute tool calls
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type != "tool_use":
                    continue
                tool_name = block.name
                tool_input = block.input
                if verbose:
                    print(f"\n Tool: {tool_name}")
                    if tool_name in ("read_file", "list_directory"):
                        print(f" Input: {tool_input.get('path', '')}")
                    elif tool_name in ("edit_file", "write_file"):
                        print(f" Input: {tool_input.get('path', '')} (edit)")
                    elif tool_name == "bash":
                        print(f" Command: {tool_input.get('command', '')[:80]}")

                result = execute_tool(tool_name, tool_input)

                if verbose:
                    preview = result[:200] + "..." if len(result) > 200 else result
                    print(f" Result: {preview}")

                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })

            messages.append({"role": "user", "content": tool_results})
            continue

        # Any other stop reason ends the run. "end_turn" is the normal finish;
        # "max_tokens" means the reply was cut off, so looping again would send
        # a history that ends on an incomplete assistant turn.
        final_text = "\n".join(
            block.text for block in response.content
            if hasattr(block, "text") and block.text
        )
        if response.stop_reason != "end_turn":
            return f"Agent stopped early (stop_reason={response.stop_reason}). {final_text}".strip()
        return final_text or "Task completed."

    return f"Agent stopped after {max_steps} steps without completing the task."


def main():
    parser = argparse.ArgumentParser(description="Minimal coding agent")
    parser.add_argument("--repo", required=True, help="Path to the repository")
    parser.add_argument("--task", required=True, help="The coding task to complete")
    parser.add_argument("--max-steps", type=int, default=30, help="Maximum agent steps")
    args = parser.parse_args()

    result = run_coding_agent(
        task=args.task,
        repo_path=args.repo,
        max_steps=args.max_steps,
        verbose=True,
    )
    print("\n" + "="*60)
    print("FINAL RESULT:")
    print("="*60)
    print(result)


if __name__ == "__main__":
    main()

Comparative Architecture: Claude Code vs Devin vs Cursor

Claude Code

Claude Code runs as a terminal application. Its key architectural choices:

  • No IDE required - It operates in your terminal, reading your filesystem directly
  • Extended thinking - Uses Claude's extended thinking mode for complex planning steps
  • Search-replace edits - Uses the old_str/new_str pattern for surgical edits
  • Streaming output - Shows its thinking in real time as it works
  • Permission model - Asks for confirmation before destructive operations

Devin (Cognition AI)

Devin runs in a full virtual machine:

  • Complete isolation - Has its own VM with browser, terminal, IDE
  • Long-horizon planning - Can work on tasks over hours, not just minutes
  • Browser use - Can look up documentation, search Stack Overflow
  • Human checkpoint - Can ask for clarification when stuck

Cursor Agent

Cursor is a standalone editor built as a fork of VS Code:

  • Codebase index - Maintains a live embedding index of your codebase
  • Tab completion + agent - Combines completion and agent in one interface
  • Composer - Multi-file editing sessions with rollback
  • Context awareness - Uses the editor's language servers for type info

What the Agent Sees vs What Humans See

A key insight: coding agents don't "see" code the way humans do. Humans skim, pattern-match visually, and use spatial memory. Agents read linearly through text.

This means the agent's performance is heavily influenced by:

  • Code comments - Agents use comments to understand intent
  • Function naming - Clear names reduce the reasoning burden
  • File organization - Well-organized code is easier to navigate
  • Test quality - Tests that clearly explain expected behavior are invaluable

Production Notes

:::tip Context window economics
Every token in context costs money. A repo map of 200 files might cost 5,000 to 10,000 tokens. Reading 5 full files might cost 15,000 more. Plan your context usage. Don't read files speculatively - read them because you have a reason.
:::

:::warning The edit-verify gap
The most common agent failure is making an edit and not verifying it. Always run tests after every non-trivial change. An agent that edits and then marks the task complete without verifying is not a coding agent - it is code generation with extra steps.
:::
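One way to close the edit-verify gap mechanically is to have the harness track whether tests have been run since the last successful edit, and to push back if the model tries to finish with unverified changes. The sketch below is an illustration under stated assumptions: `VerificationGate` is a hypothetical helper, the tool names (`edit_file`, `write_file`, `bash`) and the `ERROR` result prefix follow the minimal agent in this lesson, and "tests were run" is approximated as "a bash command containing `pytest` was executed".

```python
from dataclasses import dataclass, field

EDIT_TOOLS = {"edit_file", "write_file"}


@dataclass
class VerificationGate:
    """Tracks files edited since the last test run."""
    unverified_edits: list[str] = field(default_factory=list)

    def record(self, tool_name: str, tool_input: dict, result: str) -> None:
        if tool_name in EDIT_TOOLS and not result.startswith("ERROR"):
            # A successful edit invalidates any earlier verification.
            self.unverified_edits.append(tool_input.get("path", "?"))
        elif tool_name == "bash" and "pytest" in tool_input.get("command", ""):
            # Tests were run after the edits. Whether they passed is for the
            # model to judge from the output; the gate only checks they ran.
            self.unverified_edits.clear()

    def may_finish(self) -> bool:
        return not self.unverified_edits

    def reminder(self) -> str:
        files = ", ".join(sorted(set(self.unverified_edits)))
        return f"You edited {files} but have not run the tests since. Run them before finishing."


gate = VerificationGate()
gate.record("edit_file", {"path": "src/pricing/discounts.py"}, "Edit applied to src/pricing/discounts.py.")
print(gate.may_finish())  # False
print(gate.reminder())
gate.record("bash", {"command": "pytest tests/test_discounts.py -v"}, "8 passed")
print(gate.may_finish())  # True
```

In an agent loop, the harness would call `record` after each tool execution and, when the model stops with unverified edits, send `reminder()` back as a user message instead of returning.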


Interview Q&A

Q: What is the fundamental difference between code completion (Copilot) and a coding agent?

A: Code completion is stateless one-shot prediction: given a prefix, predict the next tokens. It has no ability to observe whether its output is correct. A coding agent is a multi-step loop: it reads the codebase, plans changes, executes them using tools, observes the results (test output, errors), and iterates. The feedback loop is the key difference - agents can be wrong on step 1 and correct themselves by step 7.

Q: How do coding agents handle large codebases that don't fit in a context window?

A: Several techniques work together: (1) Repo maps - compact indexes of file structure, class names, and function signatures that give the LLM navigational awareness without full file content; (2) Retrieval - using grep or embeddings to find relevant files before reading them; (3) Selective reading - only reading files that are actually relevant to the task; (4) Import graph traversal - following import chains to understand dependencies.

Q: What is the SWE-bench benchmark and what does a 50% score mean?

A: SWE-bench is a dataset of 2,294 real GitHub issues from 12 Python repositories. Each task pairs an issue with the tests from the pull request that resolved it. An agent is given the issue description and the repository, and must change the code so that those tests pass. A 50% score means the agent resolved half of the benchmark's tasks without human assistance. That is a strong result, but the tasks are self-contained issues in well-tested open-source projects, so it should not be read as automating half of a developer's bug-fixing work.

Q: Why is the search-replace edit strategy preferred over whole-file replacement?

A: Whole-file replacement is problematic because: (1) It requires the agent to regenerate the entire file correctly, including parts it didn't need to change; (2) It is expensive in tokens; (3) It risks introducing regressions in unchanged parts. Search-replace is surgical - it changes exactly what needs to change and leaves everything else intact. The downside is that the old_str must match exactly, which requires the agent to read the current content precisely.

Q: What makes a coding agent fail on a task?

A: Common failure modes: (1) Context exhaustion - the agent fills its context window with irrelevant files and can no longer think clearly; (2) Edit errors - old_str not matching due to whitespace differences; (3) Incorrect diagnosis - fixing the wrong thing because the agent misunderstood the root cause; (4) Test blindness - not running tests or ignoring failing tests; (5) Scope creep - trying to refactor or improve code beyond what the task requires; (6) Infinite loops - the agent gets stuck repeating the same edit-fail cycle without changing its approach.
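The last failure mode, repeating the same edit-fail cycle, is one a harness can detect cheaply. A minimal sketch, assuming the harness sees every tool call and its result: `RepeatDetector` is a hypothetical helper that fingerprints each (tool, input, result) triple and flags it once it has been seen a configurable number of times.

```python
import hashlib
import json
from collections import Counter


class RepeatDetector:
    """Flags a tool call that keeps producing the same result for the same input."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def _key(self, tool_name: str, tool_input: dict, result: str) -> str:
        # sort_keys makes the fingerprint independent of dict ordering
        payload = json.dumps([tool_name, tool_input, result], sort_keys=True, default=str)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def observe(self, tool_name: str, tool_input: dict, result: str) -> bool:
        """Record one call; return True once it has occurred max_repeats times."""
        key = self._key(tool_name, tool_input, result)
        self.seen[key] += 1
        return self.seen[key] >= self.max_repeats


detector = RepeatDetector(max_repeats=3)
call = (
    "edit_file",
    {"path": "a.py", "old_str": "x", "new_str": "y"},
    "ERROR: old_str not found in a.py.",
)
flags = [detector.observe(*call) for _ in range(3)]
print(flags)  # [False, False, True]
```

When the detector fires, the harness can stop the run or inject a message asking the model to re-read the file and try a different approach, rather than spending the remaining step budget on the same failing call.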

© 2026 EngineersOfAI. All rights reserved.