Building Your Own Coding Agent
Cursor, Copilot, Claude Code - all coding agents under the hood. Complex UIs, polished developer experiences, years of engineering. But the core of each one is the same thing: an LLM in a loop with a set of tools that read and modify code.
You can build one. Not in months. Today.
A functional coding agent - one that can read a codebase, plan edits, execute them, run tests, and iterate - takes about 500 lines of Python. That is what this lesson builds, step by step, explaining every architectural decision along the way.
By the end, you will have a working coding agent you can point at any repository.
Why Build Your Own?
Before diving in, the honest case for building your own agent rather than using Claude Code or Cursor:
Specialization - Generic agents make generic trade-offs. Your codebase might benefit from custom tools (a specific testing framework, an internal code review system, domain-specific linting). Custom agents can have these built in.
Integration - You might want the agent embedded in your CI pipeline, triggered by GitHub issues, or running in a custom IDE plugin. Claude Code runs in a terminal; your use case might not.
Cost and control - When you control the agent, you control the token budget, the model selection, the retry logic, and the logging.
Understanding - Building something from scratch is the deepest way to understand how it works. After this lesson, you will be able to evaluate the claims coding agent vendors make about their products.
Architecture Decisions Before Writing Code
Five decisions to make before writing the first line:
Decision 1: Raw API or Framework?
Raw Anthropic API - You write the loop, the tool dispatch, the message history management. Complete control. No abstractions to fight. About 100 lines to get a working agent.
LangChain/LangGraph - Framework handles message history, tool dispatch, and graph-based flows. More code to write up front (wiring), but easier to add complex behaviors later.
For this lesson: Raw API. Frameworks add indirection that obscures how things work. Once you understand the raw version, frameworks become trivial.
Decision 2: Which Tools?
Minimum viable tool set:
- read_file - absolutely required
- edit_file - required (or write_file for new files)
- bash - most powerful tool, required for running tests
- list_directory - required for navigation
- search_files - required for large codebases
Optional but valuable:
- get_python_symbols - speeds up navigation
- git_status / git_diff - context about what changed
- find_definition - avoids manual grep
Decision 3: Context Strategy?
Always include full repo map - Easy to implement. Uses 2,000–10,000 tokens per call. Works for codebases up to ~500 files.
Retrieval-based - Embed the task, find relevant files, only include those. Harder to implement. Necessary for very large codebases (1000+ files).
For this lesson: Repo map with relevance filtering. Good balance of simplicity and effectiveness.
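As a rough illustration of relevance filtering, the sketch below scores a handful of in-memory files by how many task keywords each one contains and keeps the best matches. The helper name `rank_files` and the sample data are hypothetical, chosen only for this example; the lesson's own repo map is built later.

```python
def rank_files(files: dict[str, str], keywords: list[str], top_n: int = 2) -> list[str]:
    """Return the top_n paths, most keyword hits first, ties broken alphabetically."""
    def score(source: str) -> int:
        lowered = source.lower()
        # Count how many distinct keywords occur at least once in the file
        return sum(kw.lower() in lowered for kw in keywords)

    ranked = sorted(files, key=lambda path: (-score(files[path]), path))
    return ranked[:top_n]


# Hypothetical file contents standing in for a real repository
sample = {
    "billing/discount.py": "def calculate_discount(price, percent): ...",
    "users/models.py": "class User: ...",
    "billing/invoice.py": "def total(invoice): return apply(discount)",
}
top = rank_files(sample, ["discount", "price"])
```

Substring matching like this is crude (it ignores word boundaries and synonyms), but it is often enough to push the obviously relevant files to the top of the map.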
Decision 4: Edit Strategy?
Search-and-replace (edit_file with old_str/new_str). As established in Lesson 03, this is the right default.
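A minimal sketch of the search-and-replace idea, operating on a string rather than a file. The function name `apply_edit` is an assumption made for this example; the point is that an edit is only applied when `old_str` matches exactly once, so the model cannot silently change the wrong occurrence.

```python
def apply_edit(content: str, old_str: str, new_str: str) -> str:
    """Replace old_str with new_str, requiring exactly one match."""
    count = content.count(old_str)
    if count == 0:
        raise ValueError("old_str not found")
    if count > 1:
        raise ValueError(f"old_str appears {count} times; add more context")
    return content.replace(old_str, new_str, 1)


source = "def area(w, h):\n    return w + h\n"
fixed = apply_edit(source, "return w + h", "return w * h")
```

In an agent, the two error cases would typically be returned to the model as tool output so it can re-read the file and retry with a more specific `old_str`.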
Decision 5: How to Handle Context Exhaustion?
When the conversation history grows too large, requests approach the model's context limit and the agent's reasoning degrades. Two strategies:
- Sliding window - Drop old messages, keep recent tool calls
- Summarization - Ask the LLM to summarize what it has done so far, compress history
For this lesson: sliding window with a configurable message limit.
The Minimal 100-Line Agent
Before building the full version, here is the minimal agent that actually works - 100 lines, everything essential, nothing extra:
"""
minimal_agent_v0.py - The simplest possible coding agent.
This version shows the core structure before adding features.
"""
import subprocess
from pathlib import Path

import anthropic

client = anthropic.Anthropic()

TOOLS = [
    {"name": "read_file", "description": "Read a file.", "input_schema": {"type": "object", "properties": {"path": {"type": "string"}}, "required": ["path"]}},
    {"name": "write_file", "description": "Write a file.", "input_schema": {"type": "object", "properties": {"path": {"type": "string"}, "content": {"type": "string"}}, "required": ["path", "content"]}},
    {"name": "edit_file", "description": "Edit: replace old_str with new_str.", "input_schema": {"type": "object", "properties": {"path": {"type": "string"}, "old_str": {"type": "string"}, "new_str": {"type": "string"}}, "required": ["path", "old_str", "new_str"]}},
    {"name": "bash", "description": "Run a shell command.", "input_schema": {"type": "object", "properties": {"command": {"type": "string"}}, "required": ["command"]}},
]


def execute(name, inputs):
    """Run one tool call and return its output as a string."""
    try:
        if name == "read_file":
            lines = Path(inputs["path"]).read_text().splitlines()
            return "\n".join(f"{i+1:4d} | {l}" for i, l in enumerate(lines))
        elif name == "write_file":
            Path(inputs["path"]).write_text(inputs["content"])
            return f"Wrote {inputs['path']}"
        elif name == "edit_file":
            p = Path(inputs["path"])
            content = p.read_text()
            if inputs["old_str"] not in content:
                return "ERROR: old_str not found. Use read_file first."
            p.write_text(content.replace(inputs["old_str"], inputs["new_str"], 1))
            return "Edit applied."
        elif name == "bash":
            r = subprocess.run(inputs["command"], shell=True, capture_output=True, text=True, timeout=30)
            return (r.stdout + r.stderr) or "(no output)"
    except Exception as e:
        # Report failures (missing file, timeout, ...) to the model instead of crashing the loop
        return f"ERROR: {e}"
    return f"Unknown tool: {name}"


def run(task, repo):
    messages = [{"role": "user", "content": f"Task: {task}\nRepo: {repo}\n\nComplete the task."}]
    for _ in range(30):
        r = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=4096,
            system="You are a coding agent. Read files before editing. Run tests to verify.",
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": r.content})
        # Stop on anything other than a tool request (end_turn, max_tokens, ...);
        # sending an empty tool_result list back would be rejected by the API
        if r.stop_reason != "tool_use":
            return next((b.text for b in r.content if hasattr(b, "text")), "Done.")
        results = [
            {"type": "tool_result", "tool_use_id": b.id, "content": execute(b.name, b.input)}
            for b in r.content
            if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    return "Max steps reached."


if __name__ == "__main__":
    import sys

    print(run(sys.argv[1], sys.argv[2]))
That is a working coding agent. It reads files, writes files, makes edits, runs shell commands, and iterates. Point it at a real repository with a real task and it will work.
Now let's add everything that makes it production-quality.
Adding the Repo Map
The biggest improvement to the minimal agent is context management. Without a repo map, the agent must ask "what files exist?" on every run. With a repo map, it starts with a compact overview of the entire codebase.
"""
repo_map.py - Build a compact codebase index using AST.
"""
import ast
import os
from pathlib import Path
from typing import Optional

IGNORE_DIRS = {".git", "__pycache__", "node_modules", ".venv", "venv", ".idea", "dist", "build"}


def get_python_symbols(source: str) -> list[str]:
    """Extract top-level symbols from Python source."""
    symbols = []
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return symbols
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            # Direct methods only; ast.walk would also pick up nested helper functions
            methods = [
                f"    +{m.name}()"
                for m in node.body
                if isinstance(m, (ast.FunctionDef, ast.AsyncFunctionDef))
            ][:6]
            symbols.append(f"  class {node.name}:")
            symbols.extend(methods)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = [a.arg for a in node.args.args if a.arg not in ("self", "cls")]
            arg_str = ", ".join(args[:3]) + ("..." if len(args) > 3 else "")
            symbols.append(f"  def {node.name}({arg_str})")
    return symbols


def build_repo_map(root: str, max_files: int = 150, task_keywords: Optional[list[str]] = None) -> str:
    """
    Build a compact repo map.
    If task_keywords provided, prioritizes files likely relevant to the task.
    """
    all_entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk does not descend into them
        dirnames[:] = [d for d in dirnames if d not in IGNORE_DIRS and not d.startswith(".")]
        for fname in sorted(filenames):
            if not fname.endswith(".py"):
                continue
            filepath = os.path.join(dirpath, fname)
            rel_path = os.path.relpath(filepath, root)
            try:
                source = Path(filepath).read_text(encoding="utf-8", errors="replace")
                symbols = get_python_symbols(source)
                # Relevance score for prioritization
                score = 0
                if task_keywords:
                    content_lower = source.lower()
                    score = sum(kw.lower() in content_lower for kw in task_keywords)
                all_entries.append((score, rel_path, symbols))
            except Exception:
                continue

    # Sort by relevance (highest first), then alphabetically
    all_entries.sort(key=lambda x: (-x[0], x[1]))

    lines = [f"Repository map for: {root}", "=" * 50]
    count = 0
    for score, rel_path, symbols in all_entries:
        if count >= max_files:
            remaining = len(all_entries) - count
            lines.append(f"\n... and {remaining} more files (use search_files to find others)")
            break
        if symbols:
            lines.append(f"\n{rel_path}")
            lines.extend(symbols[:8])  # Cap symbols per file
        else:
            lines.append(f"{rel_path}")
        count += 1
    return "\n".join(lines)
Context Window Management
The agent's conversation history grows with every tool call. At some point, the history becomes so large that the LLM cannot reason effectively - it is processing thousands of tokens of old tool results when it only needs the recent ones.
"""
context_manager.py - Manage conversation history to stay within context limits.
"""


def estimate_tokens(messages: list[dict]) -> int:
    """
    Rough token estimate: ~4 characters per token.
    This is a heuristic - actual count depends on tokenizer.
    """
    total_chars = 0
    for msg in messages:
        content = msg.get("content", "")
        if isinstance(content, str):
            total_chars += len(content)
        elif isinstance(content, list):
            for block in content:
                if isinstance(block, dict):
                    total_chars += len(str(block.get("content", "")))
                    total_chars += len(str(block.get("text", "")))
                else:
                    # Assistant content from the SDK is a list of objects, not dicts
                    total_chars += len(str(block))
    return total_chars // 4


def sliding_window_trim(
    messages: list[dict],
    max_tokens: int = 80_000,
    keep_first: int = 1,  # Always keep the initial user message
    keep_last: int = 8,   # Always keep the most recent N messages (use an even number)
) -> list[dict]:
    """
    Trim message history using a sliding window.
    Keeps the first message (initial task) and the most recent messages.
    Drops middle messages when context gets too large.
    """
    if estimate_tokens(messages) <= max_tokens:
        return messages
    # Always keep: first N + last N messages
    if len(messages) <= keep_first + keep_last:
        return messages
    kept = messages[:keep_first] + messages[-keep_last:]
    # If still too large, be more aggressive: drop the oldest assistant/user
    # pair. Dropping in pairs keeps every tool_result next to its tool_use.
    while estimate_tokens(kept) > max_tokens and len(kept) > keep_first + 2:
        kept = kept[:keep_first] + kept[keep_first + 2:]
    return kept


def summarize_history(
    messages: list[dict],
    client,
    keep_last: int = 6,
) -> list[dict]:
    """
    When context is very large, ask the LLM to summarize what it has done so far.
    Replace old messages with the summary.
    """
    if len(messages) <= keep_last + 1:
        return messages
    old_messages = messages[1:-keep_last]  # Skip first and last N
    recent_messages = messages[-keep_last:]
    # Ask for a summary
    summary_response = client.messages.create(
        model="claude-haiku-4-5",  # Use fast model for summarization
        max_tokens=1000,
        messages=[
            {
                "role": "user",
                "content": (
                    "Summarize what this coding agent has done so far in 3-5 bullet points. "
                    "Include: which files were read, what changes were made, what tests were run, "
                    "and current status.\n\n"
                    "Messages to summarize:\n" +
                    str(old_messages)[:8000]
                ),
            }
        ],
    )
    summary_text = summary_response.content[0].text
    summary_message = {
        "role": "user",
        "content": f"[CONTEXT SUMMARY - Actions taken so far:]\n{summary_text}\n[END SUMMARY - continuing from here:]",
    }
    return [messages[0], summary_message] + recent_messages
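One detail matters for any trimming scheme: the Messages API expects each tool_result block to follow the assistant message that holds the matching tool_use, so a trimmed window should start on an assistant turn. The sketch below uses plain dicts as stand-in messages and a hypothetical helper, `trim_in_pairs`, that keeps the first message plus the last few assistant/user pairs. It assumes the history ends on a user message, as it does just before each API call in the agent loop.

```python
def trim_in_pairs(messages: list[dict], keep_pairs: int = 2) -> list[dict]:
    """Keep the first message plus the last keep_pairs (assistant, user) pairs.

    Assumes keep_pairs >= 1 and that messages alternate
    user, assistant, user, ... and end on a user message.
    """
    head, rest = messages[:1], messages[1:]
    # rest alternates assistant, user, ..., so an even-sized tail
    # always begins with an assistant message.
    return head + rest[-2 * keep_pairs:]


# Stand-in history: the task, then four tool_use / tool_result round trips
history = [{"role": "user", "content": "Task: fix the bug"}]
for step in range(1, 5):
    history.append({"role": "assistant", "content": [{"type": "tool_use", "id": f"t{step}"}]})
    history.append({"role": "user", "content": [{"type": "tool_result", "tool_use_id": f"t{step}"}]})

trimmed = trim_in_pairs(history, keep_pairs=2)
roles = [m["role"] for m in trimmed]
```

After trimming, the first surviving tool_result still sits directly after its own tool_use, which is the property an odd-sized or single-message drop would break.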
System Prompt Design
The system prompt is not boilerplate. It determines how the agent behaves. Here is what an effective coding agent system prompt must communicate:
def build_system_prompt(
    repo_path: str,
    repo_map: str,
    task_context: str = "",
) -> str:
    """Build the system prompt for a coding agent."""
    # Built outside the f-string below: a backslash inside an f-string
    # expression is a SyntaxError before Python 3.12.
    task_section = f"## Task Context\n{task_context}" if task_context else ""
    return f"""You are an expert software engineer working on a codebase at: {repo_path}

## Your Approach

**Before any edit:**
1. Understand the full task
2. Use list_directory and the repo map to orient yourself
3. Use search_files or bash with grep to find relevant code
4. Read ALL relevant files before touching anything

**Making edits:**
1. Use edit_file for surgical changes (never rewrite whole files unnecessarily)
2. Include enough surrounding context in old_str to ensure uniqueness
3. Make the MINIMAL change - do not refactor, rename, or "improve" unrelated code
4. Verify with read_file after editing if the change is complex

**After every edit:**
1. Run tests immediately - use bash("pytest tests/relevant_test.py -v --tb=short")
2. If tests fail, read the FULL error message (especially the bottom - that's where the assertion is)
3. Fix the root cause, not the symptom

**When stuck:**
1. Re-read the failing test - understand what it is ACTUALLY checking
2. Re-read the implementation from scratch
3. Try a different approach entirely
4. If same approach fails 3 times, it is wrong - backtrack

## Codebase Overview
{repo_map}

{task_section}

## Rules
- Never mark a task complete without running tests first
- Never make edits speculatively - read first, understand, then edit
- Prefer edit_file over write_file for existing files
- If a bash command fails, read the error and adjust
- Keep changes minimal - only change what is necessary"""
The Complete 500-Line Coding Agent
Here is the full implementation with all features integrated. This is copy-paste runnable:
"""
coding_agent.py - Complete coding agent with repo map, context management,
safety checks, and TDD loop.

Usage:
    export ANTHROPIC_API_KEY=sk-ant-...
    python coding_agent.py --repo /path/to/repo --task "Fix the bug in calculate_discount"

    # With max steps:
    python coding_agent.py --repo . --task "..." --max-steps 40
"""
import argparse
import ast
import os
import re
import subprocess
import sys
from pathlib import Path
from typing import Any, Optional

import anthropic

# ─────────────────────────────────────────────────────────────────────────────
# Configuration
# ─────────────────────────────────────────────────────────────────────────────
MAX_FILE_SIZE = 500_000
MAX_OUTPUT_LENGTH = 10_000
IGNORE_DIRS = {".git", "__pycache__", "node_modules", ".venv", "venv", "dist", "build", ".mypy_cache"}
BLOCKED_COMMANDS = [
    r"rm\s+-rf\s+/",
    r":\(\)\{.*\}",
    r"dd\s+.*of=/dev/",
    r"mkfs\.",
]
# ─────────────────────────────────────────────────────────────────────────────
# Repo Map
# ─────────────────────────────────────────────────────────────────────────────
def _parse_python_symbols(source: str) -> list[str]:
    symbols = []
    try:
        tree = ast.parse(source)
        for node in tree.body:
            if isinstance(node, ast.ClassDef):
                symbols.append(f"  class {node.name}:")
                for child in node.body:
                    if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                        args = [a.arg for a in child.args.args if a.arg not in ("self", "cls")]
                        symbols.append(f"    + {child.name}({', '.join(args[:3])})")
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = [a.arg for a in node.args.args if a.arg not in ("self", "cls")]
                symbols.append(f"  def {node.name}({', '.join(args[:3])})")
    except SyntaxError:
        pass
    return symbols[:15]  # Cap to avoid massive repo maps


def build_repo_map(root: str, task_keywords: Optional[list[str]] = None, max_files: int = 120) -> str:
    entries = []
    root_path = Path(root)
    for filepath in sorted(root_path.rglob("*.py")):
        rel = filepath.relative_to(root_path)
        # Check only the parts inside the repo, so a repo that itself lives
        # under a directory named e.g. "build" is not skipped entirely
        if any(d in rel.parts for d in IGNORE_DIRS):
            continue
        try:
            source = filepath.read_text(encoding="utf-8", errors="replace")
        except Exception:
            continue
        symbols = _parse_python_symbols(source)
        score = sum(kw.lower() in source.lower() for kw in (task_keywords or []))
        entries.append((score, str(rel), symbols))
    entries.sort(key=lambda x: (-x[0], x[1]))
    lines = []
    for _, rel_path, symbols in entries[:max_files]:
        lines.append(f"\n{rel_path}")
        lines.extend(symbols)
    remaining = len(entries) - max_files
    if remaining > 0:
        lines.append(f"\n... ({remaining} more files - use search_files to find them)")
    return "\n".join(lines) if lines else "(no Python files found)"
# ─────────────────────────────────────────────────────────────────────────────
# Tool Implementations
# ─────────────────────────────────────────────────────────────────────────────
def _truncate(s: str, max_len: int = MAX_OUTPUT_LENGTH) -> str:
    if len(s) <= max_len:
        return s
    # Keep the start and the end; test summaries and tracebacks end at the bottom
    head, tail = s[:max_len // 2], s[-(max_len // 4):]
    omitted = len(s) - len(head) - len(tail)
    return f"{head}\n\n[... {omitted} chars truncated ...]\n\n{tail}"


def tool_read_file(path: str, start_line: int = 1, end_line: Optional[int] = None) -> str:
    p = Path(path)
    if not p.exists():
        return f"ERROR: File not found: {path}\nTip: Use list_directory to check what files exist."
    if not p.is_file():
        return f"ERROR: Not a file: {path}"
    if p.stat().st_size > MAX_FILE_SIZE:
        return (
            f"ERROR: File too large ({p.stat().st_size:,} bytes). "
            f"Use start_line/end_line to read sections. "
            f"First use get_python_symbols to find the right line range."
        )
    try:
        content = p.read_text(encoding="utf-8", errors="replace")
        lines = content.splitlines()
        s = max(0, start_line - 1)
        e = min(len(lines), end_line) if end_line else len(lines)
        numbered = "\n".join(f"{i+s+1:5d} | {line}" for i, line in enumerate(lines[s:e]))
        return f"File: {path} ({len(lines)} lines total, showing {s+1}–{e})\n\n{numbered}"
    except Exception as ex:
        return f"ERROR: {ex}"


def tool_write_file(path: str, content: str) -> str:
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(content, encoding="utf-8")
    return f"Wrote {len(content.splitlines())} lines to {path}"


def tool_edit_file(path: str, old_str: str, new_str: str) -> str:
    p = Path(path)
    if not p.exists():
        return f"ERROR: File not found: {path}"
    content = p.read_text(encoding="utf-8", errors="replace")
    if old_str not in content:
        # Show similar lines for debugging
        first_line = old_str.split("\n")[0].strip()[:40]
        similar = [
            f"  Line {i+1}: {line[:80]}"
            for i, line in enumerate(content.split("\n"))
            if first_line and first_line.lower() in line.lower()
        ][:5]
        hint = ("\n\nSimilar lines:\n" + "\n".join(similar)) if similar else ""
        return (
            f"ERROR: old_str not found in {path}.{hint}\n"
            "Fix: use read_file to get the exact current content, then copy precisely."
        )
    count = content.count(old_str)
    if count > 1:
        return f"ERROR: old_str appears {count} times. Add more context to make it unique."
    p.write_text(content.replace(old_str, new_str, 1), encoding="utf-8")
    return f"Edit applied: {len(old_str.splitlines())} line(s) → {len(new_str.splitlines())} line(s) in {path}"
def tool_bash(command: str, timeout: int = 30, working_dir: Optional[str] = None) -> str:
    for pattern in BLOCKED_COMMANDS:
        if re.search(pattern, command):
            return f"ERROR: Command blocked (dangerous pattern): {command}"
    timeout = min(max(1, timeout), 120)
    try:
        r = subprocess.run(
            command, shell=True, capture_output=True, text=True,
            timeout=timeout, cwd=working_dir,
            env={**os.environ, "TERM": "dumb", "FORCE_COLOR": "0"},
        )
        out = (r.stdout or "") + (("\nSTDERR:\n" + r.stderr) if r.stderr else "")
        if not out.strip():
            out = f"(command completed, exit code {r.returncode})"
        elif r.returncode != 0:
            out += f"\n[Exit code: {r.returncode}]"
        return _truncate(out)
    except subprocess.TimeoutExpired:
        return f"ERROR: Timed out after {timeout}s. Use a more specific command."
    except Exception as ex:
        return f"ERROR: {ex}"


def tool_search_files(
    pattern: str,
    directory: str = ".",
    file_type: Optional[str] = None,
    context_lines: int = 2,
) -> str:
    # Try ripgrep first
    cmd = ["rg", "-n", f"-A{context_lines}", f"-B{context_lines}"]
    if file_type:
        cmd.extend(["--type", file_type])
    cmd.extend([pattern, directory])
    try:
        r = subprocess.run(cmd, capture_output=True, text=True, timeout=20)
        if r.stdout:
            return _truncate(r.stdout)
        if not r.stdout and not r.stderr:
            return f"No matches for: {pattern}"
    except FileNotFoundError:
        pass
    # Fallback to grep
    glob = f"--include=*.{file_type}" if file_type else "--include=*.py"
    cmd = ["grep", "-rn", glob, f"-A{context_lines}", f"-B{context_lines}", pattern, directory]
    r = subprocess.run(cmd, capture_output=True, text=True, timeout=20)
    return _truncate(r.stdout) if r.stdout else f"No matches for: {pattern}"
def tool_list_directory(path: str, max_depth: int = 3) -> str:
    p = Path(path)
    if not p.exists():
        return f"ERROR: Not found: {path}"
    if not p.is_dir():
        return f"ERROR: Not a directory: {path}"
    lines = [str(p) + "/"]

    def walk(d: Path, prefix: str, depth: int):
        if depth > max_depth:
            return
        try:
            items = sorted(d.iterdir(), key=lambda x: (x.is_file(), x.name))
        except PermissionError:
            return
        visible = [i for i in items if not i.name.startswith(".") and i.name not in IGNORE_DIRS]
        for i, item in enumerate(visible):
            last = i == len(visible) - 1
            lines.append(f"{prefix}{'└── ' if last else '├── '}{item.name}{'/' if item.is_dir() else ''}")
            if item.is_dir():
                walk(item, prefix + ("    " if last else "│   "), depth + 1)

    walk(p, "", 1)
    return "\n".join(lines)


def tool_run_tests(test_path: str = ".", test_filter: Optional[str] = None, timeout: int = 120) -> str:
    # sys.executable runs pytest with the same interpreter as the agent
    cmd = [sys.executable, "-m", "pytest", test_path, "--tb=short", "-v", "--no-header"]
    if test_filter:
        cmd.extend(["-k", test_filter])
    try:
        r = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        output = r.stdout + (r.stderr or "")
        return _truncate(output)
    except subprocess.TimeoutExpired:
        return f"ERROR: Tests timed out after {timeout}s. Try test_filter to narrow the scope."
    except Exception as ex:
        return f"ERROR: {ex}"


def tool_git_status(repo_path: str = ".") -> str:
    r = subprocess.run(["git", "status", "--short"], cwd=repo_path, capture_output=True, text=True)
    return r.stdout or "(clean)"


def tool_git_diff(repo_path: str = ".", filepath: Optional[str] = None) -> str:
    cmd = ["git", "diff"]
    if filepath:
        cmd.extend(["--", filepath])
    r = subprocess.run(cmd, cwd=repo_path, capture_output=True, text=True)
    return _truncate(r.stdout) if r.stdout else "No unstaged changes."


def tool_get_python_symbols(filepath: str) -> str:
    p = Path(filepath)
    if not p.exists():
        return f"ERROR: Not found: {filepath}"
    try:
        source = p.read_text(encoding="utf-8", errors="replace")
        tree = ast.parse(source)
    except SyntaxError as e:
        return f"Syntax error: {e}"

    def signature(fn) -> str:
        args = [a.arg for a in fn.args.args if a.arg not in ("self", "cls")]
        ret = f" -> {ast.unparse(fn.returns)}" if fn.returns else ""
        return f"{fn.name}({', '.join(args[:4])}){ret} [line {fn.lineno}]"

    lines = []
    # Iterate top-level nodes only: methods are listed once under their class,
    # and module-level functions are reported too
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            lines.append(f"\nclass {node.name} [line {node.lineno}]")
            lines.extend(
                f"  + {signature(child)}"
                for child in node.body
                if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef))
            )
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            lines.append(f"def {signature(node)}")
    if not lines:
        return f"No symbols found in {filepath}"
    return f"Symbols in {filepath}:\n" + "\n".join(lines)
# ─────────────────────────────────────────────────────────────────────────────
# Tool Registry
# ─────────────────────────────────────────────────────────────────────────────
TOOL_SCHEMAS = [
    {"name": "read_file", "description": "Read file contents with line numbers. Use start_line/end_line for large files.", "input_schema": {"type": "object", "properties": {"path": {"type": "string"}, "start_line": {"type": "integer"}, "end_line": {"type": "integer"}}, "required": ["path"]}},
    {"name": "write_file", "description": "Write/create a file. For existing files, prefer edit_file.", "input_schema": {"type": "object", "properties": {"path": {"type": "string"}, "content": {"type": "string"}}, "required": ["path", "content"]}},
    {"name": "edit_file", "description": "Surgical search-replace. old_str must appear EXACTLY ONCE. Use read_file first.", "input_schema": {"type": "object", "properties": {"path": {"type": "string"}, "old_str": {"type": "string"}, "new_str": {"type": "string"}}, "required": ["path", "old_str", "new_str"]}},
    {"name": "bash", "description": "Run shell command. Use for: tests, grep, find, git, linting. Timeout max 120s.", "input_schema": {"type": "object", "properties": {"command": {"type": "string"}, "timeout": {"type": "integer"}, "working_dir": {"type": "string"}}, "required": ["command"]}},
    {"name": "search_files", "description": "Search across files with ripgrep. Returns matches with context.", "input_schema": {"type": "object", "properties": {"pattern": {"type": "string"}, "directory": {"type": "string"}, "file_type": {"type": "string"}, "context_lines": {"type": "integer"}}, "required": ["pattern"]}},
    {"name": "list_directory", "description": "List directory as a tree.", "input_schema": {"type": "object", "properties": {"path": {"type": "string"}, "max_depth": {"type": "integer"}}, "required": ["path"]}},
    {"name": "run_tests", "description": "Run pytest. Use test_filter (pytest -k) to run specific tests.", "input_schema": {"type": "object", "properties": {"test_path": {"type": "string"}, "test_filter": {"type": "string"}, "timeout": {"type": "integer"}}, "required": []}},
    {"name": "git_status", "description": "Show modified files.", "input_schema": {"type": "object", "properties": {"repo_path": {"type": "string"}}, "required": []}},
    {"name": "git_diff", "description": "Show current changes as a diff.", "input_schema": {"type": "object", "properties": {"repo_path": {"type": "string"}, "filepath": {"type": "string"}}, "required": []}},
    {"name": "get_python_symbols", "description": "Get all classes, functions, methods in a Python file with line numbers.", "input_schema": {"type": "object", "properties": {"filepath": {"type": "string"}}, "required": ["filepath"]}},
]


def dispatch_tool(name: str, inputs: dict[str, Any]) -> str:
    handlers = {
        "read_file": lambda i: tool_read_file(**i),
        "write_file": lambda i: tool_write_file(**i),
        "edit_file": lambda i: tool_edit_file(**i),
        "bash": lambda i: tool_bash(**i),
        "search_files": lambda i: tool_search_files(**i),
        "list_directory": lambda i: tool_list_directory(**i),
        "run_tests": lambda i: tool_run_tests(**i),
        "git_status": lambda i: tool_git_status(**i),
        "git_diff": lambda i: tool_git_diff(**i),
        "get_python_symbols": lambda i: tool_get_python_symbols(**i),
    }
    fn = handlers.get(name)
    if not fn:
        return f"ERROR: Unknown tool '{name}'"
    try:
        return fn(inputs)
    except TypeError as e:
        return f"ERROR: Wrong parameters for {name}: {e}"
    except Exception as e:
        return f"ERROR in {name}: {e}"
# ─────────────────────────────────────────────────────────────────────────────
# Context Management
# ─────────────────────────────────────────────────────────────────────────────
def estimate_tokens(messages: list) -> int:
    return sum(len(str(m)) for m in messages) // 4


def trim_context(messages: list, max_tokens: int = 90_000) -> list:
    if estimate_tokens(messages) <= max_tokens:
        return messages
    # Keep first message (initial task) + last 8 messages
    if len(messages) > 10:
        trimmed = [messages[0]] + messages[-8:]
        if estimate_tokens(trimmed) <= max_tokens:
            return trimmed
    # Aggressive trim. With 5 or fewer messages, messages[-4:] would overlap
    # messages[0] and duplicate the task, so leave short histories alone.
    if len(messages) > 5:
        return [messages[0]] + messages[-4:]
    return messages
# ─────────────────────────────────────────────────────────────────────────────
# Main Agent
# ─────────────────────────────────────────────────────────────────────────────
def build_system_prompt(repo_path: str, repo_map: str) -> str:
    return f"""You are an expert software engineer. Your task is to modify code in: {repo_path}

## Workflow
1. UNDERSTAND: Read the task. Orient yourself with the repo map below.
2. FIND: Use search_files or bash(grep) to locate relevant code.
3. READ: Read ALL relevant files before making any edits. Never edit blind.
4. EDIT: Use edit_file for surgical changes. Keep diffs minimal.
5. VERIFY: Run tests immediately after every edit. Never skip this.
6. ITERATE: If tests fail, read the error carefully. Fix root cause, not symptom.

## Rules
- Read before write. Always.
- edit_file > write_file for existing files.
- Minimal changes - do not refactor unless asked.
- Run tests after EVERY non-trivial edit.
- If same approach fails 3 times, try something completely different.
- Done only when tests pass.

## Codebase Map
{repo_map}"""
class CodingAgent:
    def __init__(
        self,
        repo_path: str,
        model: str = "claude-opus-4-5",
        max_steps: int = 30,
        max_tokens_per_call: int = 4096,
        context_window_limit: int = 90_000,
        verbose: bool = True,
        stream: bool = False,  # Reserved: streaming is not implemented in this version
    ):
        self.repo_path = str(Path(repo_path).resolve())
        self.model = model
        self.max_steps = max_steps
        self.max_tokens = max_tokens_per_call
        self.context_limit = context_window_limit
        self.verbose = verbose
        self.stream = stream
        self.client = anthropic.Anthropic()
        self.step_count = 0
        self.tool_call_count = 0
        self.consecutive_test_failures = 0

    def run(self, task: str) -> str:
        """Run the agent on a task. Returns the final response text."""
        if self.verbose:
            print(f"\nTask: {task}")
            print(f"Repo: {self.repo_path}")
            print("Building repo map...", end=" ", flush=True)
        # Extract keywords from task for relevance scoring
        keywords = [w for w in task.split() if len(w) > 4][:8]
        repo_map = build_repo_map(self.repo_path, task_keywords=keywords)
        if self.verbose:
            print("done.")
        system = build_system_prompt(self.repo_path, repo_map)
        messages = [
            {
                "role": "user",
                "content": (
                    f"Task: {task}\n\n"
                    f"Repository root: {self.repo_path}\n\n"
                    "Complete this task. Remember to:\n"
                    "1. Read relevant files before editing\n"
                    "2. Run tests after every edit\n"
                    "3. Keep changes minimal"
                ),
            }
        ]
        for step in range(self.max_steps):
            self.step_count = step + 1
            if self.verbose:
                print(f"\n[Step {self.step_count}/{self.max_steps}]", end=" ", flush=True)
            # Trim context if needed
            messages = trim_context(messages, self.context_limit)
            # Call the LLM
            try:
                response = self.client.messages.create(
                    model=self.model,
                    max_tokens=self.max_tokens,
                    system=system,
                    tools=TOOL_SCHEMAS,
                    messages=messages,
                )
            except anthropic.APIError as e:
                if self.verbose:
                    print(f"\nAPI Error: {e}")
                return f"API error: {e}"
            # Print any text
            for block in response.content:
                if hasattr(block, "text") and block.text and self.verbose:
                    preview = block.text[:120].replace("\n", " ")
                    print(f"→ {preview}{'...' if len(block.text) > 120 else ''}")
            messages.append({"role": "assistant", "content": response.content})
            # No tool call requested: the task is complete (end_turn) or the
            # reply was cut off (e.g. max_tokens). Either way, stop here rather
            # than loop with no tool results to send back.
            if response.stop_reason != "tool_use":
                final = next(
                    (b.text for b in response.content if hasattr(b, "text") and b.text),
                    "Task completed.",
                )
                if self.verbose:
                    print(f"\n\nFinal answer:\n{final}")
                return final
            # Execute tools
            tool_results = []
            for block in response.content:
                if block.type != "tool_use":
                    continue
                self.tool_call_count += 1
                name, inputs = block.name, block.input
                if self.verbose:
                    if name == "bash":
                        desc = inputs.get("command", "")[:60]
                    elif name in ("read_file", "edit_file", "write_file"):
                        desc = inputs.get("path", "")
                    elif name == "search_files":
                        desc = inputs.get("pattern", "")[:40]
                    else:
                        desc = str(inputs)[:40]
                    print(f"{name}({desc})", end=" ", flush=True)
                result = dispatch_tool(name, inputs)
                # Track test failures for backtracking. pytest prints its
                # summary last, e.g. "== 1 failed, 2 passed in 0.12s ==".
                if name == "run_tests":
                    output_lines = result.strip().splitlines()
                    summary = output_lines[-1].lower() if output_lines else ""
                    tests_ok = (
                        "passed" in summary
                        and "failed" not in summary
                        and "error" not in summary
                    )
                    if tests_ok:
                        self.consecutive_test_failures = 0
                    else:
                        self.consecutive_test_failures += 1
                        if self.consecutive_test_failures >= 3:
                            result += (
                                "\n\n[AGENT NOTE: 3 consecutive test failures. "
                                "Consider a completely different approach. "
                                "Re-read the failing test from scratch.]"
                            )
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })
            messages.append({"role": "user", "content": tool_results})
        return (
            f"Agent stopped after {self.max_steps} steps. "
            f"({self.tool_call_count} tool calls made) "
            "The task may be incomplete."
        )

    def print_stats(self):
        print(f"\nStats: {self.step_count} steps, {self.tool_call_count} tool calls")
# ─────────────────────────────────────────────────────────────────────────────
# CLI
# ─────────────────────────────────────────────────────────────────────────────
def main():
    parser = argparse.ArgumentParser(
        description="Coding agent - fixes bugs, adds features, runs tests",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python coding_agent.py --repo . --task "Fix the failing tests in test_users.py"
  python coding_agent.py --repo /my/project --task "Add type hints to all functions in utils.py"
  python coding_agent.py --repo . --task "Refactor DatabaseConnection to use connection pooling" --max-steps 40
""",
    )
    parser.add_argument("--repo", required=True, help="Repository root path")
    parser.add_argument("--task", required=True, help="Task description")
    parser.add_argument("--model", default="claude-opus-4-5", help="Model to use")
    parser.add_argument("--max-steps", type=int, default=30, help="Maximum agent steps")
    parser.add_argument("--quiet", action="store_true", help="Suppress step-by-step output")
    args = parser.parse_args()

    if not Path(args.repo).is_dir():
        print(f"ERROR: Repository path does not exist: {args.repo}")
        sys.exit(1)

    agent = CodingAgent(
        repo_path=args.repo,
        model=args.model,
        max_steps=args.max_steps,
        verbose=not args.quiet,
    )
    result = agent.run(args.task)

    if args.quiet:
        print(result)
    else:
        agent.print_stats()


if __name__ == "__main__":
    main()
Extending the Agent
Once you have the 500-line base, extending it is straightforward:
Add LSP Integration
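# Install: pip install pyright
# Then add a tool that runs the pyright CLI and reports type errors.
# (This shells out to the checker rather than speaking the LSP protocol;
# a pylsp-based tool would follow the same shape.)
import json
import subprocess

def tool_check_types(filepath: str) -> str:
    """Run the pyright type checker on a file and summarize its diagnostics."""
    try:
        r = subprocess.run(
            ["pyright", "--outputjson", filepath],
            capture_output=True, text=True, timeout=30,
        )
    except FileNotFoundError:
        return "ERROR: pyright is not installed. Run: pip install pyright"
    except subprocess.TimeoutExpired:
        return f"ERROR: pyright timed out after 30s on {filepath}"
    try:
        data = json.loads(r.stdout)
        errors = data.get("generalDiagnostics", [])
        if not errors:
            return f"No type errors in {filepath}"
        lines = [f"Type errors in {filepath}:"]
        # Cap the report at 10 entries to keep the tool result small
        for e in errors[:10]:
            # pyright line numbers are 0-based; convert to 1-based for the agent
            lines.append(f"  Line {e['range']['start']['line'] + 1}: {e['message']}")
        return "\n".join(lines)
    except (json.JSONDecodeError, KeyError, TypeError):
        # Output was not the expected JSON; pass the raw text through
        return r.stdout or r.stderr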
Add Web Search
import json
import os
import urllib.parse
import urllib.request

def tool_search_docs(query: str) -> str:
    """Search documentation when the agent needs external knowledge."""
    # Use the Brave Search API or similar; requires BRAVE_API_KEY in the environment
    url = f"https://api.search.brave.com/res/v1/web/search?q={urllib.parse.quote(query)}&count=3"
    req = urllib.request.Request(
        url,
        headers={"Accept": "application/json", "X-Subscription-Token": os.environ["BRAVE_API_KEY"]},
    )
    # A timeout keeps a slow search from stalling the agent loop
    with urllib.request.urlopen(req, timeout=15) as response:
        data = json.loads(response.read())
    results = data.get("web", {}).get("results", [])
    summary = "\n\n".join(f"{r['title']}\n{r['url']}\n{r.get('description', '')}" for r in results[:3])
    return summary or "No results found."
Add Git Integration for Safe Edits
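import re
import subprocess
import time

def run_with_git_safety(agent: CodingAgent, task: str) -> str:
    """
    Run an agent on a dedicated git branch.
    Creates the branch before starting and commits the result; on failure,
    discards the agent's uncommitted edits to tracked files and re-raises.
    """
    repo = agent.repo_path
    # Branch names cannot contain spaces or most punctuation, so slugify the task
    slug = re.sub(r"[^a-z0-9]+", "-", task[:30].lower()).strip("-")
    branch = f"agent/{slug}-{int(time.time())}"
    # Create a branch; check=True aborts before the agent runs if git fails
    subprocess.run(["git", "checkout", "-b", branch], cwd=repo, check=True)
    try:
        result = agent.run(task)
        # Commit the changes (the commit exits non-zero if there is nothing to commit)
        subprocess.run(["git", "add", "-A"], cwd=repo, check=True)
        subprocess.run(["git", "commit", "-m", f"Agent: {task[:50]}"], cwd=repo)
        return result
    except Exception:
        # Revert tracked files; files the agent created remain on disk as untracked
        subprocess.run(["git", "checkout", "."], cwd=repo)
        raise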
:::tip Start with the 100-line version
When building a coding agent for a new use case, always start with the minimal 100-line version. Get it working. Identify the specific failure modes in your domain. Then add features - repo map, context trimming, backtracking - only where you have observed the agent failing without them. Don't over-engineer upfront.
:::
:::warning Context trimming is lossy
When you trim the message history, you lose information. The agent may forget files it read earlier or duplicate work it has already done. Always trim conservatively: keep as much history as you can afford. Monitor the agent for repetitive behavior, such as reading the same file multiple times, which is a symptom of over-trimming.
:::
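One way to watch for over-trimming is to count how often the agent reads each file. The sketch below assumes you record every tool call as a `(name, inputs)` tuple next to the `dispatch_tool` call; the `tool_calls` log and the `find_repeated_reads` helper are hypothetical names, not part of the agent above.

```python
from collections import Counter


def find_repeated_reads(tool_calls, threshold=3):
    """Return paths that read_file was called on at least `threshold` times.

    `tool_calls` is a hypothetical log: a list of (tool_name, inputs) tuples
    appended to in the agent loop each time a tool is dispatched.
    """
    counts = Counter(
        inputs.get("path", "")
        for name, inputs in tool_calls
        if name == "read_file"
    )
    # Ignore calls that carried no path; sort for stable output
    return sorted(path for path, n in counts.items() if path and n >= threshold)
```

If this list is non-empty at the end of a run, raising the context limit or keeping more recent messages is a reasonable first thing to try.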
:::danger Never run in production without sandboxing
The agent has a bash tool. Pointed at a real production repository on a real server, an unconstrained bash tool is a liability. Always run in a container with limited filesystem access, no production credentials, no network access to internal services, and a non-root user. Test with a copy of the repository, not the original.
:::
Interview Q&A
Q: What are the most important architecture decisions when building a coding agent from scratch?
A: Five key decisions: (1) Raw API vs framework - start raw to understand the fundamentals, add frameworks later when you have specific needs; (2) Tool selection - minimum viable is read, edit, bash; add search and git for production use; (3) Context strategy - repo map for medium codebases, retrieval/embeddings for very large ones; (4) Edit strategy - search-replace is the right default for 95% of cases; (5) Error handling - informative tool error messages directly determine how well the agent self-corrects. Getting these right is more important than model selection.
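To make decisions (4) and (5) concrete, here is a minimal sketch of a search-replace edit tool whose error messages tell the agent how to recover. The function name `search_replace_edit` and the wording of the messages are illustrative assumptions, not the lesson's `edit_file` implementation.

```python
from pathlib import Path


def search_replace_edit(path: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new` in the file at `path`.

    Every failure returns a message that says what went wrong and what to try
    next, so the model can self-correct on its following step.
    """
    p = Path(path)
    if not p.is_file():
        return f"ERROR: file not found: {path}. Use search_files to locate it."
    if not old:
        return "ERROR: the text to replace is empty. Pass the exact lines to change."
    text = p.read_text()
    count = text.count(old)
    if count == 0:
        return (
            f"ERROR: text not found in {path}. "
            "Re-read the file and copy the text exactly, including whitespace."
        )
    if count > 1:
        return (
            f"ERROR: text matches {count} places in {path}. "
            "Include more surrounding lines so the match is unique."
        )
    p.write_text(text.replace(old, new, 1))
    return f"OK: edited {path}"
```

Refusing ambiguous matches is a deliberate choice here: a wrong edit in the wrong place costs more steps than an error that asks for more context.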
Q: How do you build a repo map and why is it important?
A: A repo map is a compact index of a codebase: file paths plus the class names, function names, and method signatures extracted via AST parsing. It gives the LLM navigational awareness of the full codebase - it knows what exists and where - without reading every file in full. This is critical because most codebases are too large to fit in a context window. A 200-file repo map might cost 5,000 tokens; reading all 200 files might cost 200,000 tokens. The repo map lets the agent identify the 3-4 files it actually needs, read those, and ignore the rest.
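As a rough illustration of the idea, the sketch below builds a repo map for Python files with the standard `ast` module. It is a simplified stand-in, not the lesson's own repo-map code: `build_repo_map` and `_signature` are hypothetical names, and it lists only positional argument names rather than full signatures.

```python
import ast
from pathlib import Path


def _signature(fn) -> str:
    """Render a function node as 'name(arg1, arg2)' using positional args only."""
    return f"{fn.name}({', '.join(a.arg for a in fn.args.args)})"


def build_repo_map(root: str) -> str:
    """Return an indented outline: file paths, top-level functions, classes, methods."""
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    lines = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that do not parse
        lines.append(path.relative_to(root).as_posix())
        for node in tree.body:
            if isinstance(node, func_types):
                lines.append(f"  def {_signature(node)}")
            elif isinstance(node, ast.ClassDef):
                lines.append(f"  class {node.name}")
                for item in node.body:
                    if isinstance(item, func_types):
                        lines.append(f"    def {_signature(item)}")
    return "\n".join(lines)
```

The output is plain text, so it can be dropped straight into the system prompt. For large repositories you would likely also skip directories such as virtual environments and vendored code.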
Q: How should a coding agent handle context window exhaustion?
A: Two main strategies: sliding window and summarization. Sliding window keeps the first message (the original task) and the most recent N messages, dropping everything in between. This is fast and simple but loses intermediate context. Summarization asks a fast model to compress the dropped messages into a brief summary, which is prepended before the recent messages - this preserves more semantic content at higher cost. In practice, start with sliding window. If agents show "forgetting" behavior (re-reading files they already read, rediscovering information), add summarization for the dropped portion.
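A sliding window can be sketched in a few lines. The version below is an illustration, not the `trim_context` function used by the agent: `sliding_window_trim` and `keep_recent` are hypothetical names. It also moves the cut forward to an assistant message, on the assumption that a `tool_result` block must directly follow the assistant message containing the matching `tool_use`.

```python
def sliding_window_trim(messages, keep_recent=6):
    """Keep the first message (the task) plus roughly the last `keep_recent` messages.

    `messages` is a list of {"role": ..., "content": ...} dicts, oldest first.
    """
    if len(messages) <= keep_recent + 1:
        return messages
    cut = len(messages) - keep_recent
    # Start the kept tail at an assistant message so that no tool_result
    # survives without the tool_use request that precedes it.
    while cut < len(messages) and messages[cut]["role"] != "assistant":
        cut += 1
    return [messages[0]] + messages[cut:]
```

Because the cut only ever moves forward, the function may keep slightly fewer than `keep_recent` messages. A summarization variant would insert a summary of the dropped messages between the first message and the tail.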
Q: What makes a good system prompt for a coding agent?
A: A good system prompt specifies the workflow (read → find → edit → verify), the rules (read before write, run tests after every edit, minimal diffs), the failure handling (backtrack after 3 failures, try different approaches), and the codebase context (repo map, repo path). It should not be generic - it should be specific to the coding task. Critically, it must make "run tests" a required step rather than an optional one. Agents that can rationalize skipping tests will skip them; the system prompt must treat test execution as non-negotiable.
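The four parts named above (workflow, rules, failure handling, codebase context) map naturally onto a small template function. The sketch below is one possible shape; `build_system_prompt`, its parameters, and the exact wording are assumptions for illustration rather than the prompt used earlier in this lesson.

```python
def build_system_prompt(repo_path: str, repo_map: str, max_failures: int = 3) -> str:
    """Assemble a system prompt from workflow, rules, failure handling, and context."""
    return f"""You are a coding agent working in the repository at {repo_path}.

Workflow: read the relevant code, find the cause, make the edit, verify with tests.

Rules:
- Read a file before you edit it.
- Run the tests after every edit. This step is required, never optional.
- Keep diffs minimal; do not touch unrelated code.

Failure handling:
- After {max_failures} consecutive test failures, stop and try a different approach.

Repository map:
{repo_map}
"""
```

Keeping the prompt in a function makes it easy to vary per task, for example by naming the project's test command in the rules.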
Q: How do you make a coding agent safe enough to run in an automated pipeline?
A: Defense in depth: (1) Docker container with a non-root user and limited capabilities; (2) Read-only bind mount for everything except the target repository; (3) No production credentials in the environment; (4) Network isolation (no egress except to needed package registries); (5) Command-level blocking for obviously dangerous patterns; (6) Git checkpoint before the agent starts, so you can roll back; (7) Human review of the diff before merging; (8) Timeout limits on all tool calls; (9) Step limit on the agent loop; (10) Logging every tool call for audit. No single layer is sufficient - all of them together create acceptable risk for most use cases.
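Layer (5), command-level blocking, could look something like the sketch below. The pattern list and the `check_command` name are illustrative assumptions; a real deployment would tune the patterns to its environment. Pattern matching of this kind is easy to bypass, which is why it is only one layer among ten.

```python
import re
from typing import Optional

# Hypothetical deny-list: (regex, reason) pairs for obviously dangerous commands
BLOCKED_PATTERNS = [
    (r"\brm\s+(-\w+\s+)*/(\s|$)", "deletes the filesystem root"),
    (r"\bsudo\b", "escalates privileges"),
    (r"\bgit\s+push\b.*--force", "force-pushes over remote history"),
    (r"curl[^|]*\|\s*(ba)?sh\b", "pipes a download into a shell"),
]


def check_command(command: str) -> Optional[str]:
    """Return the reason a command is blocked, or None if no pattern matches."""
    for pattern, reason in BLOCKED_PATTERNS:
        if re.search(pattern, command):
            return reason
    return None
```

A bash tool would call `check_command` before executing and return the reason to the agent as an error message, so the model can pick a safer command instead of retrying the same one.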
