Human Oversight Mechanisms
Reading time: 28 min | Relevance: Principal Engineer, AI Safety, Agent Platform Teams
The Rubber-Stamp Problem
Picture this: your agent is deploying infrastructure. It pauses 40 times asking for approval. By approval 12, the on-call engineer is clicking "Approve" without reading. By approval 30, they have set up an auto-approver script.
You have engineered the worst of both worlds: the friction of human oversight without its safety benefits.
Fully autonomous agents are not yet safe enough for high-stakes tasks. The question is not whether to have human oversight - it is how to design it so humans provide meaningful oversight without becoming a bottleneck. A poorly designed oversight system does not reduce risk. It redistributes it to a human who is now responsible for decisions they cannot properly evaluate.
This lesson is about making oversight real, not theatrical. The engineering problems are: when to interrupt, what to show, how to get a genuine decision, and how to know if your oversight system is actually working.
Why This Exists
Early AI systems had no notion of oversight. They ran to completion or crashed. As AI systems became more capable and agentic, the gap between capability and safety engineering widened.
The 2023 surge in autonomous agents revealed a systemic problem: teams shipped agents that could take irreversible actions - sending emails, deleting records, committing code, spending money - without any human check. When these systems failed, and they did fail, there was no recovery path and no clear accountability.
Anthropic's safety guidelines for agents explicitly call out the need for human oversight, particularly for actions that cannot be reversed, actions affecting parties outside the agent's principal hierarchy, high-confidence but high-consequence actions, and significant resource expenditure.
The research on human-in-the-loop systems from aviation, nuclear plant monitoring, and medical diagnosis points consistently to the same failure mode: when oversight is too frequent or incomprehensible, humans stop providing it. The design challenge is not adding approvals everywhere but adding them precisely where they prevent harm.
The Oversight Spectrum
Human oversight is not binary. It spans a spectrum from fully autonomous to fully manual.
Most well-designed production agents sit at risk-based interruption: autonomous for low-risk actions, pausing for high-risk or ambiguous ones. The right position on this spectrum depends on:
- Task stakes: what is the worst-case consequence of an error?
- Action reversibility: can this be undone in under an hour?
- Agent confidence: is the model certain about what to do?
- Domain maturity: does the agent have a track record in this domain?
- Principal authority: has the user explicitly granted broad or narrow authority?
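One way to read the five factors above is as inputs to a small policy function. The sketch below is hypothetical: the three `OversightMode` values, the `TaskContext` field names, and the decision rules are illustrative assumptions, not a policy prescribed by this lesson.

```python
from dataclasses import dataclass
from enum import Enum


class OversightMode(Enum):
    AUTONOMOUS = "autonomous"   # execute and log
    RISK_BASED = "risk_based"   # pause only for risky or ambiguous actions
    MANUAL = "manual"           # every action waits for a human


@dataclass
class TaskContext:
    worst_case_severe: bool   # task stakes
    reversible: bool          # action reversibility (undo in under an hour)
    agent_confident: bool     # agent confidence
    has_track_record: bool    # domain maturity
    broad_authority: bool     # principal authority


def choose_mode(ctx: TaskContext) -> OversightMode:
    """Illustrative rules only; tune them to your own risk appetite."""
    # Severe and irreversible: keep a human on every step.
    if ctx.worst_case_severe and not ctx.reversible:
        return OversightMode.MANUAL
    # Everything points the safe way: let the agent run and log.
    if (not ctx.worst_case_severe and ctx.reversible and ctx.agent_confident
            and ctx.has_track_record and ctx.broad_authority):
        return OversightMode.AUTONOMOUS
    # Default position for most production agents.
    return OversightMode.RISK_BASED
```

Under these assumed rules, any single doubtful factor pushes a task back to risk-based interruption rather than to full autonomy.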
Action Risk Classification
The first engineering task is classifying actions by risk. A principled taxonomy:
| Risk Level | Action Category | Default Policy | Example |
|---|---|---|---|
| Low | Read, search, compute | Auto-execute and log | Read file, search web, calculate |
| Medium | Write, create, update | Log and notify, configurable pause | Write file, create record, update config |
| High | Delete, send, execute, spend | Require confirmation | Delete record, send email, run script |
| Critical | Broad irreversible actions | Require explicit approval with justification | Drop database, wire transfer, publish post |
The risk level must be computed, not hardcoded. Risk is contextual:
- Writing to `/tmp` is low risk; writing to `/etc/passwd` is critical
- Sending a Slack message is medium risk; sending to an external email list of 10,000 is high
- Spending $10,000 requires explicit confirmation
from enum import Enum
from dataclasses import dataclass, field
from typing import Any, Optional
from datetime import datetime, timezone
import json
import uuid
import asyncio


class RiskLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class RiskAssessment:
    level: RiskLevel
    score: float  # 0.0 to 1.0
    reasons: list[str]
    reversible: bool
    requires_approval: bool


class ActionRiskClassifier:
    """
    Classifies agent actions by risk level using rule-based scoring.

    The classifier produces a risk score between 0.0 and 1.0 by starting
    from a base score per action type and adding context-specific penalties.
    """

    DANGEROUS_PATHS = ["/etc", "/usr", "/bin", "/sbin", "/boot", "/root", "/proc", "/sys"]
    SENSITIVE_EXTENSIONS = [".env", ".pem", ".key", ".crt", ".passwd", ".shadow", ".secret"]

    # Base risk scores per tool type
    BASE_SCORES: dict[str, float] = {
        "read_file": 0.1,
        "search_web": 0.05,
        "calculate": 0.0,
        "list_directory": 0.05,
        "write_file": 0.4,
        "create_record": 0.35,
        "update_record": 0.3,
        "append_file": 0.25,
        "delete_file": 0.7,
        "delete_record": 0.8,
        "send_email": 0.65,
        "execute_command": 0.75,
        "http_post": 0.5,
        "http_delete": 0.8,
        "http_patch": 0.45,
        "spend_money": 0.9,
        "publish": 0.85,
        "deploy": 0.8,
    }
    def classify(self, tool_name: str, tool_args: dict[str, Any]) -> RiskAssessment:
        score = 0.0
        reasons: list[str] = []
        reversible = True

        # Unknown tools fall back to a mid-range base score of 0.5
        base = self.BASE_SCORES.get(tool_name, 0.5)
        score += base
        reasons.append(f"Base action type '{tool_name}' score: {base:.2f}")

        # Context modifiers for filesystem operations
        if tool_name in ("write_file", "delete_file", "append_file"):
            path = str(tool_args.get("path", ""))
            if any(path.startswith(p) for p in self.DANGEROUS_PATHS):
                score += 0.4
                reasons.append(f"Sensitive system path: {path}")
                reversible = False
            if any(path.endswith(ext) for ext in self.SENSITIVE_EXTENSIONS):
                score += 0.2
                reasons.append(f"Sensitive file type detected in: {path}")
            if tool_name == "delete_file":
                reversible = False

        # Context modifiers for email/messaging
        if tool_name == "send_email":
            recipients = tool_args.get("to", [])
            if isinstance(recipients, list):
                count = len(recipients)
            else:
                count = len(str(recipients).split(","))
            if count > 100:
                score += 0.4
                reasons.append(f"Mass email to {count} recipients - cannot be recalled")
                reversible = False
            elif count > 10:
                score += 0.2
                reasons.append(f"Bulk email to {count} recipients")
            elif "external" in str(tool_args.get("to", "")).lower():
                score += 0.1
                reasons.append("Sending to external recipient")

        # Context modifiers for shell execution
        if tool_name == "execute_command":
            cmd = str(tool_args.get("command", ""))
            # Plain case-insensitive substring matching. The two piped entries
            # only match that literal text, not e.g. `curl <url> | bash`.
            dangerous_patterns = [
                ("rm -rf", "Recursive deletion command"),
                ("dd if=", "Disk write command"),
                ("mkfs", "Filesystem format command"),
                ("DROP TABLE", "SQL table drop"),
                ("DROP DATABASE", "SQL database drop"),
                ("chmod 777", "Broad permission grant"),
                ("> /dev", "Device write"),
                ("curl | bash", "Piped remote execution"),
                ("wget | sh", "Piped remote execution"),
            ]
            for pattern, description in dangerous_patterns:
                if pattern.lower() in cmd.lower():
                    score += 0.4
                    reasons.append(f"Dangerous pattern detected: {description}")
                    reversible = False

        # Context modifiers for financial operations
        if tool_name == "spend_money":
            amount = float(tool_args.get("amount", 0))
            if amount > 10000:
                score += 0.5
                reasons.append(f"Very large expenditure: ${amount:,.2f}")
                reversible = False
            elif amount > 1000:
                score += 0.3
                reasons.append(f"Large expenditure: ${amount:,.2f}")
                reversible = False
            elif amount > 100:
                score += 0.15
                reasons.append(f"Moderate expenditure: ${amount:,.2f}")

        # Cap and classify
        score = min(score, 1.0)
        if score < 0.3:
            level = RiskLevel.LOW
        elif score < 0.55:
            level = RiskLevel.MEDIUM
        elif score < 0.8:
            level = RiskLevel.HIGH
        else:
            level = RiskLevel.CRITICAL

        requires_approval = level in (RiskLevel.HIGH, RiskLevel.CRITICAL)

        return RiskAssessment(
            level=level,
            score=round(score, 3),
            reasons=reasons,
            reversible=reversible,
            requires_approval=requires_approval,
        )
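Substring checks such as `"curl | bash"` only fire when the command contains that exact text, so a real command like `curl https://example.com/install.sh | bash` passes unnoticed. A minimal sketch of a regex-based check for piped remote execution follows. The pattern and the `is_piped_remote_exec` helper are illustrative assumptions, not part of the classifier above, and no pattern list can be treated as complete.

```python
import re

# Hypothetical pattern: a curl/wget invocation whose output is piped
# straight into a shell, optionally through sudo.
PIPED_REMOTE_EXEC = re.compile(
    r"\b(?:curl|wget)\b[^|]*\|\s*(?:sudo\s+)?(?:bash|sh|zsh)\b",
    re.IGNORECASE,
)


def is_piped_remote_exec(command: str) -> bool:
    """Return True when a download is piped directly into a shell."""
    return PIPED_REMOTE_EXEC.search(command) is not None


if __name__ == "__main__":
    for cmd in (
        "curl https://example.com/install.sh | bash",
        "wget -qO- https://example.com/x | sudo sh",
        "curl https://example.com | grep foo",
    ):
        print(f"{cmd!r}: {is_piped_remote_exec(cmd)}")
```

Pattern matching on shell strings is easy to evade (variables, aliases, encodings), so a check like this works best as one signal that raises the score, with approval still required for shell execution by default.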
Meaningful vs Rubber-Stamp Oversight
Research on human oversight of automated systems - including aviation autopilot monitoring, nuclear plant control rooms, and medical diagnosis support - consistently shows the same pathology: oversight degrades with frequency and incomprehensibility.
Studies of human-in-the-loop AI systems reveal:
- When approval requests arrive more than once per minute, approval rates approach 100% regardless of content
- When users cannot understand what they are approving, they default to approval 91% of the time (Cummings 2014, MIT CSAIL)
- When the cost of denial is unclear, approval is the path of least resistance
- Alert fatigue is documented as a direct contributor to medical errors in ICU monitoring systems
The design implications for agent approval interfaces:
- Show less, not more: present the single most important fact about why this needs approval
- Make denial easy: one-click deny with auto-suggested alternatives
- Default to safe on timeout: if no response in N seconds, deny or pause
- Explain consequence, not mechanism: "This will permanently delete 3,400 customer records" not the raw SQL query
- Batch where possible: approve a multi-step plan once, not individual steps
Oversight Decision Flowchart
The Approval Request Schema
@dataclass
class ApprovalRequest:
    """
    Everything a human needs to make a genuine approval decision.

    Design principle: show consequences not mechanics.
    The human should understand WHAT will happen, not HOW the code does it.
    """
    request_id: str
    agent_id: str
    task_id: str

    # What the human sees (plain language)
    action_description: str       # "Delete 3,400 customer records from orders table"
    consequence_if_approved: str  # "These records cannot be recovered without a database backup"
    consequence_if_denied: str    # "The cleanup task will fail; records remain as-is"

    # For power users who want to dig in (shown collapsed by default)
    technical_detail: str         # Full JSON tool call

    risk_level: RiskLevel
    risk_reasons: list[str]       # Why this was flagged
    reversible: bool
    created_at: datetime
    timeout_seconds: int = 300    # 5 minutes default
    default_on_timeout: str = "deny"  # NEVER default to approve

    # Context links
    task_summary: str = ""        # What is the agent overall trying to do
    context_url: str = ""         # Link to full task audit trail

    def to_slack_message(self) -> dict:
        """Format for Slack block kit."""
        risk_emoji = {
            RiskLevel.LOW: ":white_check_mark:",
            RiskLevel.MEDIUM: ":large_yellow_circle:",
            RiskLevel.HIGH: ":warning:",
            RiskLevel.CRITICAL: ":rotating_light:",
        }
        return {
            "blocks": [
                {
                    "type": "header",
                    "text": {
                        "type": "plain_text",
                        "text": f"{risk_emoji[self.risk_level]} Agent Action Requires Approval"
                    }
                },
                {
                    "type": "section",
                    "fields": [
                        {"type": "mrkdwn", "text": f"*Risk Level*\n{self.risk_level.value.upper()}"},
                        {"type": "mrkdwn", "text": f"*Reversible*\n{'Yes' if self.reversible else 'NO - cannot be undone'}"},
                    ]
                },
                {
                    "type": "section",
                    "text": {
                        "type": "mrkdwn",
                        "text": (
                            f"*Action:*\n{self.action_description}\n\n"
                            f"*If approved:* {self.consequence_if_approved}\n"
                            f"*If denied:* {self.consequence_if_denied}\n\n"
                            f"*Why flagged:*\n" + "\n".join(f"• {r}" for r in self.risk_reasons) +
                            f"\n\n*Timeout:* {self.timeout_seconds}s - default on timeout: *{self.default_on_timeout}*"
                        )
                    }
                },
                {
                    "type": "actions",
                    "elements": [
                        {
                            "type": "button",
                            "text": {"type": "plain_text", "text": "Approve"},
                            "style": "primary",
                            "value": f"approve:{self.request_id}",
                            "action_id": "approve_action",
                            "confirm": {
                                "title": {"type": "plain_text", "text": "Confirm Approval"},
                                "text": {
                                    "type": "mrkdwn",
                                    "text": f"You are approving: *{self.action_description}*\n\n"
                                            f"Consequence: {self.consequence_if_approved}"
                                },
                                "confirm": {"type": "plain_text", "text": "Yes, Approve"},
                                "deny": {"type": "plain_text", "text": "Cancel"},
                            }
                        },
                        {
                            "type": "button",
                            "text": {"type": "plain_text", "text": "Deny"},
                            "style": "danger",
                            "value": f"deny:{self.request_id}",
                            "action_id": "deny_action",
                        },
                    ]
                }
            ]
        }
Full Async Approval System
import anthropic


class HumanOversightSystem:
    """
    Production human oversight system with:
    - Risk-based action classification
    - Async approval queue with timeout
    - Slack notification integration
    - Immutable audit trail
    - Agent state persistence for resume-after-approval
    - Graduated autonomy tracking
    """

    def __init__(self, slack_webhook_url: Optional[str] = None):
        self.classifier = ActionRiskClassifier()
        self.pending_approvals: dict[str, ApprovalRequest] = {}
        self.approval_results: dict[str, bool] = {}
        self.audit_log: list[dict] = []
        self.slack_webhook_url = slack_webhook_url
        self.autonomy = GraduatedAutonomyManager()
        self.client = anthropic.Anthropic()
    async def check_and_execute(
        self,
        tool_name: str,
        tool_args: dict,
        agent_id: str,
        task_id: str,
        executor_fn,
    ) -> dict:
        """
        Route an action through the oversight system.

        Returns dict with keys:
            status: "executed" | "skipped" | "error"
            result: tool output or explanation string
        """
        assessment = self.classifier.classify(tool_name, tool_args)

        # Graduated autonomy can waive approval for HIGH-risk actions once the
        # domain's track record is strong enough; CRITICAL is never waived.
        requires_approval = assessment.requires_approval and \
            self.autonomy.requires_approval(tool_name, assessment.level)

        self._log_audit("action_proposed", {
            "agent_id": agent_id,
            "task_id": task_id,
            "tool_name": tool_name,
            "tool_args": tool_args,
            "risk_level": assessment.level.value,
            "risk_score": assessment.score,
            "requires_approval": requires_approval,
        })

        if not requires_approval:
            try:
                result = await executor_fn(tool_name, tool_args)
                self.autonomy.record_success(tool_name)
                self._log_audit("action_auto_executed", {
                    "task_id": task_id,
                    "tool_name": tool_name,
                    "risk_level": assessment.level.value,
                })
                return {"status": "executed", "result": result}
            except Exception as e:
                self.autonomy.record_incident(tool_name, "minor")
                self._log_audit("action_failed", {
                    "task_id": task_id,
                    "tool_name": tool_name,
                    "error": str(e),
                })
                return {"status": "error", "result": f"Tool execution failed: {e}"}

        # High/critical: queue approval request
        approval = self._build_approval_request(
            tool_name, tool_args, assessment, agent_id, task_id
        )
        self.pending_approvals[approval.request_id] = approval

        if self.slack_webhook_url:
            await self._send_slack_notification(approval)

        self._log_audit("approval_requested", {
            "request_id": approval.request_id,
            "task_id": task_id,
            "tool_name": tool_name,
            "risk_level": assessment.level.value,
            "timeout_seconds": approval.timeout_seconds,
        })

        approved = await self._wait_for_approval(approval)

        if approved:
            try:
                result = await executor_fn(tool_name, tool_args)
                self.autonomy.record_success(tool_name)
                self._log_audit("action_approved_executed", {
                    "request_id": approval.request_id,
                    "task_id": task_id,
                    "tool_name": tool_name,
                })
                return {"status": "executed", "result": result}
            except Exception as e:
                self.autonomy.record_incident(tool_name, "medium")
                # Record the failure so the audit trail has no gap here
                self._log_audit("action_failed", {
                    "request_id": approval.request_id,
                    "task_id": task_id,
                    "tool_name": tool_name,
                    "error": str(e),
                })
                return {"status": "error", "result": f"Approved but execution failed: {e}"}
        else:
            self._log_audit("action_denied", {
                "request_id": approval.request_id,
                "task_id": task_id,
                "tool_name": tool_name,
            })
            return {
                "status": "skipped",
                "result": (
                    f"Action '{tool_name}' was not approved. "
                    "Please proceed with an alternative approach or ask the user for guidance."
                ),
            }
    def _build_approval_request(
        self,
        tool_name: str,
        tool_args: dict,
        assessment: RiskAssessment,
        agent_id: str,
        task_id: str,
    ) -> ApprovalRequest:
        # Plain-language descriptions for common tools
        descriptions = {
            "delete_file": f"Permanently delete file: {tool_args.get('path', 'unknown')}",
            "send_email": (
                f"Send email to {tool_args.get('to', 'unknown')} "
                f"with subject: '{tool_args.get('subject', '')[:60]}'"
            ),
            "execute_command": f"Execute shell command: `{tool_args.get('command', '')[:100]}`",
            "delete_record": (
                f"Delete {tool_args.get('count', 'unknown')} records "
                f"from {tool_args.get('table', 'unknown table')}"
            ),
            "spend_money": (
                f"Charge ${tool_args.get('amount', 0):,.2f} "
                f"to {tool_args.get('payment_method', 'configured payment method')}"
            ),
            "deploy": f"Deploy {tool_args.get('service', 'unknown service')} to {tool_args.get('environment', 'production')}",
            "publish": f"Publish content: '{str(tool_args.get('title', ''))[:60]}'",
        }
        action_description = descriptions.get(
            tool_name,
            f"Execute {tool_name}: {json.dumps(tool_args)[:120]}"
        )

        consequence_approved = (
            "This action CANNOT be undone. Approve only if you are certain."
            if not assessment.reversible
            else "This action can be reversed if needed."
        )

        return ApprovalRequest(
            request_id=str(uuid.uuid4()),
            agent_id=agent_id,
            task_id=task_id,
            action_description=action_description,
            consequence_if_approved=consequence_approved,
            consequence_if_denied="The agent will skip this action. The overall task may fail or need a different approach.",
            technical_detail=json.dumps({"tool": tool_name, "args": tool_args}, indent=2),
            risk_level=assessment.level,
            risk_reasons=assessment.reasons,
            reversible=assessment.reversible,
            created_at=datetime.now(timezone.utc),
            timeout_seconds=300 if assessment.level == RiskLevel.HIGH else 600,
            default_on_timeout="deny",
        )
    async def _send_slack_notification(self, approval: ApprovalRequest):
        """Post to Slack using incoming webhook."""
        try:
            import aiohttp
            async with aiohttp.ClientSession() as session:
                await session.post(
                    self.slack_webhook_url,
                    json=approval.to_slack_message(),
                    timeout=aiohttp.ClientTimeout(total=10),
                )
        except Exception as e:
            # Notification failure must NOT block the approval flow
            self._log_audit("slack_notification_failed", {"error": str(e)})

    async def _wait_for_approval(self, approval: ApprovalRequest) -> bool:
        """Poll for human response until timeout, then apply default action."""
        # get_running_loop() is the supported way to reach the loop from a coroutine
        loop = asyncio.get_running_loop()
        deadline = loop.time() + approval.timeout_seconds
        while loop.time() < deadline:
            if approval.request_id in self.approval_results:
                result = self.approval_results.pop(approval.request_id)
                self.pending_approvals.pop(approval.request_id, None)
                return result
            await asyncio.sleep(2)

        # Timeout expired
        self.pending_approvals.pop(approval.request_id, None)
        self._log_audit("approval_timeout", {
            "request_id": approval.request_id,
            "default_action": approval.default_on_timeout,
            "timeout_seconds": approval.timeout_seconds,
        })
        return approval.default_on_timeout == "approve"
    def submit_human_decision(
        self,
        request_id: str,
        approved: bool,
        approver_id: str = "unknown",
        reason: str = "",
    ):
        """
        Called by your webhook handler (Slack, web UI, API) when a human responds.
        This is the bridge between the async approval queue and human interfaces.
        """
        self.approval_results[request_id] = approved
        self._log_audit("human_decision_recorded", {
            "request_id": request_id,
            "approved": approved,
            "approver_id": approver_id,
            "reason": reason,
        })

    def _log_audit(self, event_type: str, data: dict):
        """Append immutable audit entry."""
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "event": event_type,
            **data,
        }
        self.audit_log.append(entry)
        # In production: also write to append-only log store (S3, CloudWatch, Loki)
        print(f"[AUDIT] {event_type}: {json.dumps({k: v for k, v in data.items() if k != 'tool_args'})}")

    def get_audit_trail(self, task_id: str) -> list[dict]:
        return [e for e in self.audit_log if e.get("task_id") == task_id]

    def get_oversight_stats(self) -> dict:
        """Return statistics to monitor oversight health."""
        total = len(self.audit_log)
        approvals_requested = sum(1 for e in self.audit_log if e["event"] == "approval_requested")
        approvals_approved = sum(1 for e in self.audit_log if e["event"] == "action_approved_executed")
        timeouts = sum(1 for e in self.audit_log if e["event"] == "approval_timeout")

        approval_rate = (
            approvals_approved / approvals_requested
            if approvals_requested > 0 else None
        )

        return {
            "total_events": total,
            "approvals_requested": approvals_requested,
            "approvals_approved": approvals_approved,
            "approval_rate": f"{approval_rate:.0%}" if approval_rate is not None else "N/A",
            "timeouts": timeouts,
            "autonomy_status": self.autonomy.get_status_report(),
            "warning": (
                "HIGH APPROVAL RATE - possible rubber-stamping. Review interface design."
                if approval_rate and approval_rate > 0.85
                else "OK"
            ),
        }
Graduated Autonomy
@dataclass
class DomainTrackRecord:
    """Tracks agent performance in a specific action domain."""
    domain: str
    total_actions: int = 0
    successful_actions: int = 0
    human_overrides: int = 0
    last_incident: Optional[datetime] = None
    autonomy_level: float = 0.0  # 0.0 = full oversight, 1.0 = full autonomy


class GraduatedAutonomyManager:
    """
    Systematically expands what the agent can do without approval
    as it builds a track record of safe, accurate actions in each domain.

    Design rationale: a file-deletion agent that has correctly deleted
    10,000 files without any incident has earned higher autonomy than
    one in its first hour of operation. Static approval requirements
    don't account for this.
    """

    DOMAIN_MAP: dict[str, str] = {
        "read_file": "file_read", "list_directory": "file_read",
        "write_file": "file_write", "append_file": "file_write",
        "delete_file": "file_delete",
        "send_email": "email", "send_slack": "messaging",
        "execute_command": "shell", "deploy": "deployment",
        "http_post": "http_write", "http_patch": "http_write",
        "http_delete": "http_delete",
        "delete_record": "db_delete",
        "spend_money": "financial",
    }

    # Minimum autonomy required to skip approval at each risk level
    APPROVAL_THRESHOLDS: dict[RiskLevel, float] = {
        RiskLevel.LOW: 0.0,       # Never requires approval
        RiskLevel.MEDIUM: 0.5,    # Auto-approve when autonomy reaches 50%
        RiskLevel.HIGH: 0.85,     # Auto-approve when autonomy reaches 85%
        RiskLevel.CRITICAL: 1.1,  # Always requires approval (threshold impossible to reach)
    }

    def __init__(self):
        self.records: dict[str, DomainTrackRecord] = {}

    def _get_record(self, tool_name: str) -> DomainTrackRecord:
        domain = self.DOMAIN_MAP.get(tool_name, "general")
        if domain not in self.records:
            self.records[domain] = DomainTrackRecord(domain=domain)
        return self.records[domain]

    def requires_approval(self, tool_name: str, base_risk: RiskLevel) -> bool:
        record = self._get_record(tool_name)
        threshold = self.APPROVAL_THRESHOLDS[base_risk]
        return record.autonomy_level < threshold

    def record_success(self, tool_name: str):
        """Increments autonomy using asymptotic growth toward 1.0."""
        record = self._get_record(tool_name)
        record.total_actions += 1
        record.successful_actions += 1
        # Diminishing returns: each success contributes less as autonomy approaches 1.0
        increment = 0.01 * (1.0 - record.autonomy_level)
        record.autonomy_level = min(1.0, record.autonomy_level + increment)

    def record_incident(self, tool_name: str, severity: str = "medium"):
        """Reduces autonomy after an incident. Severity determines how far it resets."""
        record = self._get_record(tool_name)
        record.human_overrides += 1
        record.last_incident = datetime.now(timezone.utc)
        reductions = {"minor": 0.15, "medium": 0.4, "major": 0.8, "critical": 1.0}
        reduction = reductions.get(severity, 0.4)
        record.autonomy_level = max(0.0, record.autonomy_level - reduction)

    def get_status_report(self) -> dict:
        return {
            domain: {
                "autonomy": f"{record.autonomy_level:.0%}",
                "total_actions": record.total_actions,
                "success_rate": (
                    f"{record.successful_actions / record.total_actions:.0%}"
                    if record.total_actions > 0 else "N/A"
                ),
                "incidents": record.human_overrides,
                "last_incident": (
                    record.last_incident.isoformat()
                    if record.last_incident else "None"
                ),
            }
            for domain, record in self.records.items()
        }
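It helps to know how quickly the update rule in `record_success` crosses the approval thresholds. Each success multiplies the remaining gap `1 - autonomy` by 0.99, so after `n` consecutive successes from zero the level is `1 - 0.99**n`. The sketch below solves for `n` and cross-checks the closed form against a direct simulation. The helper names `successes_needed` and `simulate` are illustrative, not part of the manager.

```python
import math

GROWTH_RATE = 0.01  # the 0.01 factor used in record_success


def successes_needed(threshold: float, rate: float = GROWTH_RATE) -> int:
    """Smallest n with 1 - (1 - rate)**n >= threshold, for 0 < threshold < 1."""
    return math.ceil(math.log(1.0 - threshold) / math.log(1.0 - rate))


def simulate(n: int, rate: float = GROWTH_RATE) -> float:
    """Apply the same update rule n times, starting from zero autonomy."""
    level = 0.0
    for _ in range(n):
        level = min(1.0, level + rate * (1.0 - level))
    return level


if __name__ == "__main__":
    for label, threshold in (("MEDIUM", 0.5), ("HIGH", 0.85)):
        n = successes_needed(threshold)
        print(f"{label}: {n} consecutive successes, level {simulate(n):.4f}")
```

With these constants, 69 clean actions reach the 0.5 threshold and 189 reach the 0.85 threshold. A single "minor" incident at that point subtracts 0.15 and costs roughly another 69 successes to recover, so incidents are far more expensive than successes are rewarding. Whether 189 actions is enough evidence to waive approval for high-risk actions is a calibration decision for each domain.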
Full Agent Integration
async def run_overseen_agent(
    task: str,
    agent_id: str = "agent-001",
    slack_webhook_url: Optional[str] = None,
):
    """
    Complete agent that routes every action through the human oversight system.

    Demonstrates: risk classification, approval requests, graduated autonomy,
    agent state persistence, and audit trail.
    """
    client = anthropic.Anthropic()
    oversight = HumanOversightSystem(slack_webhook_url=slack_webhook_url)
    task_id = str(uuid.uuid4())

    tools = [
        {
            "name": "read_file",
            "description": "Read the contents of a file",
            "input_schema": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Absolute file path to read"}
                },
                "required": ["path"],
            },
        },
        {
            "name": "delete_file",
            "description": "Permanently delete a file from the filesystem",
            "input_schema": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Absolute file path to delete"}
                },
                "required": ["path"],
            },
        },
        {
            "name": "send_email",
            "description": "Send an email to one or more recipients",
            "input_schema": {
                "type": "object",
                "properties": {
                    "to": {"type": "string", "description": "Recipient email address"},
                    "subject": {"type": "string", "description": "Email subject line"},
                    "body": {"type": "string", "description": "Email body (plain text)"},
                },
                "required": ["to", "subject", "body"],
            },
        },
        {
            "name": "execute_command",
            "description": "Execute a shell command and return stdout/stderr",
            "input_schema": {
                "type": "object",
                "properties": {
                    "command": {"type": "string", "description": "Shell command to execute"},
                },
                "required": ["command"],
            },
        },
    ]
    async def execute_tool(tool_name: str, tool_args: dict) -> str:
        """Simulated tool executor - replace with real implementations in production."""
        print(f"  [EXECUTING] {tool_name}({json.dumps(tool_args)[:80]})")
        await asyncio.sleep(0.1)
        return f"Success: {tool_name} completed with args {tool_args}"

    messages = [{"role": "user", "content": task}]
    print(f"\n[AGENT] Task {task_id} started: {task[:80]}")

    for step in range(25):
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,
            system=(
                "You are a file management and operations agent. You can read files, "
                "delete files, send emails, and execute commands. "
                "Always explain what you are about to do before doing it. "
                "For irreversible actions, state clearly why you believe this is the right action."
            ),
            tools=tools,
            messages=messages,
        )

        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            print("\n[AGENT] Task completed.")
            break

        if response.stop_reason != "tool_use":
            break

        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue

            print(f"\n[AGENT step {step+1}] Requesting: {block.name}")
            outcome = await oversight.check_and_execute(
                tool_name=block.name,
                tool_args=block.input,
                agent_id=agent_id,
                task_id=task_id,
                executor_fn=execute_tool,
            )
            print(f"  Status: {outcome['status']}")
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": outcome["result"],
            })

        messages.append({"role": "user", "content": tool_results})

    # Final oversight health check
    stats = oversight.get_oversight_stats()
    print("\n[OVERSIGHT STATS]")
    print(json.dumps(stats, indent=2))

    audit = oversight.get_audit_trail(task_id)
    print(f"\n[AUDIT] {len(audit)} events recorded for task {task_id}")
    return audit


if __name__ == "__main__":
    asyncio.run(run_overseen_agent(
        "Clean up log files older than 30 days in /var/log, "
    ))
Accountability Chains
When an agent takes an action that a human approved, the accountability question has legal and organizational implications.
The chain of responsibility:
Agent action → Approved by Human X → Interface designed by Engineer Y → Deployed by Org Z
- The approving human bears responsibility for what they approved - but only if the interface gave them genuine ability to understand the decision. Approving a cryptic JSON blob does not confer accountability.
- The system designer is responsible for whether the approval interface provided meaningful information. If the human could not reasonably understand what they were approving, the designer retains accountability.
- The deploying organization is responsible for whether the oversight system was calibrated appropriately for the risk level of the actions taken.
Practical implication: your approval interface is a legal artefact. In any incident investigation, regulators and courts will ask: "Did the human who approved this have sufficient information to make a real decision?" Design your interfaces with that question in mind.
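Answering that question later requires a record of what the approver actually saw, not just the fact that they clicked Approve. A minimal sketch, assuming the rendered approval message is a JSON-serializable dict: store the exact payload together with a SHA-256 digest alongside the decision. The `snapshot_decision` function and its field names are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def snapshot_decision(rendered_message: dict, approver_id: str, approved: bool) -> dict:
    """Build an audit entry that pins down exactly what the approver saw."""
    # Canonical JSON (sorted keys, fixed separators) so identical content
    # always produces the same digest.
    canonical = json.dumps(rendered_message, sort_keys=True, separators=(",", ":"))
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": "approval_interface_snapshot",
        "approver_id": approver_id,
        "approved": approved,
        "interface_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "interface_payload": canonical,
    }


if __name__ == "__main__":
    entry = snapshot_decision(
        {"action": "Delete 3,400 customer records", "reversible": False},
        approver_id="alice",
        approved=True,
    )
    print(entry["interface_sha256"])
```

The digest makes later tampering with the stored payload detectable, provided the digest itself is written to an append-only store.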
Monitoring Your Oversight System
def compute_oversight_health(oversight: HumanOversightSystem) -> dict:
    """
    Monthly oversight health check.
    Run this after sampling 50 recent approvals and reviewing them manually.

    Key metrics to watch:
    - approval_rate > 0.85 on HIGH/CRITICAL = probable rubber-stamping
    - avg_decision_time_seconds < 10 = human is not reading carefully
    - timeout_rate > 0.20 = humans aren't being reached
    """
    stats = oversight.get_oversight_stats()

    issues = []
    if stats.get("warning") != "OK":
        issues.append(stats["warning"])

    if stats.get("timeouts", 0) > 0.2 * stats.get("approvals_requested", 1):
        issues.append(
            "More than 20% of approvals are timing out. "
            "Improve notification channels or reduce what requires approval."
        )

    return {
        "stats": stats,
        "issues": issues,
        "health": "degraded" if issues else "healthy",
    }
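The docstring lists decision time as a key metric, but the function does not compute it. A minimal sketch of one way to derive it from the audit log: pair each `approval_requested` entry with its `human_decision_recorded` entry by `request_id` and subtract the ISO timestamps. The helper names and the sample entries are illustrative assumptions. The 10-second cut-off mirrors the docstring.

```python
from datetime import datetime
from typing import Optional


def decision_latencies(audit_log: list[dict]) -> list[float]:
    """Seconds between each approval request and its recorded human decision."""
    requested = {
        e["request_id"]: datetime.fromisoformat(e["timestamp"])
        for e in audit_log
        if e.get("event") == "approval_requested"
    }
    latencies = []
    for e in audit_log:
        if e.get("event") != "human_decision_recorded":
            continue
        start = requested.get(e.get("request_id"))
        if start is None:
            continue  # decision without a matching request; skip it
        latencies.append((datetime.fromisoformat(e["timestamp"]) - start).total_seconds())
    return latencies


def fast_decision_rate(audit_log: list[dict], threshold_seconds: float = 10.0) -> Optional[float]:
    """Fraction of decisions made faster than the threshold, or None if no data."""
    latencies = decision_latencies(audit_log)
    if not latencies:
        return None
    return sum(1 for s in latencies if s < threshold_seconds) / len(latencies)


# Made-up sample entries in the same shape _log_audit produces
SAMPLE_LOG = [
    {"timestamp": "2025-01-01T12:00:00+00:00", "event": "approval_requested", "request_id": "a"},
    {"timestamp": "2025-01-01T12:00:04+00:00", "event": "human_decision_recorded", "request_id": "a"},
    {"timestamp": "2025-01-01T12:01:00+00:00", "event": "approval_requested", "request_id": "b"},
    {"timestamp": "2025-01-01T12:02:35+00:00", "event": "human_decision_recorded", "request_id": "b"},
]

if __name__ == "__main__":
    print(decision_latencies(SAMPLE_LOG), fast_decision_rate(SAMPLE_LOG))
```

A high fast-decision rate on high-risk requests is a stronger rubber-stamp signal than approval rate alone, because a well-calibrated system can legitimately have a high approval rate.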
Production Notes
:::warning Oversight Systems Must Be Actively Monitored
An oversight system that is never reviewed degrades quickly. Build a monthly process: sample 50 approval decisions, review whether they were understood, and measure approval rate by risk level. An approval rate above 85% on high-risk actions is a signal your system is creating rubber stamps.
:::

:::danger Never Default to Approve on Timeout
The only safe default when a human does not respond in time is to deny the action and log it. Any system that approves by default on timeout becomes contingent on your alerting infrastructure - which will eventually fail.
:::

:::tip Batch Approval Reduces Fatigue
Instead of approving each step, present a multi-step plan for a single approval. "I plan to: (1) list all log files, (2) delete those older than 30 days, (3) email a summary. Approve this plan?" This dramatically reduces approval frequency while maintaining real oversight of the intent.
:::
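Plan-level approval only preserves oversight if the agent is then held to the plan. A minimal sketch of that check, assuming each step is identified by a tool name and a target string: the approved plan authorizes steps strictly in order, and any deviation returns `False` so the caller can fall back to a fresh approval. `PlanStep` and `ApprovedPlan` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanStep:
    tool_name: str
    target: str


@dataclass
class ApprovedPlan:
    plan_id: str
    steps: list[PlanStep]
    next_index: int = 0  # position of the next step the agent may execute

    def authorize(self, tool_name: str, target: str) -> bool:
        """Allow an action only if it is exactly the next approved step."""
        if self.next_index >= len(self.steps):
            return False  # plan exhausted: anything further needs new approval
        expected = self.steps[self.next_index]
        if (tool_name, target) != (expected.tool_name, expected.target):
            return False  # deviation from the approved plan
        self.next_index += 1
        return True


if __name__ == "__main__":
    plan = ApprovedPlan("plan-1", [
        PlanStep("list_directory", "/var/log"),
        PlanStep("delete_file", "/var/log/app.log.1"),
    ])
    print(plan.authorize("list_directory", "/var/log"))   # matches step 1
    print(plan.authorize("delete_file", "/etc/passwd"))   # not in the plan
```

Exact matching is deliberately strict. A real system would likely need looser matching for targets that are only known at run time, for example the result of the listing step, and that loosening is where plan approval can quietly turn back into a rubber stamp.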
Testing oversight systems:
import pytest


def test_critical_action_always_requires_approval():
    classifier = ActionRiskClassifier()
    assessment = classifier.classify(
        "execute_command",
        {"command": "rm -rf /var/data/production"}
    )
    assert assessment.level == RiskLevel.CRITICAL
    assert assessment.requires_approval is True
    assert assessment.reversible is False


def test_read_file_never_requires_approval():
    classifier = ActionRiskClassifier()
    assessment = classifier.classify("read_file", {"path": "/home/user/notes.txt"})
    assert assessment.level == RiskLevel.LOW
    assert assessment.requires_approval is False


def test_timeout_defaults_to_deny():
    """Verify that a request built with default_on_timeout='deny' keeps that fail-closed default.

    This checks the configured field only; it does not exercise the wait loop.
    """
    approval = ApprovalRequest(
        request_id="test-001",
        agent_id="agent-001",
        task_id="task-001",
        action_description="Test action",
        consequence_if_approved="No real consequence",
        consequence_if_denied="Test fails",
        technical_detail="{}",
        risk_level=RiskLevel.HIGH,
        risk_reasons=[],
        reversible=True,
        created_at=datetime.now(timezone.utc),
        timeout_seconds=0,
        default_on_timeout="deny",
    )
    assert approval.default_on_timeout == "deny"


def test_graduated_autonomy_resets_on_incident():
    manager = GraduatedAutonomyManager()
    for _ in range(100):
        manager.record_success("delete_file")

    initial_autonomy = manager._get_record("delete_file").autonomy_level
    assert initial_autonomy > 0.0

    manager.record_incident("delete_file", severity="major")
    post_incident_autonomy = manager._get_record("delete_file").autonomy_level
    assert post_incident_autonomy < initial_autonomy
Interview Questions
Q: What is oversight fatigue and how do you engineer against it?
A: Oversight fatigue occurs when humans receive approval requests so frequently that they approve without genuine evaluation - the opposite of meaningful oversight. Engineering mitigations: (1) risk-based interruption - only pause for genuinely high-risk or irreversible actions, not every step; (2) approval interface design - show plain-language consequences, not raw JSON; (3) graduated autonomy - as the agent builds a track record, reduce what requires approval; (4) batch planning approval - approve a 10-step plan once rather than each step; (5) monitoring approval rates - if over 85% of high-risk approvals are approved instantly, the system is creating rubber stamps.
Q: How do you design an approval interface that enables genuine human understanding rather than mechanical clicking?
A: The key principles: (1) show consequences not mechanisms - "permanently delete 3,400 customer records" not the SQL query; (2) make reversibility explicit - "this cannot be undone" is more informative than any technical detail; (3) pre-compute the denial consequence - the human needs to know what happens if they say no; (4) safe default on timeout - if no response, deny; (5) single focal decision - never show multiple simultaneous approvals. Research from aviation and nuclear operations shows that when operators cannot understand what they are approving, they approve everything, which produces worse outcomes than no oversight.
Q: How does graduated autonomy work and why is it important?
A: Graduated autonomy systematically reduces required oversight as an agent builds a track record of correct, safe actions in a specific domain. A file-deletion agent that has correctly deleted 10,000 files without any incident has earned more autonomy than one in its first hour of operation. Implementation: track successes and incidents per action domain, increment an autonomy score on each success using asymptotic growth, reduce it on incidents in proportion to severity, and skip approval only when the score reaches the threshold for the action's risk level. This matters because static approval requirements ignore the evidence an agent accumulates about its reliability - and unnecessarily frequent approvals create the rubber-stamp problem.
Q: When an agent takes a human-approved action that causes harm, who is responsible?
A: Accountability is layered. The approving human bears responsibility for what they explicitly approved, but only if the interface gave them a genuine ability to understand the decision - a cryptic JSON blob does not transfer accountability. The system designer is responsible for whether the interface enabled genuine understanding. The deploying organization is responsible for whether the oversight system was calibrated appropriately for the risk. Most regulators will evaluate whether the oversight system was "reasonable given the risk" - which requires matching oversight depth to action severity, not applying a uniform policy to all actions.
Q: What should an immutable audit trail for agent actions contain?
A: A complete audit trail needs: timestamp (UTC), agent and task identifiers, the full tool call (name and arguments), risk assessment (level, score, reasons), whether approval was required and why, who approved (if applicable), what the approval interface showed at decision time, the outcome (success, failure, skipped), and any result summary. The trail serves three purposes: (1) debugging - exactly reproduce what happened when something fails; (2) accountability - establish the chain of decisions for any harm; (3) improvement - identify patterns in what gets approved and denied to calibrate the system. Store it in an append-only log store and retain for the duration required by your industry regulations.
Q: How do you handle the case where no human is available to approve a high-risk action?
A: The agent should: (1) check if there is an async approval channel - Slack, email, web dashboard - and use it even if the human is not actively monitoring; (2) set a reasonable timeout (5 to 10 minutes) with "deny" as the default action; (3) continue executing safe subtasks while waiting for approval of the risky one where possible; (4) when timeout expires with no response, skip the action and log it clearly for later review; (5) never auto-approve on timeout - this would make the entire oversight system contingent on your alerting infrastructure, which will fail. The agent should also report back to the user: "I paused waiting for approval of a high-risk action. Your administrator has been notified. The task will resume when approved."
