Procedural Memory and Learned Skills
The Expert Effect
After the tenth deployment script, you stop looking things up. After the twentieth JWT implementation, you no longer have to reason through each step. The procedure is internalized. You execute it with minimal cognitive load, adjusting only the parameters that differ each time.
This is procedural memory - and it is the most economically valuable form of memory in practical AI systems.
An agent without procedural memory reasons from scratch every time it encounters a task type it has handled before. It reads documentation again. It generates each step through expensive reasoning. It makes the same mistakes it has already made before.
An agent with procedural memory recognizes "this is a Docker deployment task," retrieves the stored deployment workflow, fills in the specific parameters for this task, and executes - skipping the expensive reasoning that has already been done and validated.
The cost reduction is dramatic. The reliability improvement is dramatic. And unlike fine-tuning (which bakes procedural knowledge into model weights), procedural memory stores skills externally where they can be inspected, updated, and composed without retraining.
:::tip 🎮 Interactive Playground Visualize this concept: Try the Procedural Memory & Agent Skills demo on the EngineersOfAI Playground - no code required. :::
Why Procedural Memory Matters for Production Agents
The argument becomes concrete when you look at what happens without it.
Without procedural memory: An agent tasked with "deploy the API service" reasons through steps from scratch. It might use different tool orderings, forget to wait for health checks, miss the environment variable validation step. It takes the same thinking time for the hundredth deployment as for the first.
With procedural memory: The agent retrieves the stored "API service deployment" skill. It gets the correct 8-step sequence that includes health check waiting and env var validation - validated by 99 successful previous deployments. It fills in the service name and registry URL. It executes in a fraction of the reasoning time.
The pattern generalizes. Every recurring task type that an agent handles is a candidate for procedural memory:
- Debugging a specific error type
- Setting up a particular framework
- Following a company's specific release process
- Generating reports in a specific format
- Handling escalation workflows in customer support
The more specialized and repeatable the task, the more value procedural memory provides.
Procedural Memory in Cognitive Science
In human cognitive science, procedural memory is implicit - you cannot easily verbalize the rules. Try to explain exactly how you balance when riding a bike. You cannot. You just do it.
For AI agents, the analogous concept is different: stored but retrievable explicit procedures. Agent procedural memory is more like the "standard operating procedures" a hospital writes down than the implicit riding-a-bike knowledge. Explicit, inspectable, updatable.
Anderson's ACT-R theory (1983): Procedural knowledge represented as production rules: IF condition THEN action. An agent skill is essentially a production rule with a more complex action sequence.
SOAR architecture (1987): Used procedural "operators" to decompose tasks into sub-tasks. The insight: procedures are not monolithic but compositional - complex skills are built from primitive skills.
ReAct (Yao et al., 2022): Demonstrated that LLMs can execute interleaved reasoning and action sequences. Procedural memory provides the scaffold for reliable ReAct execution - the stored sequence tells the model which actions to take, reducing the reasoning burden.
Skill Schema Design
A well-designed skill schema enables precise retrieval, reliable execution, and meaningful update tracking.
from dataclasses import dataclass, field
import time
import uuid
from typing import Optional
@dataclass
class Skill:
"""
A reusable action sequence stored in procedural memory.
A skill captures:
- When to use it (name, description, preconditions)
- How to execute it (action_sequence)
- How well it works (success/failure tracking)
- When it was last validated (last_used)
"""
# Identity
id: str = field(default_factory=lambda: str(uuid.uuid4()))
name: str = "" # Short identifier (e.g., "docker_ecr_deploy")
description: str = "" # When to use this skill (for matching)
category: str = "general" # deployment | debugging | setup | analysis
# Execution
preconditions: list[str] = field(default_factory=list) # Must be true before starting
action_sequence: list[str] = field(default_factory=list) # Steps to execute
parameters: list[str] = field(default_factory=list) # Required params ({service_name})
postconditions: list[str] = field(default_factory=list) # Expected state after completion
# Performance tracking
success_count: int = 0
failure_count: int = 0
partial_success_count: int = 0
last_used: float = field(default_factory=time.time)
created_at: float = field(default_factory=time.time)
# Quality metadata
source: str = "manual" # manual | learned | imported
confidence: float = 1.0 # Prior confidence before any execution
tags: list[str] = field(default_factory=list)
sub_skills: list[str] = field(default_factory=list) # IDs of component skills
@property
def reliability(self) -> float:
"""Success rate across all executions. Partial success counts as 0.5."""
total = self.success_count + self.failure_count + self.partial_success_count
if total == 0:
return self.confidence
weighted = self.success_count + self.partial_success_count * 0.5
return weighted / total
@property
def total_executions(self) -> int:
return self.success_count + self.failure_count + self.partial_success_count
def to_context_text(self, indent: str = " ") -> str:
"""Format skill for injection into agent context."""
lines = [
f"Skill: {self.name}",
f"Reliability: {self.reliability:.0%} ({self.total_executions} executions)",
]
if self.preconditions:
lines.append("Preconditions:")
lines.extend(f"{indent}✓ {p}" for p in self.preconditions)
lines.append("Steps:")
lines.extend(f"{indent}{i+1}. {step}" for i, step in enumerate(self.action_sequence))
if self.postconditions:
lines.append("Expected outcome:")
lines.extend(f"{indent}→ {p}" for p in self.postconditions)
return "\n".join(lines)
The Skill Lifecycle
Skill Formation: When to Store a New Skill
Not every successful task execution should become a skill. Storage should be selective.
Store when:
- The task was completed successfully
- The task required more than 3 non-trivial steps
- The task type is likely to recur (deployment, debugging, report generation)
- The solution was not obvious from general knowledge alone
Do not store when:
- The task was trivially simple (answer a factual question)
- The task was highly specific and unlikely to recur
- A very similar skill already exists - consolidate instead
class SkillFormationPolicy:
"""Decides whether a completed trajectory should be stored as a skill."""
RECURRENCE_CATEGORIES = [
"deployment", "debugging", "setup", "configuration",
"testing", "monitoring", "data-processing", "reporting",
]
RECURRING_KEYWORDS = [
"deploy", "debug", "setup", "configure", "install",
"test", "monitor", "generate", "run", "build", "migrate",
]
def __init__(self, min_steps: int = 3):
self.min_steps = min_steps
def should_store(
self,
task_description: str,
steps_completed: int,
success: bool,
category: str = "general",
) -> tuple[bool, str]:
if not success:
return False, "task_failed"
if steps_completed < self.min_steps:
return False, f"too_simple_{steps_completed}_steps"
if category in self.RECURRENCE_CATEGORIES:
return True, f"recurrent_category_{category}"
if any(kw in task_description.lower() for kw in self.RECURRING_KEYWORDS):
return True, "recurring_keyword_match"
return False, "one_time_task"
Full Implementation: Procedural Memory System
"""
Procedural Memory System for AI Agents.
Stores successful action sequences as reusable skills.
Supports retrieval by task similarity, execution tracking,
skill composition, and reliability-based updates.
Install: pip install anthropic
"""
from __future__ import annotations
import json
import math
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional
import anthropic
@dataclass
class Skill:
id: str = field(default_factory=lambda: str(uuid.uuid4()))
name: str = ""
description: str = ""
category: str = "general"
preconditions: list[str] = field(default_factory=list)
action_sequence: list[str] = field(default_factory=list)
parameters: list[str] = field(default_factory=list)
postconditions: list[str] = field(default_factory=list)
success_count: int = 0
failure_count: int = 0
partial_success_count: int = 0
last_used: float = field(default_factory=time.time)
created_at: float = field(default_factory=time.time)
source: str = "manual"
confidence: float = 1.0
tags: list[str] = field(default_factory=list)
sub_skills: list[str] = field(default_factory=list)
@property
def reliability(self) -> float:
total = self.success_count + self.failure_count + self.partial_success_count
if total == 0:
return self.confidence
weighted = self.success_count + self.partial_success_count * 0.5
return weighted / total
@property
def total_executions(self) -> int:
return self.success_count + self.failure_count + self.partial_success_count
def to_context_text(self) -> str:
lines = [
f"Skill: {self.name}",
f"Reliability: {self.reliability:.0%} ({self.total_executions} executions)",
]
if self.preconditions:
lines.append("Preconditions:")
lines.extend(f" ✓ {p}" for p in self.preconditions)
lines.append("Steps:")
lines.extend(f" {i+1}. {step}" for i, step in enumerate(self.action_sequence))
if self.postconditions:
lines.append("Expected outcome:")
lines.extend(f" → {p}" for p in self.postconditions)
return "\n".join(lines)
# ─────────────────────────────────────────────
# PROCEDURAL MEMORY STORE
# ─────────────────────────────────────────────
class ProceduralMemoryStore:
"""
Stores and retrieves agent skills.
Storage: In-memory for demo.
Production: SQLite or Redis with persistence.
Retrieval: Keyword similarity - production: embedding similarity.
"""
RETIRE_THRESHOLD = 0.30
RETIRE_MIN_EXECUTIONS = 10
def __init__(self):
self.skills: dict[str, Skill] = {}
self.category_index: dict[str, list[str]] = {}
self.client = anthropic.Anthropic()
# ─── STORAGE ──────────────────────────────────────────────────
def store(self, skill: Skill) -> Skill:
"""Store a new skill. Consolidate if a near-duplicate exists."""
similar = self.find_similar_skill(skill.description, threshold=0.75)
if similar:
print(f"[ProceduralStore] Similar skill exists: '{similar.name}'. Consolidating.")
self._consolidate(similar, skill)
return similar
self.skills[skill.id] = skill
if skill.category not in self.category_index:
self.category_index[skill.category] = []
self.category_index[skill.category].append(skill.id)
print(f"[ProceduralStore] Stored new skill: '{skill.name}' (category: {skill.category})")
return skill
def store_from_trajectory(
self,
task_description: str,
steps: list[str],
success: bool,
category: str = "general",
tags: list[str] | None = None,
) -> Optional[Skill]:
"""Create and store a skill from a completed agent trajectory."""
if not success:
return None
metadata = self._extract_skill_metadata(task_description, steps)
skill = Skill(
name=metadata.get("name", task_description[:40]),
description=metadata.get("description", task_description),
category=category,
preconditions=metadata.get("preconditions", []),
action_sequence=steps,
postconditions=metadata.get("postconditions", []),
parameters=metadata.get("parameters", []),
success_count=1,
source="learned",
tags=tags or [],
)
return self.store(skill)
def _extract_skill_metadata(self, task_description: str, steps: list[str]) -> dict:
"""Use LLM to generate structured skill metadata from a trajectory."""
prompt = f"""Create metadata for a reusable agent skill.
Task: {task_description}
Steps: {chr(10).join(f'{i+1}. {s}' for i, s in enumerate(steps))}
Return JSON:
{{
"name": "short_snake_case_name",
"description": "1-2 sentences: when to use this skill",
"preconditions": ["condition 1", "condition 2"],
"parameters": ["{{param1}}", "{{param2}}"],
"postconditions": ["expected result 1"]
}}
Return only valid JSON."""
try:
response = self.client.messages.create(
model="claude-haiku-4-5",
max_tokens=400,
messages=[{"role": "user", "content": prompt}],
)
raw = response.content[0].text.strip()
if raw.startswith("```"):
raw = "\n".join(raw.split("\n")[1:-1])
return json.loads(raw)
except Exception:
return {
"name": task_description[:40].lower().replace(" ", "_"),
"description": task_description,
"preconditions": [],
"parameters": [],
"postconditions": [],
}
def _consolidate(self, existing: Skill, new: Skill) -> None:
"""Update existing skill when a similar one is stored."""
existing.success_count += new.success_count
existing.last_used = time.time()
if len(new.action_sequence) < len(existing.action_sequence) and new.success_count > 0:
print(f"[ProceduralStore] New trajectory is shorter - consider reviewing for refinement")
# ─── RETRIEVAL ─────────────────────────────────────────────────
def retrieve(
self,
task_description: str,
category: str | None = None,
top_k: int = 3,
min_reliability: float = 0.0,
) -> list[tuple[Skill, float]]:
"""Find skills matching a task description. Returns (Skill, score) pairs."""
query_words = set(task_description.lower().split())
scored: list[tuple[float, Skill]] = []
candidates = list(self.skills.values())
if category:
cat_ids = set(self.category_index.get(category, []))
candidates = [s for s in candidates if s.id in cat_ids]
for skill in candidates:
if skill.reliability < min_reliability:
continue
# Keyword overlap with description and name
skill_words = set(skill.description.lower().split())
name_words = set(skill.name.replace("_", " ").lower().split())
all_skill_words = skill_words | name_words
overlap = len(query_words & all_skill_words)
if overlap == 0:
continue
keyword_score = overlap / max(len(query_words), 1)
reliability_boost = skill.reliability * 0.3
days_since_use = (time.time() - skill.last_used) / 86400
recency = 1.0 / (1.0 + days_since_use * 0.1)
score = keyword_score * 0.6 + reliability_boost + recency * 0.1
scored.append((score, skill))
scored.sort(key=lambda x: x[0], reverse=True)
return [(s, score) for score, s in scored[:top_k]]
def find_similar_skill(self, description: str, threshold: float = 0.80) -> Optional[Skill]:
"""Find a skill very similar to the given description (for deduplication)."""
results = self.retrieve(description, top_k=1)
if results and results[0][1] >= threshold:
return results[0][0]
return None
# ─── EXECUTION TRACKING ────────────────────────────────────────
def record_outcome(
self,
skill_id: str,
outcome: str, # "success" | "failure" | "partial"
notes: str = "",
) -> None:
"""Record the outcome of a skill execution."""
skill = self.skills.get(skill_id)
if not skill:
return
if outcome == "success":
skill.success_count += 1
elif outcome == "failure":
skill.failure_count += 1
elif outcome == "partial":
skill.partial_success_count += 1
skill.last_used = time.time()
# Auto-retire unreliable skills after sufficient executions
if (skill.total_executions >= self.RETIRE_MIN_EXECUTIONS
and skill.reliability < self.RETIRE_THRESHOLD):
print(f"[ProceduralStore] Retiring '{skill.name}' (reliability: {skill.reliability:.0%})")
del self.skills[skill_id]
for cat_ids in self.category_index.values():
if skill_id in cat_ids:
cat_ids.remove(skill_id)
def refine_skill(self, skill_id: str, new_steps: list[str], success: bool) -> bool:
"""Update a skill's action sequence if a better trajectory is found."""
skill = self.skills.get(skill_id)
if not skill or not success:
return False
if len(new_steps) < len(skill.action_sequence):
print(f"[ProceduralStore] Refining '{skill.name}': "
f"{len(skill.action_sequence)} → {len(new_steps)} steps")
skill.action_sequence = new_steps
skill.last_used = time.time()
return True
return False
# ─── COMPOSITION ─────────────────────────────────────────────
def compose_skills(
self,
skill_ids: list[str],
composite_name: str,
composite_description: str,
) -> Optional[Skill]:
"""Combine multiple primitive skills into a composite skill."""
components = [self.skills[sid] for sid in skill_ids if sid in self.skills]
if not components:
return None
combined_steps = []
for skill in components:
combined_steps.append(f"--- Phase: {skill.name} ---")
combined_steps.extend(skill.action_sequence)
all_preconditions: list[str] = []
for skill in components:
all_preconditions.extend(skill.preconditions)
composite = Skill(
name=composite_name,
description=composite_description,
category="composite",
preconditions=list(dict.fromkeys(all_preconditions)),
action_sequence=combined_steps,
sub_skills=[s.id for s in components],
source="composed",
)
return self.store(composite)
def stats(self) -> dict:
total = len(self.skills)
by_category = {cat: len(ids) for cat, ids in self.category_index.items()}
avg_reliability = (
sum(s.reliability for s in self.skills.values()) / total if total > 0 else 0
)
return {
"total_skills": total,
"by_category": by_category,
"avg_reliability": f"{avg_reliability:.0%}",
}
# ─────────────────────────────────────────────
# PROCEDURAL AGENT
# ─────────────────────────────────────────────
class ProceduralAgent:
"""Agent that retrieves and uses skills from procedural memory."""
MODEL = "claude-opus-4-6"
def __init__(self):
self.client = anthropic.Anthropic()
self.procedural = ProceduralMemoryStore()
self.conversation: list[dict] = []
self._seed_skills()
def _seed_skills(self) -> None:
"""Pre-populate with known workflows."""
# Kubernetes deployment
self.procedural.store(Skill(
name="kubernetes_rolling_deploy",
description="Deploy a containerized service to Kubernetes using rolling update strategy",
category="deployment",
preconditions=[
"kubectl configured and authenticated to target cluster",
"Docker image built and pushed to registry",
],
action_sequence=[
"1. Verify image: docker manifest inspect {registry}/{image}:{tag}",
"2. Update image: kubectl set image deployment/{name} {name}={registry}/{image}:{tag} -n {namespace}",
"3. Monitor rollout: kubectl rollout status deployment/{name} -n {namespace} --timeout=300s",
"4. Verify pods: kubectl get pods -l app={name} -n {namespace}",
"5. Smoke test: curl -f http://{service_host}/health",
"6. Check logs: kubectl logs deployment/{name} --tail=50 | grep -E 'ERROR|FATAL'",
"7. If issues: kubectl rollout undo deployment/{name} -n {namespace}",
],
postconditions=[
"All pods running new image version",
"Health endpoint returning 200",
"No ERROR-level logs in the last minute",
],
parameters=["{registry}", "{image}", "{tag}", "{name}", "{namespace}", "{service_host}"],
success_count=47,
failure_count=3,
source="manual",
tags=["kubernetes", "deployment"],
))
# API latency debugging
self.procedural.store(Skill(
name="api_latency_investigation",
description="Debug and resolve high API latency or timeout issues in a web service",
category="debugging",
preconditions=[
"Access to application logs",
"Monitoring dashboard available",
],
action_sequence=[
"1. Check p95/p99 latency by route - identify the slow endpoints",
"2. Search logs: grep 'slow_query' application.log | sort -k4 -rn | head -20",
"3. Review N+1 query patterns in APM traces",
"4. Check connection pool: look for 'pool_timeout' or 'connection refused'",
"5. Run EXPLAIN ANALYZE on the slowest database queries",
"6. Check for blocking synchronous calls to external services",
"7. Apply fix: add index, optimize query, add caching, or convert to async",
"8. Deploy and monitor p95 latency for 15 minutes",
],
postconditions=[
"p95 latency within SLA",
"No timeout errors in last 5 minutes",
],
success_count=23,
failure_count=4,
partial_success_count=5,
source="manual",
tags=["debugging", "latency", "database"],
))
# Database migration
self.procedural.store(Skill(
name="postgresql_migration_deploy",
description="Deploy a PostgreSQL database migration safely in production using Alembic",
category="deployment",
preconditions=[
"Migration tested in staging",
"Database backup completed within last 24 hours",
],
action_sequence=[
"1. Backup schema: pg_dump --schema-only {db_name} > schema_backup.sql",
"2. Test on staging: alembic upgrade head",
"3. Verify reversible: alembic downgrade -1 on staging",
"4. Run on production: alembic upgrade head",
"5. Verify schema: psql -c '\\d+ {table_name}'",
"6. Run smoke tests",
"7. Monitor error rates for 10 minutes",
"8. If issues: alembic downgrade -1",
],
success_count=31,
failure_count=1,
source="manual",
tags=["database", "migration", "postgresql"],
))
def respond(self, user_message: str) -> str:
"""Respond using relevant skills from procedural memory."""
matching_skills = self.procedural.retrieve(task_description=user_message, top_k=2)
system = (
"You are an expert DevOps and engineering assistant. "
"When skills are provided, follow the steps precisely - "
"they represent validated procedures. "
"Adapt parameter placeholders ({service_name}, etc.) to the specific context.\n"
)
if matching_skills:
system += "\n## Relevant Skills from Procedural Memory\n"
for skill, score in matching_skills:
system += f"\n{skill.to_context_text()}\n"
system += f"[Retrieved with match score: {score:.0%}]\n"
else:
system += "\nNo stored procedures match this task. Reason through the steps carefully.\n"
self.conversation.append({"role": "user", "content": user_message})
response = self.client.messages.create(
model=self.MODEL,
max_tokens=1024,
system=system,
messages=self.conversation,
)
assistant_text = response.content[0].text
self.conversation.append({"role": "assistant", "content": assistant_text})
return assistant_text
def learn_from_trajectory(
self,
task_description: str,
steps: list[str],
success: bool,
category: str = "general",
) -> None:
"""Store a completed trajectory as a new skill."""
skill = self.procedural.store_from_trajectory(
task_description=task_description,
steps=steps,
success=success,
category=category,
)
if skill:
print(f"[ProceduralAgent] Learned new skill: '{skill.name}'")
def record_task_outcome(self, task_description: str, outcome: str) -> None:
"""Record outcome for the most recently matched skill."""
matches = self.procedural.retrieve(task_description, top_k=1)
if matches:
skill, _ = matches[0]
self.procedural.record_outcome(skill.id, outcome)
print(f"[ProceduralAgent] Recorded '{outcome}' for skill '{skill.name}'")
# ─────────────────────────────────────────────
# DEMONSTRATION
# ─────────────────────────────────────────────
def demo():
print("=" * 60)
print("PROCEDURAL MEMORY - DEMONSTRATION")
print("=" * 60)
agent = ProceduralAgent()
print(f"\nInitialized: {agent.procedural.stats()}")
# Query 1: Deployment (matches stored skill)
print("\n\n=== QUERY 1: Kubernetes deployment ===")
r1 = agent.respond(
"I need to deploy the recommendation-service to Kubernetes. "
"Image: 123456789.dkr.ecr.us-east-1.amazonaws.com/rec-service:abc123. "
"Namespace: production."
)
print(f"Agent:\n{r1}")
# Query 2: Debugging (matches debug skill)
print("\n\n=== QUERY 2: API latency debugging ===")
r2 = agent.respond(
"Our user-profile API is showing 8-second latency on /profile endpoint. "
"Need to debug and fix this."
)
print(f"Agent:\n{r2}")
# Learn a new skill from a trajectory
print("\n\n=== LEARNING: New SSL setup skill ===")
agent.learn_from_trajectory(
task_description="Set up SSL/TLS certificate with Let's Encrypt and Nginx on Ubuntu server",
steps=[
"Install certbot: sudo apt install certbot python3-certbot-nginx",
"Obtain cert: sudo certbot --nginx -d {domain}",
"Verify renewal: sudo certbot renew --dry-run",
"Reload nginx: sudo nginx -s reload",
"Test HTTPS: curl -vI https://{domain} | grep 'SSL certificate'",
"Set up cron: 0 12 * * * certbot renew --quiet",
],
success=True,
category="setup",
)
print(f"\nUpdated skill store: {agent.procedural.stats()}")
# Query 3: SSL (uses newly learned skill)
print("\n\n=== QUERY 3: SSL setup (newly learned) ===")
r3 = agent.respond(
"I need to set up SSL certificates for api.finvault.com using Let's Encrypt."
)
print(f"Agent:\n{r3}")
# Skill composition example
print("\n\n=== SKILL COMPOSITION ===")
skill_ids = list(agent.procedural.skills.keys())[:2]
if len(skill_ids) >= 2:
composite = agent.procedural.compose_skills(
skill_ids=skill_ids,
composite_name="full_production_deployment",
composite_description="Complete production deployment: migrate database then deploy application",
)
if composite:
print(f"Created composite skill: '{composite.name}' with {len(composite.action_sequence)} steps")
print(f"\nFinal skill store: {agent.procedural.stats()}")
# Record outcomes
print("\n\n=== RECORDING OUTCOMES ===")
agent.record_task_outcome("deploy service to kubernetes", "success")
agent.record_task_outcome("debug API latency issues", "success")
if __name__ == "__main__":
demo()
Procedural Memory vs Fine-Tuning
When should you store skills in procedural memory versus baking them into model weights via fine-tuning?
| Dimension | Procedural Memory | Fine-Tuning |
|---|---|---|
| Update speed | Instant | Hours to days |
| Update cost | Near-zero | Significant compute |
| Inspectability | Full - readable text | Opaque - in weights |
| Forgetting | Never | Catastrophic forgetting risk |
| Context cost | Tokens per call | Zero |
| Scope | Bounded to stored skills | Generalizes across variants |
| Best for | Recurring specific procedures | Broad behavioral changes |
The hybrid approach: fine-tune the model on general task-handling behavior, then use procedural memory for specific frequently-updated workflows.
Procedural Memory and Few-Shot Prompting
Procedural memory has a direct relationship with dynamic few-shot prompting. A skill retrieved from procedural memory is, in effect, a dynamic few-shot example showing the model the right way to approach the task.
The advantage over static few-shot examples hardcoded in the system prompt:
- Adaptive: retrieved by relevance to the current task, not statically included
- Scalable: maintain hundreds of skills, inject only the relevant ones
- Updatable: refine skills based on outcome feedback without code changes
- Tracked: know which skills are used and how often they succeed
This is why procedural memory is sometimes called "dynamic few-shot memory" in the agent systems literature.
:::danger Skill Injection Security Skills injected from procedural memory are treated as authoritative instructions by the LLM. If an attacker can write to the skill store, they can inject malicious "skills" the agent will follow. Always authenticate and authorize skill writes, validate skill content before storage (reject content containing prompt injection patterns), and if executing skill steps programmatically, run them through a sandboxed executor. :::
:::warning Reliability Score Gaming Reliability scores only mean something if outcome tracking is honest. If you automatically record "success" for every task completion without verifying the actual outcome, skills accumulate inflated reliability scores and get retrieved with false confidence. Always verify success criteria explicitly - a health check endpoint, a test suite result, or a specific output pattern - before recording a "success" outcome for a skill execution. :::
Interview Questions and Answers
Q: How is procedural memory different from just putting workflows in the system prompt?
A: Static system prompt workflows take up token budget on every call regardless of relevance. If you have 20 workflows hardcoded, they cost thousands of tokens per call even when the current task only needs one of them. Procedural memory retrieves and injects only the relevant skill - dramatically reducing token cost and context noise. Scalability is the other major difference: you can maintain hundreds of skills in a procedural store, but cannot put hundreds of workflows in a system prompt. Finally, static prompts cannot track reliability - you have no way to know which hardcoded instructions are working and which are not. Procedural memory tracks execution outcomes per skill, enabling evidence-based curation and automatic retirement of unreliable procedures.
Q: How do you decide whether a task's trajectory should be stored as a new skill or used to refine an existing one?
A: Two-step process. First, check for an existing similar skill using semantic similarity of task descriptions. If a similar skill exists (similarity above 0.75–0.80), the outcome updates that skill's reliability score rather than creating a duplicate. If the new trajectory was shorter or more efficient while achieving the same result, consider updating the skill's action sequence with the improved steps. If no similar skill exists, check storage criteria: was the task successful, did it require 3+ non-trivial steps, and is it likely to recur? A one-time task (generate a specific one-page report for a specific meeting) should not become a skill - it will pollute retrieval with irrelevant matches later.
Q: How does procedural memory relate to reinforcement learning from human feedback?
A: Both are mechanisms for learning from outcome feedback, but they operate at different levels. RLHF modifies model weights based on preference data, changing the model's general behavior across all tasks. Procedural memory stores explicit action sequences based on outcome feedback, changing agent behavior for specific task types without modifying the model. Procedural memory is faster to update (no training required), more interpretable (you can read the stored steps), and more targeted (affects only matching task types). RLHF is better for broad behavioral changes affecting all tasks. In practice they complement each other: RLHF shapes general reasoning and communication style; procedural memory provides specific operational knowledge for recurring task types.
Q: How would you implement skill composition for complex multi-phase tasks?
A: Store composite skills with a sub_skills list referencing component skill IDs. When composing, concatenate all component steps with phase separator labels so the agent can track which phase it is in. For retrieval, composite skills should be discoverable both by their own description and by keywords from their components. For failure tracking, record outcomes against specific components, not just the composite - this way you know whether the database migration phase failed or the deployment phase failed. Design principle: keep primitive skills small (3–8 steps) and general-purpose so they compose flexibly. A large monolithic "full deployment" skill is harder to compose than separate "database migration" and "service deployment" primitives.
Q: How do you handle skills that become outdated when infrastructure changes?
A: Reliability tracking is the first defense - if a previously-reliable skill starts failing because the infrastructure changed, its reliability score drops and it gets de-prioritized in retrieval automatically. Set up staleness alerts for skills not successfully executed in 90+ days. When infrastructure changes significantly, create new skills for the new infrastructure rather than modifying existing ones - keep old skills archived (not deleted) in case you need to roll back. Provide explicit invalidation tooling for humans to mark skills as outdated when they know a procedure has changed. For the most robust approach, periodically test skill steps against current infrastructure in a sandbox to verify they still produce the expected results.
