AI Letters #07 · Interactive Reference

The 4 Agent Memory Types

Architecture profiles for every memory layer in production agents.


Memory Type 1
In-Context Memory
The message history — everything in the current context window. The only memory the LLM directly reads.
0ms latency · ~200K tokens max · Ephemeral

What It Stores
The full conversation history: system prompt, user messages, assistant responses, tool calls, and tool results. All of it is present in the input to every LLM inference call. This is working RAM — anything not in the context window is invisible to the model.
Read / Write Pattern
Operation | How
Read | Automatic — the entire context is always in the model's input
Write | Append-only — new messages and tool results added each step
Delete | Context truncation or summarization (eviction)
Latency Profile
Access speed: instant. Access latency is 0ms — the content is already part of the model's input.
Capacity
GPT-4o: ~128K tokens. Claude 3.5: ~200K tokens. Gemini 1.5 Pro: ~1M tokens. The context budget is consumed by the system prompt (~2K), tool schemas (~3K), and accumulated tool results (which can reach 40-80K tokens in long-running agents), leaving far less room for new input than the headline number suggests.
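The budget arithmetic above can be sketched directly. The defaults are the rough estimates from this section, not measured values:

```python
def effective_budget(window_tokens,
                     system_prompt=2_000,
                     tool_schemas=3_000,
                     tool_results=60_000):
    """Rough remaining-context estimate. The defaults mirror the
    ballpark figures above (~2K system prompt, ~3K tool schemas,
    40-80K accumulated tool results in long-running agents)."""
    return window_tokens - system_prompt - tool_schemas - tool_results

# A nominal 128K window can leave roughly 63K tokens for new input
print(effective_budget(128_000))
```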
Code Pattern
# In-context memory is just the messages list,
# managed automatically by your agent framework.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_task},
    # ... tool calls and results appended each loop step
]

# When the budget is tight: summarize and evict old steps
if count_tokens(messages) > 0.8 * MAX_TOKENS:
    messages = summarize_and_truncate(messages)
Best Use Case
Short-to-medium tasks within a single session. Anything the agent needs to reason over in the current step. Tool results that must be cross-referenced in the same reasoning step.
Failure Mode
Context overflow. As the agent runs more steps, accumulated tool results fill the window. The model begins to lose early context. Performance degrades silently — the model still responds, but misses information from earlier steps. Most teams discover this in load testing, not development.
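One mitigation for the overflow failure described above can be sketched as an eviction pass: once the transcript nears the window limit, replace the oldest tool results with short placeholders. `count_tokens`, the 0.8 threshold, and `keep_recent` are assumptions; production frameworks typically summarize rather than blank out results.

```python
def evict_old_tool_results(messages, count_tokens, max_tokens, keep_recent=4):
    """Walk the transcript from the oldest message forward, replacing
    tool results with short placeholders until the token count drops
    below 80% of the window (or nothing older remains to evict).
    The most recent `keep_recent` messages are never touched."""
    placeholder = "[evicted: see long-term memory]"
    i = 0
    while count_tokens(messages) > 0.8 * max_tokens and i < len(messages) - keep_recent:
        if messages[i]["role"] == "tool" and messages[i]["content"] != placeholder:
            messages[i] = {"role": "tool", "content": placeholder}
        i += 1
    return messages
```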
Memory Type 2
Semantic Memory
Vector database storage of facts, documents, and knowledge. Queried by embedding similarity, not keyword match.
10-80ms latency · Millions of entries · Persistent

What It Stores
Encoded knowledge: document chunks, facts, summaries, tool outputs worth preserving, extracted entities. Each item is stored as a dense vector (embedding) alongside the original text and metadata. The vector is what enables semantic retrieval — similar meaning = nearby vector.
Read / Write Pattern
Operation | How
Write | Embed text → upsert (doc, embedding, metadata) by ID
Read | Embed query → ANN search → top-k results returned as text
Update | Upsert with same ID replaces existing entry
Delete | Delete by ID or metadata filter (e.g., TTL expiry)
Latency Profile
ChromaDB local: ~15ms
Pinecone cloud: ~80ms
pgvector HNSW: ~30ms
Code Pattern
# Write: embed and upsert a fact
collection.upsert(
    documents=[fact_text],
    metadatas=[{"source": src, "ttl": expiry}],
    ids=[fact_id],
)

# Read: semantic search at agent step N
results = collection.query(
    query_texts=[current_subtask],
    n_results=5,
    where={"ttl": {"$gt": now()}},  # TTL filter
)
# Inject results into agent context before the LLM call
Best Use Case
Large document corpora, product catalogs, research papers, code bases, internal wikis. Any situation where the agent needs to answer "what do I know about X?" across a body of knowledge too large to fit in context.
Failure Mode
Query-document asymmetry: short queries and long stored documents embed to different regions of the vector space. Retrieval scores degrade gradually, and you get plausible-looking but wrong results. Mitigate with HyDE (generate a hypothetical answer and embed that instead of the raw query), or store facts at roughly the same length as your typical query.
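A sketch of the HyDE pattern just described. The generator and embedder below are toy stand-ins (a real system calls an LLM and an embedding model); the point is only the control flow — embed a hypothetical answer instead of the raw question:

```python
from collections import Counter

def hyde_query_vector(question, generate_hypothetical, embed):
    """HyDE: instead of embedding the short question directly, generate
    a document-length hypothetical answer and embed that, so the query
    vector lands in the same region as the stored documents."""
    hypothetical_doc = generate_hypothetical(question)
    return embed(hypothetical_doc)

# Toy stand-ins for illustration only
def toy_generate(question):
    return f"A plausible answer: {question} typically involves several documented steps."

def toy_embed(text):
    return Counter(text.lower().split())  # bag-of-words "embedding"
```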
Memory Type 3
Episodic Memory
Records of past agent runs: what tasks were done, what tools were called, what conclusions were reached.
10-80ms latency · Unlimited records · Persistent

What It Stores
Task records: the original goal, the sequence of steps taken, tool calls made, observations from each step, final conclusion, outcome (success/partial/failed), duration, timestamp. Think of it as a structured log of the agent's autobiographical history — what it experienced, not just what it knows.
Read / Write Pattern
Operation | How
Write | At task completion — store full episode record as vector + JSON metadata
Read | At task start — query by task similarity, inject "past experience" into system prompt
Filter | By outcome (only successful), by date (recency), by user (multi-tenant)
Latency Profile
Query speed: ~20ms
Episodic memory is queried once per task (at the start), not in the inner loop, so latency is not the bottleneck here.
Code Pattern
# At task START: retrieve similar past episodes
past = episode_memory.query(
    query_texts=[current_task],
    n_results=3,
    where={"outcome": {"$in": ["success", "partial"]}},
)
# Inject into the system prompt as "PAST EXPERIENCE:"

# At task END: store the full episode
episode_memory.upsert(
    documents=[f"Task: {task}\nConclusion: {conclusion}"],
    metadatas=[{"steps_json": json.dumps(steps), "outcome": outcome}],
    ids=[episode_id],
)
Best Use Case
Long-running agent systems that handle recurring task types. Research agents that should not re-analyze documents they've already processed. Customer support agents that should remember past interactions. Any system where "I've done something like this before" is a useful signal for strategy selection.
Failure Mode
Stale episodes contaminate strategy. An agent that failed at a task 6 months ago retrieves that failure context and applies the same failed approach again. Or worse: a successful episode from when an API worked differently is retrieved and followed, causing silent errors. Expire episodes aggressively. Weight recency heavily in retrieval scoring.
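The "weight recency heavily" advice can be sketched as an exponential decay applied to the similarity score. The 30-day half-life is an assumed starting point — tune it to how fast your tools and APIs change:

```python
import time

def recency_weighted(similarity, episode_ts, now=None, half_life_days=30.0):
    """Multiply vector similarity by an exponential recency decay so
    stale episodes stop winning retrieval. An episode half_life_days
    old scores half its raw similarity; twice that age, a quarter."""
    now = time.time() if now is None else now
    age_days = max(0.0, (now - episode_ts) / 86_400)
    return similarity * 0.5 ** (age_days / half_life_days)
```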
Memory Type 4
Procedural Memory
How the agent operates: few-shot examples, tool-calling patterns, reasoning templates. Baked into system prompt or fine-tuned weights.
0ms latency · Fixed at setup · Semi-persistent

What It Stores
Behavioral patterns: how to structure reasoning (chain-of-thought templates), how to call tools (few-shot examples of successful tool use), domain-specific heuristics ("always verify numerical outputs with a second calculation"), safety constraints, output format specifications. Less declarative knowledge, more operational policy.
Read / Write Pattern
Operation | How
Read | Always-on — part of the system prompt, present in every call
Write | Manual (edit system prompt) or via fine-tuning (weights)
Update | Redeploy system prompt or retrain. No runtime updates.
Latency Profile
Access speed: instant.
System prompt is always in context. Zero additional latency. But it consumes token budget permanently — every token in the system prompt is a token not available for task content.
Code Pattern
# Procedural memory = the system prompt (+ few-shot examples)
SYSTEM_PROMPT = """
You are a research agent. Follow this protocol:
1. Before searching, check if similar work was done recently
2. Always cite sources with year and author
3. When results are uncertain, call verify_fact() before concluding
4. Example of correct tool usage:
   User: Summarize recent RLHF papers
   Tool: search_arxiv(query="RLHF 2024", max_results=10)
   [never search without a date constraint]
"""

# Updating procedural memory = redeploying the system prompt,
# or fine-tuning the model on successful agent trajectories
Best Use Case
Stable, well-understood operational patterns that apply to every task the agent handles. Safety constraints that must never be overridden by context. Domain expertise that takes a long time to acquire but applies universally (e.g., medical coding rules, legal citation formats). Anything you want the agent to "just know" without retrieval overhead.
Failure Mode
Procedural drift: the system prompt encodes patterns that made sense when the agent was deployed but are now outdated or counterproductive. New tool APIs, changed constraints, updated policies — none of these update procedural memory automatically. Teams that don't audit system prompts regularly find their agents following stale behavioral rules. Version-control your system prompts and treat changes as deployments, not edits.
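Short of full deployment tooling, a lightweight way to make prompt changes auditable is to fingerprint the active prompt and log it with every run. This is a hypothetical helper, not a substitute for real version control:

```python
import hashlib

def prompt_version(prompt: str) -> str:
    """Short content hash of the system prompt. Log it with every agent
    run so behavior changes can be tied to specific prompt versions."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
```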
Quick Comparison
Memory Type | Latency | Capacity | Persistence | Update Mechanism
In-Context | 0ms | ~200K tokens (model limit) | Session only | Automatic append
Semantic | 10-80ms | Millions of chunks | Permanent (with TTL) | Upsert by ID
Episodic | 10-80ms | Unlimited task records | Permanent (with expiry) | Write at task end
Procedural | 0ms | Tokens in system prompt | Until redeployment | Manual / fine-tuning
www.engineersofai.com · AI Letters #07 · Agentic AI A-Z Series