AI Letters #07 · Interactive Reference

The 4 Agent Memory Types

Architecture profiles for every memory layer in production agents.


Memory Type 1
In-Context Memory
The message history — everything in the current context window. The only memory the LLM directly reads.
0ms latency · ~200K tokens max · Ephemeral

What It Stores
The full conversation history: system prompt, user messages, assistant responses, tool calls, and tool results. All of it is present in the input to every LLM inference call. This is working RAM — anything not in the context window is invisible to the model.
Read / Write Pattern
Operation | How
Read | Automatic — the entire context is always in the model's input
Write | Append-only — new messages and tool results added each step
Delete | Context truncation or summarization (eviction)
Latency Profile
Access speed: instant. Access latency is 0ms — the content is already part of the model's input.
Capacity
GPT-4o: ~128K tokens. Claude 3.5: ~200K tokens. Gemini 1.5 Pro: ~1M tokens. The context budget is consumed by the system prompt (~2K), tool schemas (~3K), and accumulated tool results (which can reach 40-80K tokens in long-running agents), leaving far less room for new input than the headline number suggests.
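The budget arithmetic above can be sketched directly. The defaults are the rough estimates from this section, not measured values:

```python
def effective_budget(window_tokens,
                     system_prompt=2_000,
                     tool_schemas=3_000,
                     tool_results=60_000):
    """Rough remaining-context estimate. The defaults mirror the
    ballpark figures above (~2K system prompt, ~3K tool schemas,
    40-80K accumulated tool results in long-running agents)."""
    return window_tokens - system_prompt - tool_schemas - tool_results

# A nominal 128K window can leave roughly 63K tokens for new input
print(effective_budget(128_000))
```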
Code Pattern
# In-context memory is just the messages list,
# managed automatically by your agent framework.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_task},
    # ... tool calls and results appended each loop step
]

# When the budget is tight: summarize and evict old steps
if count_tokens(messages) > 0.8 * MAX_TOKENS:
    messages = summarize_and_truncate(messages)
Best Use Case
Short-to-medium tasks within a single session. Anything the agent needs to reason over in the current step. Tool results that must be cross-referenced in the same reasoning step.
Failure Mode
Context overflow. As the agent runs more steps, accumulated tool results fill the window. The model begins to lose early context. Performance degrades silently — the model still responds, but misses information from earlier steps. Most teams discover this in load testing, not development.
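One mitigation for the overflow failure described above can be sketched as an eviction pass: once the transcript nears the window limit, replace the oldest tool results with short placeholders. `count_tokens`, the 0.8 threshold, and `keep_recent` are assumptions; production frameworks typically summarize rather than blank out results.

```python
def evict_old_tool_results(messages, count_tokens, max_tokens, keep_recent=4):
    """Walk the transcript from the oldest message forward, replacing
    tool results with short placeholders until the token count drops
    below 80% of the window (or nothing older remains to evict).
    The most recent `keep_recent` messages are never touched."""
    placeholder = "[evicted: see long-term memory]"
    i = 0
    while count_tokens(messages) > 0.8 * max_tokens and i < len(messages) - keep_recent:
        if messages[i]["role"] == "tool" and messages[i]["content"] != placeholder:
            messages[i] = {"role": "tool", "content": placeholder}
        i += 1
    return messages
```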
Memory Type 2
Semantic Memory
Vector database storage of facts, documents, and knowledge. Queried by embedding similarity, not keyword match.
10-80ms latency · Millions of entries · Persistent

What It Stores
Encoded knowledge: document chunks, facts, summaries, tool outputs worth preserving, extracted entities. Each item is stored as a dense vector (embedding) alongside the original text and metadata. The vector is what enables semantic retrieval — similar meaning = nearby vector.
Read / Write Pattern
Operation | How
Write | Embed text → upsert (doc, embedding, metadata) by ID
Read | Embed query → ANN search → top-k results returned as text
Update | Upsert with same ID replaces existing entry
Delete | Delete by ID or metadata filter (e.g., TTL expiry)
Latency Profile
ChromaDB local: ~15ms
Pinecone cloud: ~80ms
pgvector HNSW: ~30ms
Code Pattern
# Write: embed and upsert a fact
collection.upsert(
    documents=[fact_text],
    metadatas=[{"source": src, "ttl": expiry}],
    ids=[fact_id],
)

# Read: semantic search at agent step N
results = collection.query(
    query_texts=[current_subtask],
    n_results=5,
    where={"ttl": {"$gt": now()}},  # TTL filter
)
# Inject results into agent context before the LLM call
Best Use Case
Large document corpora, product catalogs, research papers, code bases, internal wikis. Any situation where the agent needs to answer "what do I know about X?" across a body of knowledge too large to fit in context.
Failure Mode
Query-document asymmetry: short queries and long stored documents embed to different regions of the vector space. Retrieval scores degrade gradually, and you get plausible-looking but wrong results. Mitigate with HyDE (generate a hypothetical answer and embed that instead of the raw query), or store facts at roughly the same length as your typical query.
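A sketch of the HyDE pattern just described. The generator and embedder below are toy stand-ins (a real system calls an LLM and an embedding model); the point is only the control flow — embed a hypothetical answer instead of the raw question:

```python
from collections import Counter

def hyde_query_vector(question, generate_hypothetical, embed):
    """HyDE: instead of embedding the short question directly, generate
    a document-length hypothetical answer and embed that, so the query
    vector lands in the same region as the stored documents."""
    hypothetical_doc = generate_hypothetical(question)
    return embed(hypothetical_doc)

# Toy stand-ins for illustration only
def toy_generate(question):
    return f"A plausible answer: {question} typically involves several documented steps."

def toy_embed(text):
    return Counter(text.lower().split())  # bag-of-words "embedding"
```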
Memory Type 3
Episodic Memory
Records of past agent runs: what tasks were done, what tools were called, what conclusions were reached.
10-80ms latency · Unlimited records · Persistent

What It Stores
Task records: the original goal, the sequence of steps taken, tool calls made, observations from each step, final conclusion, outcome (success/partial/failed), duration, timestamp. Think of it as a structured log of the agent's autobiographical history — what it experienced, not just what it knows.
Read / Write Pattern
Operation | How
Write | At task completion — store full episode record as vector + JSON metadata
Read | At task start — query by task similarity, inject "past experience" into system prompt
Filter | By outcome (only successful), by date (recency), by user (multi-tenant)
Latency Profile
Query speed: ~20ms
Episodic memory is queried once per task (at the start), not in the inner loop, so latency is not the bottleneck here.
Code Pattern
# At task START: retrieve similar past episodes
past = episode_memory.query(
    query_texts=[current_task],
    n_results=3,
    where={"outcome": {"$in": ["success", "partial"]}},
)
# Inject into the system prompt as "PAST EXPERIENCE:"

# At task END: store the full episode
episode_memory.upsert(
    documents=[f"Task: {task}\nConclusion: {conclusion}"],
    metadatas=[{"steps_json": json.dumps(steps), "outcome": outcome}],
    ids=[episode_id],
)
Best Use Case
Long-running agent systems that handle recurring task types. Research agents that should not re-analyze documents they've already processed. Customer support agents that should remember past interactions. Any system where "I've done something like this before" is a useful signal for strategy selection.
Failure Mode
Stale episodes contaminate strategy. An agent that failed at a task 6 months ago retrieves that failure context and applies the same failed approach again. Or worse: a successful episode from when an API worked differently is retrieved and followed, causing silent errors. Expire episodes aggressively. Weight recency heavily in retrieval scoring.
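The "weight recency heavily" advice can be sketched as an exponential decay applied to the similarity score. The 30-day half-life is an assumed starting point — tune it to how fast your tools and APIs change:

```python
import time

def recency_weighted(similarity, episode_ts, now=None, half_life_days=30.0):
    """Multiply vector similarity by an exponential recency decay so
    stale episodes stop winning retrieval. An episode half_life_days
    old scores half its raw similarity; twice that age, a quarter."""
    now = time.time() if now is None else now
    age_days = max(0.0, (now - episode_ts) / 86_400)
    return similarity * 0.5 ** (age_days / half_life_days)
```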
Memory Type 4
Procedural Memory
How the agent operates: few-shot examples, tool-calling patterns, reasoning templates. Baked into system prompt or fine-tuned weights.
0ms latency · Fixed at setup · Semi-persistent

What It Stores
Behavioral patterns: how to structure reasoning (chain-of-thought templates), how to call tools (few-shot examples of successful tool use), domain-specific heuristics ("always verify numerical outputs with a second calculation"), safety constraints, output format specifications. Less declarative knowledge, more operational policy.
Read / Write Pattern
Operation | How
Read | Always-on — part of the system prompt, present in every call
Write | Manual (edit system prompt) or via fine-tuning (weights)
Update | Redeploy system prompt or retrain. No runtime updates.
Latency Profile
Access speed: instant.
System prompt is always in context. Zero additional latency. But it consumes token budget permanently — every token in the system prompt is a token not available for task content.
Code Pattern
# Procedural memory = the system prompt (+ few-shot examples)
SYSTEM_PROMPT = """
You are a research agent. Follow this protocol:
1. Before searching, check if similar work was done recently
2. Always cite sources with year and author
3. When results are uncertain, call verify_fact() before concluding
4. Example of correct tool usage:
   User: Summarize recent RLHF papers
   Tool: search_arxiv(query="RLHF 2024", max_results=10)
   [never search without a date constraint]
"""

# Updating procedural memory = redeploying the system prompt,
# or fine-tuning the model on successful agent trajectories
Best Use Case
Stable, well-understood operational patterns that apply to every task the agent handles. Safety constraints that must never be overridden by context. Domain expertise that takes a long time to acquire but applies universally (e.g., medical coding rules, legal citation formats). Anything you want the agent to "just know" without retrieval overhead.
Failure Mode
Procedural drift: the system prompt encodes patterns that made sense when the agent was deployed but are now outdated or counterproductive. New tool APIs, changed constraints, updated policies — none of these update procedural memory automatically. Teams that don't audit system prompts regularly find their agents following stale behavioral rules. Version-control your system prompts and treat changes as deployments, not edits.
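Short of full deployment tooling, a lightweight way to make prompt changes auditable is to fingerprint the active prompt and log it with every run. This is a hypothetical helper, not a substitute for real version control:

```python
import hashlib

def prompt_version(prompt: str) -> str:
    """Short content hash of the system prompt. Log it with every agent
    run so behavior changes can be tied to specific prompt versions."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
```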
Quick Comparison
Memory Type | Latency | Capacity | Persistence | Update Mechanism
In-Context | 0ms | ~200K tokens (model limit) | Session only | Automatic append
Semantic | 10-80ms | Millions of chunks | Permanent (with TTL) | Upsert by ID
Episodic | 10-80ms | Unlimited task records | Permanent (with expiry) | Write at task end
Procedural | 0ms | Tokens in system prompt | Until redeployment | Manual / fine-tuning
www.engineersofai.com · AI Letters #07 · Agentic AI A-Z Series