Memory Systems: Short-Term and Long-Term

A Production Scenario

You have deployed a customer success AI that handles thousands of conversations per day. In each individual conversation, it is excellent - it understands context, remembers what was said three messages ago, and builds on the conversation naturally. Users are happy.

Then you start getting complaints. A customer who reported the same billing issue three times gets asked for their account details on the fourth call. An account manager who negotiated a custom enterprise discount finds the AI has no idea about it on a follow-up call. A power user who explained their complex workflow setup in January has to explain it again every single session.

The agent has perfect short-term memory (the current conversation) and zero long-term memory (anything from previous sessions). From the user's perspective, this is not an intelligent system - it is a sophisticated amnesiac.

You add a database and store transcripts. Now the agent can look up past conversations. But the database has grown to 50,000 transcripts and the agent tries to load them all into the prompt. The context window fills up. Costs explode. Latency spikes. And the agent still cannot find the specific detail the user mentioned six weeks ago because searching 50,000 raw transcripts for a semantic match is not a simple keyword lookup.

What you actually need is a memory architecture - a system that manages different types of information at different time scales, stores information in appropriate formats, retrieves only what is relevant when it is needed, and keeps the context window focused on the current task. This is the memory problem in agent systems, and it is both harder and more important than most developers realize.


Why This Exists

The Fundamental Tension

Every LLM has a context window - a fixed-size buffer of tokens it can "see" at once. Older models had 4K tokens (roughly 3,000 words). Modern models have 200K+ tokens. But even with large context windows, unlimited memory is impossible:

  1. Cost: Tokens cost money. Injecting 100,000 tokens of conversation history into every request is economically unsustainable at scale.
  2. Attention degradation: LLMs use information at the start and end of a long context more reliably than information in the middle (the "lost in the middle" problem, Liu et al. 2023).
  3. Sessions end: Context windows are ephemeral. Close the session, lose the context.
  4. Knowledge vs. experience: Some things should be retrieved (facts about a specific user); others should be learned (patterns across all users).

The solution is a layered memory architecture, not a bigger context window.
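To make the cost point concrete, here is a back-of-the-envelope sketch. The price and request volume below are hypothetical placeholders, not any provider's real pricing, and `estimate_daily_input_cost` is an illustrative helper rather than part of any SDK.

```python
def estimate_daily_input_cost(
    tokens_per_request: int,
    requests_per_day: int,
    usd_per_million_input_tokens: float,
) -> float:
    """Return the daily spend on input tokens alone, in USD."""
    total_tokens = tokens_per_request * requests_per_day
    return total_tokens / 1_000_000 * usd_per_million_input_tokens


# Hypothetical numbers for illustration only; substitute your own.
ASSUMED_PRICE = 3.00        # assumed USD per million input tokens
ASSUMED_REQUESTS = 10_000   # assumed requests per day

full_history = estimate_daily_input_cost(100_000, ASSUMED_REQUESTS, ASSUMED_PRICE)
focused = estimate_daily_input_cost(4_000, ASSUMED_REQUESTS, ASSUMED_PRICE)

print(f"Full history in every prompt: ${full_history:,.2f}/day")
print(f"Focused context:              ${focused:,.2f}/day")
```

Under these assumed numbers the input-token bill differs by a factor of 25, which is simply the ratio of the two prompt sizes.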

Why a Simple Database Fails

Storing everything in a database and retrieving by keyword fails because: (1) user queries are semantic, not lexical - "that issue with my invoices" does not match the stored transcript "PDF export failing for enterprise tier"; (2) relevance is context-dependent - the same fact is relevant in some conversations and not others; (3) there is too much data to scan for each query.

The fix requires semantic retrieval (vector embeddings) combined with smart management of what gets stored, how it is compressed, and when it is retrieved.
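The lexical mismatch is easy to demonstrate with the example above. The sketch below scores the two strings by keyword overlap, then shows the cosine-similarity calculation that vector retrieval relies on. The three-dimensional vectors are made up purely to show the mechanics; in a real system an embedding model produces them, and the function names are illustrative.

```python
import math
import re


def keyword_overlap(query: str, document: str) -> float:
    """Jaccard overlap between the word sets of two strings."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    d = set(re.findall(r"[a-z0-9]+", document.lower()))
    return len(q & d) / len(q | d) if (q | d) else 0.0


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


query = "that issue with my invoices"
stored = "PDF export failing for enterprise tier"

# The two strings share no words, so a keyword search scores them zero.
lexical_score = keyword_overlap(query, stored)

# Toy vectors (NOT real embeddings): related texts point in similar directions.
query_vec = [0.9, 0.1, 0.3]
stored_vec = [0.8, 0.2, 0.4]
unrelated_vec = [0.1, 0.9, 0.0]

related_score = cosine_similarity(query_vec, stored_vec)
unrelated_score = cosine_similarity(query_vec, unrelated_vec)
print(lexical_score, round(related_score, 3), round(unrelated_score, 3))
```

Retrieval then ranks stored items by this similarity score instead of by shared words.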


The Four Memory Types

Think of agent memory as a hierarchy, analogous to human cognitive architecture:

In-Context Memory (Working Memory)

What is currently in the prompt. This includes the system prompt, conversation history, retrieved documents, and tool results. It is fast, always available, but strictly limited in size and completely lost when the session ends.

Management: sliding window (drop oldest messages), summarization (compress old messages into a summary), selective retention (keep only messages marked as important).

Episodic Memory (External Store)

Stored records of past interactions, retrieved by semantic similarity when relevant. This is the agent's "diary" - what happened, with whom, when.

Implementation: vector database (Chroma, Pinecone, Weaviate) with conversation chunks embedded and indexed. At the start of each session, retrieve the top-k most semantically similar past interactions.

Semantic Memory (Knowledge Base)

Facts, documents, domain knowledge. This is the agent's "encyclopedia" - what it knows about the world, users, products, policies.

Implementation: same vector database infrastructure as episodic memory, but different collections. Documents chunked, embedded, and indexed. Retrieved when the agent needs factual grounding.
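The "chunked" step can be sketched as a word-based splitter with overlap, so that a fact straddling a chunk boundary still appears whole in at least one chunk. This is a minimal illustration with assumed default sizes; `chunk_text` is a hypothetical helper, and production pipelines often split on sentences or headings instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of up to `chunk_size` words, overlapping by `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Each chunk is then embedded and indexed on its own, which keeps every embedding focused on a small span of text.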

Procedural Memory (Learned Behavior)

Skills and patterns that are "baked in" - either through the base model's pre-training or through fine-tuning on domain-specific data. Unlike episodic and semantic memory, procedural memory is not retrieved at inference time - it lives in the model's weights.

Implementation: fine-tuning on domain examples. Prompt templates that encode best practices and tool implementations that encode procedures play a similar role outside the weights.


Context Window Management

The most immediate memory problem is managing what stays in the context window as conversations grow long.

import anthropic
from dataclasses import dataclass, field
from typing import Literal

client = anthropic.Anthropic()

Message = dict  # {"role": "user"|"assistant", "content": str}


def count_tokens(messages: list[Message]) -> int:
    """Estimate token count for a list of messages."""
    # Rough estimate: 1 token per 4 characters
    total_chars = sum(len(str(m.get("content", ""))) for m in messages)
    return total_chars // 4


# ── Strategy 1: Sliding Window ────────────────────────────────────────────────

def sliding_window(
    messages: list[Message],
    max_tokens: int = 8000,
    always_keep_first: int = 2  # Keep the first exchange (initial context)
) -> list[Message]:
    """
    Drop oldest messages when context gets too large.
    Always preserve the first N messages (usually system context).
    """
    protected = messages[:always_keep_first]
    sliding = messages[always_keep_first:]

    while count_tokens(protected + sliding) > max_tokens and len(sliding) > 2:
        sliding = sliding[2:]  # Drop oldest user+assistant pair

    return protected + sliding


# ── Strategy 2: Summarization ─────────────────────────────────────────────────

def summarize_old_messages(
    messages: list[Message],
    max_tokens: int = 8000,
    keep_recent: int = 6
) -> list[Message]:
    """
    Compress old messages into a summary, keep recent ones intact.
    """
    if count_tokens(messages) <= max_tokens:
        return messages

    # Nothing to compress if every message counts as "recent"
    if len(messages) <= keep_recent:
        return messages

    # Split: old messages to compress, recent to keep
    old_messages = messages[:-keep_recent]
    recent_messages = messages[-keep_recent:]

    # Summarize the old messages
    summary_prompt = (
        "Summarize the following conversation history. "
        "Focus on: decisions made, facts established, user preferences, "
        "and any unresolved issues. Be concise but complete.\n\n"
        + "\n".join(
            f"{m['role'].upper()}: {m['content']}"
            for m in old_messages
        )
    )

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=500,
        messages=[{"role": "user", "content": summary_prompt}]
    )

    summary = response.content[0].text
    summary_message = {
        "role": "user",
        "content": f"[CONVERSATION SUMMARY - previous messages compressed]\n{summary}"
    }

    return [summary_message] + recent_messages


# ── Strategy 3: Selective Retention ───────────────────────────────────────────

@dataclass
class AnnotatedMessage:
    role: str
    content: str
    importance: Literal["high", "medium", "low"] = "medium"
    # default_factory gives each message its own list (a shared default would be a bug)
    tags: list[str] = field(default_factory=list)

    def to_message(self) -> Message:
        return {"role": self.role, "content": self.content}


def selective_retention(
    messages: list[AnnotatedMessage],
    max_tokens: int = 8000
) -> list[Message]:
    """
    Drop low-importance messages first, then medium, when context is full.
    High-importance messages and the last 2 messages are always kept.
    """
    priority = {"high": 3, "medium": 2, "low": 1}

    plain_messages = [m.to_message() for m in messages]
    if count_tokens(plain_messages) <= max_tokens:
        return plain_messages

    # Candidates for dropping: everything except the last 2 messages,
    # lowest importance first (the stable sort keeps oldest first within a level)
    candidates = list(enumerate(messages[:-2]))
    candidates.sort(key=lambda x: priority[x[1].importance])

    retained_indices = set(range(len(messages)))

    for idx, msg in candidates:
        current = [messages[i].to_message() for i in sorted(retained_indices)]
        if count_tokens(current) <= max_tokens:
            break
        if msg.importance != "high":
            retained_indices.discard(idx)

    # May still exceed max_tokens if the high-importance messages alone are too large
    return [messages[i].to_message() for i in sorted(retained_indices)]

Episodic Memory with Vector Stores

Episodic memory stores past interactions as vector embeddings and retrieves them by semantic similarity at the start of new sessions.

import json
import re
import uuid
from datetime import datetime, timezone

# Using chromadb for local vector storage (swap for Pinecone/Weaviate in production)
try:
    import chromadb
    from chromadb.utils import embedding_functions
    CHROMA_AVAILABLE = True
except ImportError:
    CHROMA_AVAILABLE = False

import anthropic

client = anthropic.Anthropic()

# Message and summarize_old_messages come from the context-management code above


class EpisodicMemory:
    """Store and retrieve past conversation episodes."""

    def __init__(self, collection_name: str = "episodes"):
        if CHROMA_AVAILABLE:
            # chromadb.Client() keeps data in memory only;
            # use chromadb.PersistentClient(path=...) to survive restarts
            self.chroma_client = chromadb.Client()
            self.embedding_fn = embedding_functions.DefaultEmbeddingFunction()
            self.collection = self.chroma_client.get_or_create_collection(
                name=collection_name,
                embedding_function=self.embedding_fn
            )
        self._in_memory_store: list[dict] = []  # Fallback if Chroma unavailable

    def store_episode(
        self,
        user_id: str,
        conversation: list[Message],
        metadata: dict | None = None
    ) -> str:
        """
        Store a conversation episode after it ends.
        Generates a summary embedding for retrieval.
        """
        episode_id = str(uuid.uuid4())
        timestamp = datetime.now(timezone.utc).isoformat()

        # Create a retrievable summary of the conversation
        summary = self._summarize_episode(conversation)

        episode = {
            "id": episode_id,
            "user_id": user_id,
            "timestamp": timestamp,
            "summary": summary,
            "full_conversation": json.dumps(conversation),
            "metadata": metadata or {}
        }

        if CHROMA_AVAILABLE:
            # Only the summary and metadata go to Chroma; the full transcript is not stored
            self.collection.add(
                ids=[episode_id],
                documents=[summary],  # Embedding is computed from the summary
                metadatas=[{
                    "user_id": user_id,
                    "timestamp": timestamp,
                    **{k: str(v) for k, v in (metadata or {}).items()}
                }]
            )
        else:
            self._in_memory_store.append(episode)

        return episode_id

    def retrieve_relevant_episodes(
        self,
        user_id: str,
        current_query: str,
        top_k: int = 3
    ) -> list[dict]:
        """
        Retrieve the most semantically relevant past episodes for a user.
        """
        if CHROMA_AVAILABLE:
            results = self.collection.query(
                query_texts=[current_query],
                n_results=top_k,
                where={"user_id": user_id}
            )
            return [
                {"summary": doc, "metadata": meta}
                for doc, meta in zip(
                    results["documents"][0],
                    results["metadatas"][0]
                )
            ]
        else:
            # Recency fallback for testing: no similarity search,
            # just this user's most recent episodes
            user_episodes = [e for e in self._in_memory_store if e["user_id"] == user_id]
            return user_episodes[-top_k:]

    def _summarize_episode(self, conversation: list[Message]) -> str:
        """Generate a concise summary of a conversation for embedding."""
        if not conversation:
            return "Empty conversation"

        conv_text = "\n".join(
            f"{m['role'].upper()}: {m['content'][:200]}"
            for m in conversation[:10]  # First 10 messages, 200 chars each
        )

        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": (
                    f"Summarize this conversation in 2-3 sentences. "
                    f"Focus on: what the user wanted, what was resolved, key facts established.\n\n"
                    f"{conv_text}"
                )
            }]
        )
        return response.content[0].text


# ── Entity Memory ──────────────────────────────────────────────────────────────

class EntityMemory:
    """Track and update facts about key entities (users, products, accounts)."""

    def __init__(self):
        self._entities: dict[str, dict] = {}

    def update_entity(self, entity_id: str, new_info: dict) -> None:
        """Merge new information about an entity."""
        if entity_id not in self._entities:
            self._entities[entity_id] = {}
        self._entities[entity_id].update(new_info)
        self._entities[entity_id]["last_updated"] = datetime.now(timezone.utc).isoformat()

    def get_entity(self, entity_id: str) -> dict:
        return self._entities.get(entity_id, {})

    def extract_and_update(
        self,
        entity_id: str,
        conversation_turn: Message
    ) -> None:
        """Extract entity facts from a conversation turn and update the store."""
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": (
                    f"Extract any new facts about the user from this message. "
                    f"Output as JSON dict of key-value pairs. "
                    f"Only include clear, factual information. "
                    f"Output empty dict {{}} if nothing new.\n\n"
                    f"Message: {conversation_turn['content']}"
                )
            }]
        )
        # Matches the first flat {...} object; nested JSON is not handled
        json_match = re.search(r'\{[^{}]*\}', response.content[0].text)
        if json_match:
            try:
                facts = json.loads(json_match.group())
                if facts:
                    self.update_entity(entity_id, facts)
            except json.JSONDecodeError:
                pass  # Ignore malformed model output


# ── Multi-Session Agent ────────────────────────────────────────────────────────

class MultiSessionAgent:
    """
    A conversational agent that maintains memory across sessions.
    """

    def __init__(self, agent_name: str = "Assistant"):
        self.agent_name = agent_name
        self.episodic_memory = EpisodicMemory()
        self.entity_memory = EntityMemory()
        self._current_session: dict[str, list[Message]] = {}
        self._system_prompts: dict[str, str] = {}

    def start_session(self, user_id: str, initial_message: str) -> str:
        """Start a new session, loading relevant memories."""
        # Retrieve relevant past episodes
        past_episodes = self.episodic_memory.retrieve_relevant_episodes(
            user_id=user_id,
            current_query=initial_message,
            top_k=3
        )

        # Get known facts about this user
        user_facts = self.entity_memory.get_entity(user_id)

        # Build memory context for the system prompt
        memory_context = self._build_memory_context(past_episodes, user_facts)

        system_prompt = f"""You are {self.agent_name}, a helpful assistant.

{memory_context}

Use this context to provide personalized, informed assistance.
If the user refers to something from a past conversation, you can reference it.
"""
        self._current_session[user_id] = []
        # Keep the prompt for the whole session so later turns still see the memories
        self._system_prompts[user_id] = system_prompt

        return self._respond(user_id, initial_message)

    def continue_session(self, user_id: str, message: str) -> str:
        """Continue an existing session."""
        if user_id not in self._current_session:
            return self.start_session(user_id, message)

        # Extract entity facts from user messages
        self.entity_memory.extract_and_update(
            user_id,
            {"role": "user", "content": message}
        )

        return self._respond(user_id, message)

    def end_session(self, user_id: str) -> None:
        """End the session and store it in episodic memory."""
        self._system_prompts.pop(user_id, None)
        if user_id in self._current_session:
            conversation = self._current_session.pop(user_id)
            if conversation:
                self.episodic_memory.store_episode(
                    user_id=user_id,
                    conversation=conversation,
                    metadata={"session_end": datetime.now(timezone.utc).isoformat()}
                )
                print(f"[Memory] Session stored for user {user_id}")

    def _respond(self, user_id: str, message: str) -> str:
        session = self._current_session.get(user_id, [])
        session.append({"role": "user", "content": message})

        # Apply context window management
        managed_session = summarize_old_messages(session, max_tokens=8000)

        kwargs = {
            "model": "claude-opus-4-6",
            "max_tokens": 1024,
            "messages": managed_session
        }
        system_prompt = self._system_prompts.get(user_id)
        if system_prompt:
            kwargs["system"] = system_prompt

        response = client.messages.create(**kwargs)
        assistant_reply = response.content[0].text

        session.append({"role": "assistant", "content": assistant_reply})
        self._current_session[user_id] = session

        return assistant_reply

    def _build_memory_context(
        self,
        past_episodes: list[dict],
        user_facts: dict
    ) -> str:
        parts = []

        if user_facts:
            facts_text = "\n".join(
                f"- {k}: {v}" for k, v in user_facts.items()
                if k != "last_updated"
            )
            parts.append(f"Known information about this user:\n{facts_text}")

        if past_episodes:
            episodes_text = "\n".join(
                f"- {ep.get('summary', 'past conversation')}"
                for ep in past_episodes
            )
            parts.append(f"Relevant past conversations:\n{episodes_text}")

        return "\n\n".join(parts) if parts else "No prior conversation history."

LangChain Memory Types

LangChain provides several built-in memory implementations. Here is how the key ones work and when to use each.

# Note: these classes belong to LangChain's legacy memory API. Recent LangChain
# releases mark them as deprecated, so check the docs for your installed version.
from langchain.memory import (
    ConversationBufferMemory,
    ConversationSummaryMemory,
    ConversationSummaryBufferMemory,
    VectorStoreRetrieverMemory,
)
from langchain_anthropic import ChatAnthropic
from langchain.chains import ConversationChain

llm = ChatAnthropic(model="claude-opus-4-6")


# ── 1. Buffer Memory: Keep all messages ────────────────────────────────────────

buffer_memory = ConversationBufferMemory()
# Use for: short conversations, debugging
# Problem: grows unbounded, fills context window


# ── 2. Summary Memory: Compress everything ─────────────────────────────────────

# Updates a running summary after every turn (the token threshold is a
# feature of ConversationSummaryBufferMemory below, not of this class)
summary_memory = ConversationSummaryMemory(llm=llm)
# Use for: long conversations where you need a coherent summary
# Problem: lossy - specific details may be dropped in summarization


# ── 3. Summary Buffer Memory: Best of both ────────────────────────────────────

summary_buffer_memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=1000  # Keep recent messages; summarize older ones
)
# Use for: production conversational agents
# Keeps recent messages verbatim, compresses older ones into summary
# This is the most practical choice for most applications


# ── 4. Vector Store Retriever Memory ─────────────────────────────────────────

from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma(embedding_function=embeddings)

retriever_memory = VectorStoreRetrieverMemory(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    memory_key="relevant_history"
)
# Use for: long-term memory across many sessions
# Retrieves only semantically relevant past messages, not all of them


# ── Building a chain with memory ──────────────────────────────────────────────

chain = ConversationChain(
    llm=llm,
    memory=summary_buffer_memory,
    verbose=True
)

# Conversation persists across calls
response1 = chain.predict(input="My name is Alice and I work at TechCorp.")
response2 = chain.predict(input="What company do I work at?")
# response2 should mention "TechCorp" because the first turn is still in memory

MemGPT: Virtual Context Management

MemGPT (Packer et al., 2023) treats the LLM's context window like an operating system treats RAM - with a virtual memory system that pages information in and out.

The core idea: split memory into tiers. Three zones live inside the context window:

  • System prompt zone: always present (agent persona, core instructions)
  • Working context zone: currently active information
  • FIFO queue zone: recent conversation history

Two stores live outside the context window and are searched on demand:

  • Recall storage: external store of past conversation messages
  • Archival storage: effectively unlimited external storage for anything else worth keeping

The agent has special tools to manage its own memory:

  • recall_memory_search(query) - search past conversations
  • archival_memory_search(query) - search long-term storage
  • archival_memory_insert(content) - save something to long-term storage
  • core_memory_replace(old, new) - update the core facts about the user

# Simplified MemGPT-style memory management
# (reuses EpisodicMemory and the datetime imports from the episodic memory code above)

class VirtualContextManager:
    """
    Manages the agent's context window like an OS manages RAM.
    The agent can explicitly control what stays in context.
    """

    MEMORY_TOOLS = [
        {
            "name": "search_recall",
            "description": "Search past conversation history for relevant information.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "What to search for"}
                },
                "required": ["query"]
            }
        },
        {
            "name": "update_core_memory",
            "description": "Update a key fact in permanent memory about the user or current task.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "key": {"type": "string", "description": "Memory key"},
                    "value": {"type": "string", "description": "New value"}
                },
                "required": ["key", "value"]
            }
        },
        {
            "name": "archive_to_longterm",
            "description": "Save important information to long-term archival storage.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "content": {"type": "string", "description": "Content to archive"},
                    "tags": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "Tags for retrieval"
                    }
                },
                "required": ["content"]
            }
        }
    ]

    def __init__(self, user_id: str = "default"):
        # Scope recall searches to one user instead of a hardcoded ID
        self.user_id = user_id
        self.core_memory: dict[str, str] = {}
        self.recall_store = EpisodicMemory(collection_name="recall")
        self.archive: list[dict] = []

    def get_system_context(self) -> str:
        """Generate the memory-aware system prompt."""
        core_facts = "\n".join(
            f"- {k}: {v}" for k, v in self.core_memory.items()
        ) or "No core facts stored yet."

        return f"""You have a virtual memory system. Your context window is limited.

CORE MEMORY (always available):
{core_facts}

MEMORY TOOLS:
You can use memory tools to retrieve past context or update stored facts.
- search_recall: Find relevant past conversations
- update_core_memory: Update a permanent fact
- archive_to_longterm: Save something important for later

Use these tools proactively to maintain continuity across conversations.
"""

    def execute_memory_tool(self, tool_name: str, tool_input: dict) -> str:
        """Run one memory tool call and return the text result for the model."""
        if tool_name == "search_recall":
            results = self.recall_store.retrieve_relevant_episodes(
                user_id=self.user_id,
                current_query=tool_input["query"],
                top_k=3
            )
            if not results:
                return "No relevant past conversations found."
            return "\n\n".join(
                ep.get("summary", "") for ep in results
            )

        elif tool_name == "update_core_memory":
            key = tool_input["key"]
            value = tool_input["value"]
            self.core_memory[key] = value
            return f"Core memory updated: {key} = {value}"

        elif tool_name == "archive_to_longterm":
            entry = {
                "content": tool_input["content"],
                "tags": tool_input.get("tags", []),
                "timestamp": datetime.now(timezone.utc).isoformat()
            }
            self.archive.append(entry)
            return f"Archived successfully. Total archive size: {len(self.archive)}"

        return f"Unknown memory tool: {tool_name}"
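
The paging idea itself, where the oldest messages leave the FIFO queue for external recall storage once the context budget is exceeded, can be sketched without any API calls. This is a minimal illustration and not MemGPT's actual implementation: `PagedContext`, its character-based token estimate, and the substring search are all assumptions made for the example.

```python
from collections import deque


class PagedContext:
    """Keep recent messages in a bounded queue; page older ones out to recall storage."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.queue: deque[dict] = deque()   # in-context FIFO queue
        self.recall: list[dict] = []        # external store (a plain list here)

    @staticmethod
    def _tokens(message: dict) -> int:
        # Rough estimate: 1 token per 4 characters, at least 1
        return max(1, len(message["content"]) // 4)

    def used_tokens(self) -> int:
        return sum(self._tokens(m) for m in self.queue)

    def append(self, message: dict) -> None:
        """Add a message, evicting the oldest ones while over budget."""
        self.queue.append(message)
        while self.used_tokens() > self.max_tokens and len(self.queue) > 1:
            self.recall.append(self.queue.popleft())

    def search_recall(self, query: str) -> list[dict]:
        """Naive substring search; a real system would use embeddings."""
        q = query.lower()
        return [m for m in self.recall if q in m["content"].lower()]


ctx = PagedContext(max_tokens=10)
ctx.append({"role": "user", "content": "alpha " + "x" * 14})      # ~5 tokens
ctx.append({"role": "assistant", "content": "beta " + "x" * 15})  # ~5 tokens
ctx.append({"role": "user", "content": "gamma " + "x" * 14})      # pushes "alpha" out

print(len(ctx.queue), len(ctx.recall))
print(ctx.search_recall("alpha")[0]["content"][:5])
```

Evicted messages are not lost: they stay searchable, which is what lets the agent page them back in through a tool such as `search_recall`.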

Memory Privacy and Production Considerations

Data Retention Policies

Not all memory should be kept forever. Implement retention policies:

import re
from datetime import timedelta

# Illustrative patterns only; production systems need a dedicated PII detector
PII_PATTERNS = {
    "CARD NUMBER": r'\b\d{16}\b',  # 16 contiguous digits (misses spaced or dashed formats)
    "SSN": r'\b\d{3}-\d{2}-\d{4}\b',
    "EMAIL": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
}


class MemoryRetentionPolicy:
    def __init__(
        self,
        episodic_retention_days: int = 90,
        semantic_retention_days: int = 365,
        pii_detection: bool = True
    ):
        self.episodic_retention = timedelta(days=episodic_retention_days)
        self.semantic_retention = timedelta(days=semantic_retention_days)
        self.pii_detection = pii_detection

    def should_store(self, content: str) -> tuple[bool, str]:
        """Check whether content is free of obvious PII and can be stored as-is."""
        if self.pii_detection:
            for pattern in PII_PATTERNS.values():
                if re.search(pattern, content):
                    return False, "Contains PII that should not be stored"
        return True, "OK"

    def redact_pii(self, content: str) -> str:
        """Redact PII before storing."""
        for label, pattern in PII_PATTERNS.items():
            content = re.sub(pattern, f"[{label} REDACTED]", content)
        return content

GDPR: Right to Erasure

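class GDPRCompliantMemory:
    def __init__(self, collection, entities: dict[str, dict]):
        # collection: a Chroma collection whose records carry user_id metadata
        # entities: the entity-fact store, keyed by user_id
        self.collection = collection
        self._entities = entities

    def delete_user_data(self, user_id: str) -> dict:
        """Implement GDPR right to erasure."""
        deleted = {
            "episodic_episodes": 0,
            "entity_facts": 0,
            "archived_items": 0
        }

        # Delete from vector store (Chroma shown here; other vector DBs differ)
        matches = self.collection.get(where={"user_id": user_id})
        if matches["ids"]:
            self.collection.delete(ids=matches["ids"])
            deleted["episodic_episodes"] = len(matches["ids"])

        # Delete entity facts
        if user_id in self._entities:
            del self._entities[user_id]
            deleted["entity_facts"] = 1

        # archived_items stays 0 here: add your archive store's delete as well
        return deleted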

Common Mistakes

:::danger Storing Entire Raw Conversations Without Summarization

Storing full conversation transcripts as single vector embeddings produces poor retrieval. A single embedding blends every topic in the conversation, so it may not match any specific query well. Chunk conversations into meaningful segments and store each separately.

:::

:::danger Loading All Past Conversations Into Context

Retrieving and injecting all past conversations for a user defeats the purpose of episodic memory. Always use similarity search to retrieve only the top-k most relevant past episodes. The user may have had 1,000 conversations - only 2-3 are relevant right now.

:::

:::warning Memory Without Forgetting

Real cognitive systems forget irrelevant information. An agent that remembers everything equally becomes less useful over time as old, irrelevant memories crowd out new, relevant ones. Implement decay functions or time-bounded retrieval.

:::
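
One simple decay function is an exponential half-life applied to the similarity score, so that an older memory needs a much stronger match to outrank a recent one. The 30-day half-life and the sample scores below are assumptions for illustration; `decayed_score` is a hypothetical helper.

```python
def decayed_score(similarity: float, age_days: float, half_life_days: float = 30.0) -> float:
    """Halve a memory's retrieval score for every `half_life_days` of age."""
    return similarity * 0.5 ** (age_days / half_life_days)


# (label, similarity to the current query, age in days) - made-up values
memories = [
    ("old but very similar", 0.9, 120),
    ("recent and fairly similar", 0.7, 5),
]

ranked = sorted(
    memories,
    key=lambda m: decayed_score(m[1], m[2]),
    reverse=True,
)
for label, similarity, age in ranked:
    print(f"{label}: {decayed_score(similarity, age):.3f}")
```

With these values the recent memory ranks first even though its raw similarity is lower. The half-life is a tuning knob: a long one behaves like plain similarity search, a short one like a recency filter.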

:::warning Not Handling the "Lost in the Middle" Problem

Research shows that LLMs perform worse on information buried in the middle of long contexts. When injecting retrieved memories, put the most relevant ones at the beginning or end of the context, not the middle. Or use structured formats that make key facts easy to find.

:::
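
The "beginning or end" placement can be done mechanically. Given memories already sorted from most to least relevant, the sketch below alternates them between the front and the back of the list, so the weakest matches end up in the middle. `order_for_context` is an illustrative helper, not a library function.

```python
def order_for_context(ranked: list[str]) -> list[str]:
    """
    Reorder items (given most relevant first) so the strongest sit at the
    edges of the prompt and the weakest in the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for i, item in enumerate(ranked):
        # Even ranks fill the front, odd ranks fill the back
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]


# "A" is the most relevant memory, "E" the least
print(order_for_context(["A", "B", "C", "D", "E"]))
# → ['A', 'C', 'E', 'D', 'B']
```

The top match opens the block, the second-best closes it, and the least relevant item lands in the center, where a miss costs the least.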


Interview Q&A

Q: What are the four types of agent memory and how do they differ?

In-context (working) memory is the current prompt - fast, limited, ephemeral. Episodic memory is stored records of past interactions retrieved by semantic similarity - like a diary. Semantic memory is a knowledge base of facts and documents - like an encyclopedia. Procedural memory is baked into model weights through pre-training or fine-tuning - skills that don't need to be retrieved. In production agents, you primarily design and manage episodic and semantic memory; the other two are either automatic (in-context) or expensive to change (procedural).

Q: How do you decide when to retrieve from episodic memory vs. semantic memory?

Episodic memory is for personalization: what did this specific user say, what issues did they have, what preferences did they express? Semantic memory is for domain knowledge: what do your product docs say, what does the API reference contain, what are the company policies? Retrieve from both at the start of each session - episodic for user context, semantic for task context. The queries differ: episodic retrieval uses the current message as the query, semantic retrieval uses the user's intent as the query.

Q: What is the "lost in the middle" problem and how does it affect agent memory design?

LLMs show degraded performance on information placed in the middle of long contexts (Liu et al., 2023). If you retrieve 5 relevant memories and inject them as a block in the middle of the prompt, the model may not use the middle ones effectively. Mitigate by: (1) placing most important memories at the beginning or end, (2) using structured formats with clear labels, (3) limiting the total number of injected memories, (4) using more capable models with better long-context performance.

Q: How does MemGPT work differently from standard memory approaches?

Standard approaches inject retrieved memories passively - you retrieve and paste. MemGPT gives the agent active control over its own memory: the agent can search, store, and update memories as part of its normal tool use. This mirrors how humans consciously decide what to remember. The agent can say "this user's name is Alice, I should store that in core memory" and explicitly write it to persistent storage. This is more flexible but requires the model to use memory tools appropriately, which not all models do reliably.

Q: What privacy considerations matter for agent memory systems?

Key considerations: data retention (how long do you keep memories?), PII handling (never store unredacted credit cards, SSNs, or health information), right to erasure (GDPR requires you be able to delete all data for a user), access control (one user should not be able to retrieve another user's memories), and data minimization (store what is needed, not everything). Always encrypt stored memories at rest and in transit. Audit log all memory reads and writes.

Q: How would you implement memory for a multi-tenant agent system?

Use strict user-scoped storage with user_id as a mandatory filter on all queries. Never use a global vector collection without user-level filtering - this creates a risk of cross-user retrieval. Implement per-user namespace isolation in your vector database. Use separate encryption keys per user for at-rest encryption. Implement rate limiting on memory retrieval to prevent one user from degrading performance for others.
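
The "mandatory filter" rule is easiest to enforce when the storage interface has no unscoped query at all. The sketch below is a toy in-memory version with keyword scoring standing in for vector search; `TenantScopedMemory` and its methods are hypothetical names used only for this example.

```python
class TenantScopedMemory:
    """In-memory store where every read and write requires a user_id."""

    def __init__(self) -> None:
        self._records: list[dict] = []

    def add(self, user_id: str, text: str) -> None:
        if not user_id:
            raise ValueError("user_id is required")
        self._records.append({"user_id": user_id, "text": text})

    def search(self, user_id: str, query: str, top_k: int = 3) -> list[str]:
        if not user_id:
            raise ValueError("user_id is required")

        # Filter by tenant BEFORE scoring, so other users' data is never ranked
        scoped = [r for r in self._records if r["user_id"] == user_id]

        terms = set(query.lower().split())
        scored = [
            (len(terms & set(r["text"].lower().split())), r["text"])
            for r in scoped
        ]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [text for _, text in scored[:top_k]]


store = TenantScopedMemory()
store.add("alice", "negotiated a custom enterprise discount")
store.add("bob", "reported a billing issue with invoices")

print(store.search("alice", "enterprise discount"))
print(store.search("bob", "enterprise discount"))  # bob never sees alice's record
```

With a real vector database the same shape applies: wrap the client so callers cannot issue a query without the user filter or namespace.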

© 2026 EngineersOfAI. All rights reserved.