
In-Context Working Memory

Reading time: 30 min  |  Level: Intermediate–Advanced  |  Relevance: AI Engineer, ML Engineer, Backend Engineer

The Finite Window Problem

It is 11pm. A production incident has been escalating for three hours. Your AI coding agent has been in the same conversation the entire time - debugging logs, writing fixes, running tests. Ninety messages in. Fourteen tool calls. Six code revisions.

Then the agent starts forgetting things. It re-introduces a bug it fixed an hour ago. It asks for the database schema it already retrieved. It contradicts a decision made sixty messages back.

The agent is not broken. The context window is full.

Every LLM call starts fresh. The model has no memory outside what you explicitly put in its input. When you send a conversation to the API, you are sending the entire history every single time. As that history grows, three things happen: the cost of each call grows linearly with history length (so the cumulative cost of a session grows roughly quadratically), latency grows, and - most dangerously - quality degrades.

Managing the context window is one of the most critical and most underestimated skills in agent engineering. This lesson covers it exhaustively: what goes in the window, how it fills up, how to measure it, and the strategies - sliding window, selective retention, summarization, hierarchical compression - that keep quality high as conversations grow long.



Why This Exists: The Stateless API Problem

LLMs are stateless. Each API call is independent. When you call client.messages.create(), the model has no memory of any previous call. You are responsible for providing all context the model needs on every single call.

This design was intentional: statelessness makes models easier to scale horizontally. You can route any call to any instance. No session affinity required.

But it creates a fundamental tension for agent use cases. Agents need continuity. They need to remember what they discussed earlier, what tools they called, what decisions were made. The only way to provide this continuity is to include conversation history in every API call.

The problem: conversation history grows without bound. Every message you add to history costs tokens. The LLM attention mechanism processes all tokens at once - so longer inputs cost more compute and more money. And at some point, you hit the hard limit of the context window.
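To make the growth concrete, here is a minimal sketch that simulates resending the full history on every call. The function name and the per-turn token figure are assumptions chosen only for illustration; real exchanges vary widely in size.

```python
def simulate_history_growth(turns: int, tokens_per_turn: int = 500) -> list[dict]:
    """Simulate input size when the full history is resent on every call.

    tokens_per_turn is an assumed average size of one user + assistant exchange.
    """
    rows = []
    cumulative = 0
    for turn in range(1, turns + 1):
        # Each call carries every earlier exchange plus the new one.
        input_tokens = turn * tokens_per_turn
        cumulative += input_tokens
        rows.append(
            {
                "turn": turn,
                "input_tokens": input_tokens,
                "cumulative_input_tokens": cumulative,
            }
        )
    return rows


if __name__ == "__main__":
    for row in simulate_history_growth(turns=5):
        print(row)
```

Under these assumptions the input for a single call grows linearly with the turn number, while the total number of input tokens billed over the session grows with the square of the number of turns.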

Pre-attention era (before 2017): RNN-based models maintained hidden states across sequence steps, giving them implicit memory. But this memory was lossy and degraded badly for long sequences - the vanishing gradient problem.

Early transformer era (2017–2022): Attention over all positions solved long-range dependencies within a document but did not solve cross-call persistence. Each inference call was still isolated.

2023 onward: Context windows expanded dramatically - GPT-4 went from 8K to 128K, Claude from 9K to 200K, Gemini to 1M+. But bigger windows revealed a new problem: quality degrades even when relevant content fits in the window. The "lost in the middle" problem meant models performed worse on retrieval tasks when relevant content was in the middle of long contexts.

The community response: context management is not just about fitting more in. It is about managing what goes in carefully.


What Lives in the Context Window

A well-structured context window for an agent looks like this:

┌──────────────────────────────────────────────┐
│ SYSTEM PROMPT                                │
│  - Agent identity and role                   │
│  - Permanent rules and constraints           │
│  - Tool definitions                          │
│  - Retrieved memories (injected)             │
├──────────────────────────────────────────────┤
│ CONVERSATION HISTORY                         │
│  [user] → [assistant] → [tool] → [assistant] │
│  [user] → [assistant] → ...                  │
├──────────────────────────────────────────────┤
│ CURRENT USER MESSAGE                         │
└──────────────────────────────────────────────┘

System prompt (permanent working memory): Everything the agent needs to know that does not change during the conversation. This is the most valuable real estate in the context window - the model attends to it on every call. Invest time in making it dense with relevant, well-organized information.

Conversation history (short-term working memory): The back-and-forth that has happened so far. Includes user messages, assistant responses, and tool call results. This grows every turn and is where the token pressure comes from.

Tool definitions: If you are using function calling, tool schemas live in the context. Each tool schema costs tokens. A complex tool suite with ten tools might cost 3,000–5,000 tokens.

Retrieved memories: Content pulled from episodic, semantic, and procedural stores. These are dynamically injected and should be included at the top of the system prompt where the model attends most strongly.
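The four parts described above can be assembled into a single request payload. The following is a sketch only: `build_request` and its parameter names are hypothetical, and the returned dict simply mirrors the `system` / `tools` / `messages` split used by chat-style APIs rather than any specific SDK call.

```python
def build_request(
    base_prompt: str,
    retrieved_memories: list[str],
    tool_schemas: list[dict],
    history: list[dict],
    user_message: str,
) -> dict:
    """Assemble system prompt, tools, history, and the current message."""
    system_parts = []
    if retrieved_memories:
        # Injected memories go at the top of the system prompt.
        memory_block = "\n".join(f"- {m}" for m in retrieved_memories)
        system_parts.append(f"## Retrieved memories\n{memory_block}")
    system_parts.append(base_prompt)

    return {
        "system": "\n\n".join(system_parts),
        "tools": tool_schemas,
        # The current user message is always the last entry.
        "messages": history + [{"role": "user", "content": user_message}],
    }


if __name__ == "__main__":
    request = build_request(
        base_prompt="You are a debugging assistant.",
        retrieved_memories=["User prefers Python 3.12"],
        tool_schemas=[],
        history=[],
        user_message="Why is my test failing?",
    )
    print(request["system"])
```

Keeping assembly in one function makes it easy to measure each part separately, which matters once you start budgeting tokens per section.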


Context Window Sizes in Practice

Different models offer different context limits. More is not always better.

| Model | Context Window | Practical Limit | Notes |
|---|---|---|---|
| GPT-4o | 128K tokens | ~80K effective | Quality degrades in middle of long contexts |
| Claude Opus 4 | 200K tokens | ~150K effective | Stronger long-context performance than GPT-4 |
| Gemini 1.5 Pro | 1M tokens | ~500K effective | Best long-context model currently available |
| Claude Haiku | 200K tokens | ~100K effective | Cheap for high-volume agent calls |
| GPT-4o Mini | 128K tokens | ~60K effective | Cost-optimized for simple tasks |

Why bigger is not always better:

  1. Attention dilution: The attention mechanism distributes finite attention weight across all tokens. More tokens means less attention per token. Relevant content in a very long context gets less attention than the same content in a shorter context.

  2. Cost: Tokens cost money. A 200K token input at GPT-4o pricing costs roughly $2.00 per call. For an agent making 50 calls in a session, that is $100 in context cost alone.

  3. Latency: Time-to-first-token scales with input length. For interactive agents, latency matters.

  4. False security: Developers with large context windows often skip memory management entirely, then discover quality problems at scale when context is always full.
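The cost point is simple arithmetic, sketched below. The rate of $10 per million input tokens is an assumption back-derived from the $2.00-per-call figure above, not a quoted price; substitute your provider's current rate.

```python
def session_input_cost(
    tokens_per_call: int,
    calls: int,
    usd_per_million_tokens: float,
) -> float:
    """Input-token cost of a session in which every call sends tokens_per_call."""
    per_call = tokens_per_call / 1_000_000 * usd_per_million_tokens
    return per_call * calls


if __name__ == "__main__":
    # Assumed rate: $10 per 1M input tokens (illustrative only).
    total = session_input_cost(tokens_per_call=200_000, calls=50, usd_per_million_tokens=10.0)
    print(f"Session input cost: ${total:.2f}")
```

Halving the average context size halves this number, which is why compression usually pays for its own extra calls.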


Measuring Context: Token Counting

Never guess at token counts. Measure them. Different models use different tokenizers.

"""
Token counting across different model providers.
Install: pip install anthropic tiktoken
"""

import anthropic
import tiktoken  # For OpenAI models


# ── Anthropic token counting (official API method) ────────────────
def count_anthropic_tokens(
    system: str,
    messages: list[dict],
    model: str = "claude-opus-4-6",
) -> int:
    """
    Use the Anthropic API's official token counting endpoint.
    More accurate than character-based estimation.
    """
    client = anthropic.Anthropic()
    response = client.messages.count_tokens(
        model=model,
        system=system,
        messages=messages,
    )
    return response.input_tokens


# ── OpenAI token counting (via tiktoken) ──────────────────────────
def count_openai_tokens(
    messages: list[dict],
    model: str = "gpt-4o",
) -> int:
    """
    Count tokens for OpenAI models using tiktoken.
    Reference: https://github.com/openai/openai-cookbook
    """
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to a general-purpose encoding
        encoding = tiktoken.get_encoding("cl100k_base")

    # Every message has 3 overhead tokens; every reply has 3 priming tokens
    num_tokens = 3
    for message in messages:
        num_tokens += 3  # role, content, separator
        for key, value in message.items():
            if isinstance(value, str):
                num_tokens += len(encoding.encode(value))
    return num_tokens


# ── Fast character-based estimation ──────────────────────────────
def estimate_tokens(text: str) -> int:
    """
    Fast estimation: 4 characters ≈ 1 token (English text).
    Accurate to ±15%. Use for quick budget checks.
    """
    return len(text) // 4


# ── Context budget monitor ─────────────────────────────────────────

class ContextBudgetMonitor:
    """
    Tracks token usage against a configurable budget.
    Reports "warning" at 80% and "critical" at 95%; the caller decides
    how to react (compress, refuse the call, and so on).
    """

    def __init__(
        self,
        max_tokens: int = 32_000,
        warn_threshold: float = 0.80,
        hard_limit: float = 0.95,
    ):
        self.max_tokens = max_tokens
        self.warn_threshold = warn_threshold
        self.hard_limit = hard_limit

    def check(
        self,
        system: str,
        messages: list[dict],
    ) -> dict:
        """Check current usage and return status report."""
        # Use character estimation for speed; swap with API count for precision.
        # str() keeps this working when content is a list of content blocks.
        used = estimate_tokens(system) + sum(
            estimate_tokens(str(m.get("content", ""))) for m in messages
        )
        ratio = used / self.max_tokens

        status = "ok"
        if ratio >= self.hard_limit:
            status = "critical"
        elif ratio >= self.warn_threshold:
            status = "warning"

        return {
            "used_tokens": used,
            "max_tokens": self.max_tokens,
            "utilization": ratio,
            "status": status,
            "remaining": self.max_tokens - used,
        }

    def should_compress(self, system: str, messages: list[dict]) -> bool:
        report = self.check(system, messages)
        return report["status"] in ("warning", "critical")


# Demo
if __name__ == "__main__":
    monitor = ContextBudgetMonitor(max_tokens=32_000)
    system = "You are a helpful AI assistant."
    messages = [
        {"role": "user", "content": "Hello, help me with Python."},
        {"role": "assistant", "content": "Sure, what specifically?"},
    ]
    report = monitor.check(system, messages)
    print(f"Context report: {report}")

Context Management Strategies

There is no single right strategy. The correct choice depends on the task type, session length, and quality requirements. Most production agents combine multiple strategies.

Strategy 1: Sliding Window

Keep the last N messages and drop everything older. Simple, predictable, zero additional LLM calls.

Use when: Conversations are largely sequential and recent context is sufficient. Customer service bots, coding assistants for short tasks.

Risk: Important early context (user's stated goal, agreed decisions) gets dropped.

Mitigation: Never drop the first 2–3 messages where goals are usually stated.

def sliding_window(
    messages: list[dict],
    max_messages: int = 20,
    always_keep_first: int = 3,
) -> list[dict]:
    """
    Keep the most recent messages, preserving the opening context.

    Args:
        messages: Full conversation history
        max_messages: Maximum messages to retain
        always_keep_first: Never drop the first N messages (goals, instructions)
    """
    if len(messages) <= max_messages:
        return messages

    # Always preserve opening messages (goals, context, decisions)
    preserved = messages[:always_keep_first]
    # Keep the most recent (max_messages - always_keep_first) messages
    recent_count = max_messages - always_keep_first
    if recent_count <= 0:
        # messages[-0:] would return the whole list, so handle this case explicitly
        return preserved[:max_messages]
    recent = messages[-recent_count:]

    return preserved + recent
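A message-count window ignores how large each message is. One possible variant, sketched here under the assumption of roughly 4 characters per token, fills the window by token budget instead. `sliding_window_by_tokens` is a hypothetical name and is not part of the code above.

```python
def sliding_window_by_tokens(
    messages: list[dict],
    max_tokens: int,
    always_keep_first: int = 3,
) -> list[dict]:
    """Keep the opening messages plus as many recent messages as fit the budget."""

    def estimate(message: dict) -> int:
        # Assumption: ~4 characters per token for English text.
        return len(str(message.get("content", ""))) // 4

    preserved = messages[:always_keep_first]
    budget = max_tokens - sum(estimate(m) for m in preserved)

    recent: list[dict] = []
    # Walk backwards from the newest message, stopping at the preserved prefix.
    for message in reversed(messages[always_keep_first:]):
        cost = estimate(message)
        if cost > budget:
            break  # drop this message and everything older; never cut one in half
        recent.append(message)
        budget -= cost

    recent.reverse()  # restore chronological order
    return preserved + recent
```

Because whole messages are either kept or dropped, a single very large tool result can evict a lot of history; clipping tool results before they enter the history keeps this behaviour predictable.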

Strategy 2: Importance-Based Selective Retention

Score each message by importance and retain the highest-scoring ones. More intelligent than sliding window - keeps what matters, not just what is recent.

Use when: Conversations have a mix of high-signal messages (decisions, errors, key facts) and low-signal messages (acknowledgments, pleasantries, intermediate steps).

def compute_message_importance(message: dict) -> float:
    """
    Heuristic importance score for a conversation message.
    Production: replace with LLM-based scoring for higher accuracy.
    """
    content = message.get("content", "")
    if not isinstance(content, str):
        content = str(content)

    score = 0.5  # Baseline

    # High-signal keywords suggest important content
    high_signal = [
        "error", "failed", "exception", "traceback",
        "decision", "requirement", "must", "never",
        "important", "critical", "constraint",
        "user wants", "goal is", "objective",
    ]
    lowered = content.lower()
    for keyword in high_signal:
        if keyword in lowered:
            score += 0.1

    # Longer messages tend to be more substantive
    word_count = len(content.split())
    if word_count > 200:
        score += 0.15
    elif word_count > 50:
        score += 0.05

    # Code blocks are always important
    if "```" in content:
        score += 0.2

    # Tool results (often contain factual data)
    if message.get("role") == "tool":
        score += 0.15

    # Cap at 1.0
    return min(score, 1.0)


def selective_retention(
    messages: list[dict],
    target_count: int = 15,
    always_keep_first: int = 3,
    always_keep_last: int = 5,
) -> list[dict]:
    """
    Retain the highest-importance messages within a target count.
    Always preserves opening and recent messages.
    """
    if len(messages) <= target_count:
        return messages

    # Fixed anchors - never drop these
    first_messages = messages[:always_keep_first]
    # Guard the zero case: messages[-0:] would return the whole list
    last_messages = messages[-always_keep_last:] if always_keep_last > 0 else []
    middle = messages[always_keep_first:len(messages) - always_keep_last]

    # Score and sort middle messages
    middle_slots = target_count - always_keep_first - always_keep_last
    if middle_slots <= 0 or not middle:
        return first_messages + last_messages

    scored = [(compute_message_importance(m), i, m) for i, m in enumerate(middle)]
    # Rank by score (descending); ties go to the earlier message
    scored.sort(key=lambda x: (-x[0], x[1]))
    # Take the top slots, then restore chronological order by original index
    kept = sorted(scored[:middle_slots], key=lambda x: x[1])
    kept_messages = [m for _, _, m in kept]

    return first_messages + kept_messages + last_messages

Strategy 3: LLM-Based Summarization

When context gets large, use a separate (cheap) LLM call to compress old messages into a summary. Inject the summary in place of the dropped messages.

Use when: Sessions run very long (50+ turns). Tasks where context from early in the conversation remains relevant much later.

Cost: One extra LLM call per summarization event. Use a cheap model (Claude Haiku, GPT-4o Mini) for summarization.

import anthropic


class SummarizationCompressor:
    """
    Compresses old conversation history using an LLM.
    Uses a cheap fast model for summarization calls.
    """

    SUMMARIZATION_MODEL = "claude-haiku-4-5"
    SUMMARIZATION_PROMPT = """You are compressing conversation history for an AI agent.
Summarize the provided conversation messages into a concise block that preserves:
1. All decisions made and their rationale
2. Key facts discovered or established
3. Errors encountered and how they were resolved
4. The user's goals and stated preferences
5. Current state of any ongoing tasks

Be dense and precise. Omit small talk, acknowledgments, and intermediate exploration that led nowhere.
Output only the summary, no preamble."""

    def __init__(self):
        self.client = anthropic.Anthropic()
        self.summary_history: list[str] = []  # Track all past summaries

    def summarize_messages(self, messages: list[dict]) -> str:
        """Compress a list of messages into a summary string."""
        # Format messages for the summarization prompt
        conversation_text = "\n".join(
            f"[{m['role'].upper()}]: {m.get('content', '')}"
            for m in messages
        )

        response = self.client.messages.create(
            model=self.SUMMARIZATION_MODEL,
            max_tokens=512,
            system=self.SUMMARIZATION_PROMPT,
            messages=[{"role": "user", "content": conversation_text}],
        )
        return response.content[0].text

    def compress_history(
        self,
        messages: list[dict],
        keep_recent: int = 10,
        compress_threshold: int = 30,
    ) -> list[dict]:
        """
        Compress messages older than keep_recent into a summary block.
        Returns modified message list with summary prepended.
        """
        if len(messages) < compress_threshold:
            return messages

        # Split: old messages to compress, recent messages to keep
        compress_count = len(messages) - keep_recent
        to_compress = messages[:compress_count]
        to_keep = messages[-keep_recent:]

        print(f"[Summarizer] Compressing {compress_count} messages → summary block")
        summary = self.summarize_messages(to_compress)
        self.summary_history.append(summary)

        # Inject the summary as a user message at the start of the history
        summary_message = {
            "role": "user",
            "content": (
                f"[CONVERSATION HISTORY SUMMARY]\n"
                f"The following summarizes earlier parts of our conversation:\n\n"
                f"{summary}\n\n"
                f"[END OF SUMMARY - continuing conversation below]"
            ),
        }
        # Add a matching assistant acknowledgment to maintain alternating roles
        ack_message = {
            "role": "assistant",
            "content": "Understood. I have reviewed the conversation history summary.",
        }

        return [summary_message, ack_message] + to_keep

Strategy 4: Hierarchical Compression

For very long sessions (100+ turns), maintain multiple compression levels: detailed recent messages, compressed middle section, and a high-level abstract summary of the full session.

from dataclasses import dataclass, field


@dataclass
class HierarchicalContext:
    """
    Multi-level context structure for very long agent sessions.

    Level 0 (session_summary): One paragraph summary of the entire session
    Level 1 (period_summaries): Summaries of session periods (every 30 messages)
    Level 2 (recent_messages): Full detail of the last N messages
    """
    session_summary: str = ""
    period_summaries: list[str] = field(default_factory=list)
    recent_messages: list[dict] = field(default_factory=list)
    total_messages_processed: int = 0

    def to_context_messages(self) -> list[dict]:
        """Convert hierarchical structure to flat message list for API call."""
        context_parts = []

        if self.session_summary:
            context_parts.append(
                f"[SESSION OVERVIEW]\n{self.session_summary}"
            )

        if self.period_summaries:
            periods = "\n\n".join(
                f"Period {i+1}:\n{s}"
                for i, s in enumerate(self.period_summaries)
            )
            context_parts.append(f"[EARLIER CONVERSATION PERIODS]\n{periods}")

        intro_block = {
            "role": "user",
            "content": "\n\n---\n\n".join(context_parts) + "\n\n[RECENT CONVERSATION FOLLOWS]",
        }
        ack_block = {
            "role": "assistant",
            "content": "Understood. I am reviewing the full context.",
        }

        if context_parts:
            return [intro_block, ack_block] + self.recent_messages
        return self.recent_messages

    def token_estimate(self) -> int:
        total_chars = len(self.session_summary)
        total_chars += sum(len(s) for s in self.period_summaries)
        total_chars += sum(len(m.get("content", "")) for m in self.recent_messages)
        return total_chars // 4
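The structure above does not show how messages move from one level to the next. One possible update rule is sketched below: once enough messages accumulate beyond the recent window, the oldest period is rolled into a summary. `RollingHierarchy` and its fields are hypothetical, the summarizer is passed in as a plain callable so the sketch runs without an API key, and refreshing the session-level summary is left out.

```python
from dataclasses import dataclass, field
from typing import Callable

# Any function that turns a list of messages into a summary string.
Summarizer = Callable[[list[dict]], str]


@dataclass
class RollingHierarchy:
    summarize: Summarizer
    keep_recent: int = 10   # messages always kept in full detail
    period_size: int = 30   # messages rolled into one period summary
    period_summaries: list[str] = field(default_factory=list)
    recent_messages: list[dict] = field(default_factory=list)

    def add(self, message: dict) -> None:
        self.recent_messages.append(message)
        # Roll up the oldest period once a full period sits beyond the recent window.
        while len(self.recent_messages) >= self.keep_recent + self.period_size:
            period = self.recent_messages[: self.period_size]
            self.recent_messages = self.recent_messages[self.period_size :]
            self.period_summaries.append(self.summarize(period))


if __name__ == "__main__":
    # Stand-in summarizer for the demo; a real one would call a cheap model.
    hierarchy = RollingHierarchy(
        summarize=lambda msgs: f"{len(msgs)} messages summarized",
        keep_recent=2,
        period_size=3,
    )
    for i in range(7):
        hierarchy.add({"role": "user", "content": f"message {i}"})
    print(hierarchy.period_summaries, len(hierarchy.recent_messages))
```

Rolling up in fixed-size periods keeps the number of summarization calls proportional to session length rather than to the number of turns.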

System Prompt Design: Permanent Working Memory

The system prompt is the most important part of the context window. Unlike conversation history, it does not grow over time. It is permanent for the session. Treat it as your agent's constitution - dense, clear, carefully structured.

What belongs in the system prompt:

SYSTEM_PROMPT_TEMPLATE = """
You are {agent_name}, an AI assistant specializing in {domain}.

## Identity and Role
{role_description}

## Core Constraints (NEVER violate these)
- {constraint_1}
- {constraint_2}

## User Context
{user_profile} ← injected from episodic memory

## Domain Knowledge
{relevant_facts} ← injected from semantic memory

## Applicable Workflow
{skill_steps} ← injected from procedural memory

## Response Style
- Be concise but complete
- Use {preferred_format} format
- When uncertain, say so explicitly

## Current Date and Context
Date: {current_date}
Session ID: {session_id}
"""

What does not belong in the system prompt:

  • Temporary task state (put in conversation history)
  • Full document content (too large; use RAG and inject excerpts)
  • Speculative instructions ("you might need to...")
  • Redundant examples (one example per behavior is enough)

The Lost-in-the-Middle Problem in Practice

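Liu et al. (2023) ran experiments where they placed a relevant document at different positions in a long context (beginning, middle, end) and measured retrieval accuracy. Accuracy followed a U-shaped curve: highest at the edges, lowest in the middle. The figures below are rough illustrations of that pattern; exact values vary by model, task, and context length: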

  • Beginning of context: ~95% accuracy
  • End of context: ~90% accuracy
  • Middle of context: ~60% accuracy

For agents, this means the order of injected memory content matters as much as what you inject.

Placement heuristics:

  • The single most critical piece of information should go at the very top of the system prompt or the very bottom of the message history (just before the user message).
  • If injecting multiple memory blocks, place the most task-relevant one closest to the current message.
  • Avoid the dead zone: roughly the middle third of a long context window.
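One way to apply these heuristics when several memory blocks are injected together is to push the highest-ranked blocks to the edges and let the weakest ones fall in the middle. This is a sketch under assumptions: `place_at_edges` is a hypothetical helper, and the relevance scores are assumed to come from your retrieval step.

```python
def place_at_edges(blocks: list[tuple[float, str]]) -> list[str]:
    """Order (relevance, text) blocks so the strongest sit at the edges.

    The most relevant block ends up last (closest to the current message),
    the second most relevant first, and weaker blocks fill the middle.
    """
    ranked = sorted(blocks, key=lambda b: b[0], reverse=True)
    front: list[str] = []
    back: list[str] = []
    for rank, (_, text) in enumerate(ranked):
        # Alternate between the end and the start, working inward.
        if rank % 2 == 0:
            back.append(text)
        else:
            front.append(text)
    return front + back[::-1]


if __name__ == "__main__":
    ordered = place_at_edges(
        [(0.9, "schema"), (0.5, "style guide"), (0.7, "last decision"), (0.1, "greeting")]
    )
    print(ordered)
```

With only one or two blocks this reduces to putting the best block last, which matches the simpler rule of keeping the most task-relevant content nearest the current message.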

Full Implementation: Production Context Manager

"""
Production-grade context manager for long-running agent sessions.
Combines token budgeting, sliding window, importance scoring,
and LLM-based summarization.
"""

from __future__ import annotations
import time
from dataclasses import dataclass
from enum import Enum
import anthropic


class CompressionStrategy(Enum):
    NONE = "none"
    SLIDING_WINDOW = "sliding_window"
    IMPORTANCE = "importance"
    SUMMARIZE = "summarize"


@dataclass
class ContextConfig:
    """Configuration for the context manager."""
    max_tokens: int = 32_000
    warn_at: float = 0.70        # Start sliding window
    summarize_at: float = 0.85   # Trigger LLM summarization
    keep_recent: int = 12        # Messages always preserved (recent)
    keep_first: int = 4          # Messages always preserved (opening)
    summarization_model: str = "claude-haiku-4-5"
    main_model: str = "claude-opus-4-6"

class AgentContextManager:
    """
    Manages context window for long-running agent sessions.

    Features:
    - Real-time token budget monitoring
    - Automatic compression strategy selection
    - LLM-based summarization with compression history
    - Importance-based selective retention
    - Detailed diagnostics and logging
    """

    def __init__(self, config: ContextConfig | None = None):
        self.config = config or ContextConfig()
        self.client = anthropic.Anthropic()

        # State
        self.system_prompt: str = ""
        self.messages: list[dict] = []
        self.summaries: list[dict] = []  # History of compressions
        self.turn_count: int = 0
        self.compression_count: int = 0

    def set_system_prompt(self, prompt: str) -> None:
        self.system_prompt = prompt

    def add_user_message(self, content: str) -> None:
        self.messages.append({"role": "user", "content": content})
        self.turn_count += 1

    def add_assistant_message(self, content: str) -> None:
        self.messages.append({"role": "assistant", "content": content})

    def add_tool_result(self, tool_name: str, result: str) -> None:
        """Add tool call result as a user message."""
        self.messages.append({
            "role": "user",
            "content": f"[Tool: {tool_name}]\n{result}",
        })

    def _estimate_tokens(self) -> int:
        # Character-based estimate (~4 characters per token)
        chars = len(self.system_prompt)
        chars += sum(len(str(m.get("content", ""))) for m in self.messages)
        return chars // 4

    def _utilization(self) -> float:
        return self._estimate_tokens() / self.config.max_tokens

    def _select_strategy(self) -> CompressionStrategy:
        u = self._utilization()
        if u >= self.config.summarize_at:
            return CompressionStrategy.SUMMARIZE
        elif u >= self.config.warn_at:
            return CompressionStrategy.SLIDING_WINDOW
        return CompressionStrategy.NONE

    def _apply_sliding_window(self) -> None:
        keep_first = self.config.keep_first
        keep_recent = self.config.keep_recent
        total_keep = keep_first + keep_recent

        if len(self.messages) <= total_keep:
            return

        dropped = len(self.messages) - total_keep
        first = self.messages[:keep_first]
        recent = self.messages[-keep_recent:]
        self.messages = first + recent
        print(f"[ContextMgr] Sliding window: dropped {dropped} messages")

    def _apply_summarization(self) -> None:
        """Summarize old messages and replace with compressed block."""
        keep_first = self.config.keep_first
        keep_recent = self.config.keep_recent

        compress_end = len(self.messages) - keep_recent
        if compress_end <= keep_first:
            # Not enough messages to compress
            self._apply_sliding_window()
            return

        to_compress = self.messages[keep_first:compress_end]
        preserved_first = self.messages[:keep_first]
        preserved_recent = self.messages[-keep_recent:]

        # Summarize the middle block. Each message is capped at 500 characters
        # in the summarizer input only; the stored history is not truncated.
        conv_text = "\n".join(
            f"[{m['role'].upper()}]: {str(m.get('content', ''))[:500]}"
            for m in to_compress
        )
        summary_response = self.client.messages.create(
            model=self.config.summarization_model,
            max_tokens=600,
            system=(
                "Summarize the following conversation history for an AI agent. "
                "Preserve: decisions made, errors found and resolved, key facts learned, "
                "user preferences, and current task state. Be dense and specific. "
                "Omit pleasantries, redundant explorations, and resolved dead ends."
            ),
            messages=[{"role": "user", "content": conv_text}],
        )
        summary_text = summary_response.content[0].text

        self.compression_count += 1
        self.summaries.append({
            "timestamp": time.time(),
            "messages_compressed": len(to_compress),
            "summary": summary_text,
        })

        summary_block = {
            "role": "user",
            "content": (
                f"[COMPRESSED HISTORY - Block {self.compression_count}]\n"
                f"{summary_text}\n"
                f"[END COMPRESSED HISTORY]"
            ),
        }
        ack_block = {
            "role": "assistant",
            "content": "Understood, I have processed the conversation history summary.",
        }

        self.messages = preserved_first + [summary_block, ack_block] + preserved_recent
        print(
            f"[ContextMgr] Summarized {len(to_compress)} messages → "
            f"compression block #{self.compression_count}"
        )

    def maybe_compress(self) -> None:
        """Check and apply compression if needed."""
        strategy = self._select_strategy()
        if strategy == CompressionStrategy.NONE:
            return
        elif strategy == CompressionStrategy.SLIDING_WINDOW:
            self._apply_sliding_window()
        elif strategy == CompressionStrategy.SUMMARIZE:
            self._apply_summarization()

    def call(self, user_message: str) -> str:
        """
        Add user message, compress if needed, call LLM, store response.
        Returns assistant response text.
        """
        self.add_user_message(user_message)

        # Check and apply compression before the API call
        self.maybe_compress()

        before_tokens = self._estimate_tokens()
        print(
            f"[ContextMgr] Turn {self.turn_count}: "
            f"~{before_tokens:,} tokens ({self._utilization():.0%} utilized)"
        )

        response = self.client.messages.create(
            model=self.config.main_model,
            max_tokens=1024,
            system=self.system_prompt,
            messages=self.messages,
        )
        assistant_text = response.content[0].text
        self.add_assistant_message(assistant_text)
        return assistant_text

    def diagnostics(self) -> dict:
        return {
            "turn_count": self.turn_count,
            "current_messages": len(self.messages),
            "estimated_tokens": self._estimate_tokens(),
            "utilization": f"{self._utilization():.1%}",
            "compressions_applied": self.compression_count,
            "strategy": self._select_strategy().value,
        }


# ── Demo ──────────────────────────────────────────────────────────

def demo_long_conversation():
    """
    Simulate a long agent session that triggers automatic compression.
    """
    config = ContextConfig(
        max_tokens=8_000,  # Small limit to trigger compression quickly in demo
        warn_at=0.65,
        summarize_at=0.80,
        keep_recent=8,
        keep_first=2,
    )

    mgr = AgentContextManager(config)
    mgr.set_system_prompt(
        "You are a Python debugging assistant. Help users find and fix bugs in their code. "
        "Be concise. Reference previous context when relevant."
    )

    # Simulate a realistic debugging session over many turns
    conversation_turns = [
        "Hi, I'm having issues with my Flask app. Getting 500 errors.",
        "The error is in the user authentication module. Here's the traceback: AttributeError: 'NoneType' has no attribute 'id'",
        "The current_user check is failing. I'm using Flask-Login.",
        "Can you show me how to properly protect routes with @login_required?",
        "That worked for the auth routes. Now getting a database connection issue.",
        "The error is: psycopg2.OperationalError: could not connect to server",
        "The DB_URL env var is set. Should I check the connection pool settings?",
        "Setting pool_pre_ping=True fixed the connection drops. Thanks.",
        "New issue: the user registration endpoint is returning 200 but not saving to DB.",
        "Here's the registration handler - is there a commit missing?",
        "Yes! Adding db.session.commit() fixed it. Now testing the password reset flow.",
        "Password reset emails are sending but the token expires too fast.",
        "Setting RESET_TOKEN_EXPIRY to 3600 seconds. Can you help me test end-to-end?",
        "All tests passing. Can you summarize what we fixed today?",
    ]

    print("=" * 60)
    print("CONTEXT MANAGER - LONG SESSION DEMO")
    print("=" * 60)

    for turn_num, user_message in enumerate(conversation_turns, 1):
        print(f"\n--- Turn {turn_num} ---")
        response = mgr.call(user_message)
        print(f"User: {user_message[:70]}...")
        print(f"Agent: {response[:120]}...")

        if turn_num % 5 == 0:
            diag = mgr.diagnostics()
            print(f"\n[Diagnostics] {diag}")

    print("\n\n=== FINAL DIAGNOSTICS ===")
    print(mgr.diagnostics())


if __name__ == "__main__":
    demo_long_conversation()

Practical Token Budget Allocation

Here is a practical token budget for different agent types:

Customer Support Agent (32K context)

System prompt + tools: 4,000 tokens (12.5%)
User profile (episodic): 1,000 tokens (3.1%)
Product knowledge (RAG): 3,000 tokens (9.4%)
Conversation history: 20,000 tokens (62.5%)
Response buffer: 4,000 tokens (12.5%)

Coding Agent (128K context)

System prompt + tools: 5,000 tokens (3.9%)
Codebase context (RAG): 30,000 tokens (23.4%)
File contents (in-context): 40,000 tokens (31.3%)
Conversation history: 40,000 tokens (31.3%)
Response buffer: 13,000 tokens (10.2%)

Research Agent (200K context)

System prompt + tools: 5,000 tokens (2.5%)
Retrieved papers (RAG): 60,000 tokens (30%)
Structured notes: 50,000 tokens (25%)
Conversation history: 70,000 tokens (35%)
Response buffer: 15,000 tokens (7.5%)
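Budgets like these are easier to enforce when they live in code rather than in a document. The sketch below turns percentage shares into token allowances and flags sections that exceed them; `allocate_budget` and `over_budget` are hypothetical helpers, and the shares reuse the customer support figures above.

```python
def allocate_budget(context_window: int, shares: dict[str, float]) -> dict[str, int]:
    """Convert fractional shares of the context window into token allowances."""
    if abs(sum(shares.values()) - 1.0) > 1e-6:
        raise ValueError("shares must sum to 1.0")
    return {name: int(context_window * share) for name, share in shares.items()}


def over_budget(allocation: dict[str, int], usage: dict[str, int]) -> list[str]:
    """Return the sections whose measured usage exceeds their allowance."""
    return [name for name, used in usage.items() if used > allocation.get(name, 0)]


if __name__ == "__main__":
    allocation = allocate_budget(
        32_000,
        {
            "system_and_tools": 0.125,
            "user_profile": 0.03125,
            "product_knowledge": 0.09375,
            "history": 0.625,
            "response_buffer": 0.125,
        },
    )
    print(allocation)
    print(over_budget(allocation, {"history": 22_500, "user_profile": 800}))
```

A section that goes over budget is the natural trigger for compressing that section specifically, rather than the whole context.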

:::danger Never Truncate Mid-Message
When applying any compression strategy, always operate on complete messages - never truncate a message at an arbitrary character limit. Truncated messages confuse the model (incomplete assistant responses look like errors; incomplete tool results look like failures). Always drop entire messages, not partial ones.
:::

:::warning The Summarization Latency Cost
LLM-based summarization adds 1–3 seconds per compression event. This is invisible if you trigger it proactively between turns. It is highly visible if you trigger it during a turn as the user waits. Either trigger compression at the end of a turn (after the response is delivered) or run it asynchronously while streaming the current response.
:::


Interview Questions and Answers

Q: What is the "lost in the middle" problem and how does it affect agent context management?

A: Liu et al. (2023) demonstrated empirically that LLMs perform significantly worse when relevant information is placed in the middle of long contexts, compared to the beginning or end. Accuracy on document retrieval tasks dropped from ~95% at the start to ~60% in the middle. For agents, this means naive context management - prepending all retrieved memories and conversation history in one long block - yields suboptimal results. Practical mitigations: place the most task-critical information at the top of the system prompt or at the bottom of the message history (just before the current user message), keep context as short as possible through aggressive compression, and use explicit formatting headers to help the model locate relevant sections.

Q: How do you choose between sliding window, importance-based retention, and summarization?

A: They serve different session characteristics. Sliding window is appropriate when recent context is sufficient and information access patterns are sequential - later turns rarely need information from much earlier. Use it for customer support and simple task agents. Importance-based retention works better when conversations have highly variable information density - some messages contain critical decisions, others are just acknowledgments. Score and selectively retain the high-value messages. Summarization is appropriate for long, stateful sessions where early decisions remain relevant much later - research sessions, complex multi-step projects. The cost is extra LLM calls, so use a cheap model for compression. In practice, combine all three: sliding window as a baseline, importance scoring to improve what the window keeps, and periodic summarization to compress older blocks when utilization is high.

Q: How do you handle tool results in context management? They can be very large.

A: Tool results are often the biggest context consumers - a database query might return 10,000 rows, a file read might return 50K characters. Strategies: (1) Truncate at a character limit before storing in messages: "Showing first 2,000 of 47,382 characters." (2) Extract and store only the relevant portion - have the agent extract the answer from a large result before storing the result in history. (3) Mark tool results as low-importance in selective retention scoring - tool results are usually consumed immediately and rarely need to be referenced later. (4) For very large file contents, store them in a separate in-memory cache keyed by filename, and inject only excerpts into context when requested. Never store entire large documents in conversation history.
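Strategies (1) and (4) can be sketched in a few lines. `clip_tool_result` and `ToolResultCache` are hypothetical names, and the 2,000-character limit is just the figure from the answer above. The explicit notice matters: it tells the model the cut was deliberate rather than a failed tool call.

```python
def clip_tool_result(result: str, limit: int = 2_000) -> str:
    """Clip a large tool result before it is stored, with an explicit notice."""
    if len(result) <= limit:
        return result
    notice = f"[Showing first {limit:,} of {len(result):,} characters.]"
    return f"{result[:limit]}\n{notice}"


class ToolResultCache:
    """Holds full tool outputs outside the conversation history."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def put(self, key: str, content: str) -> None:
        self._store[key] = content

    def excerpt(self, key: str, start: int = 0, length: int = 1_000) -> str:
        """Return a slice of a cached result for injection into context."""
        return self._store[key][start : start + length]


if __name__ == "__main__":
    cache = ToolResultCache()
    full_output = "row\n" * 12_000
    cache.put("query_results", full_output)          # full copy stays out of context
    stored_in_history = clip_tool_result(full_output)  # clipped copy goes into history
    print(stored_in_history[-50:])
```

The history then carries the clipped copy, and the agent can ask for further excerpts from the cache by key when it needs them.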

Q: At what utilization percentage should you trigger context compression?

A: Start compression earlier than you think. I recommend triggering light compression (sliding window) at 65–70% utilization, and heavy compression (LLM summarization) at 80–85%. Reasons: (1) The model quality starts degrading before you hit the hard limit. (2) You need headroom for the assistant's response tokens - a response of 2,000 tokens needs 2,000 tokens of remaining capacity. (3) Compression itself adds some tokens (the summary block and its acknowledgment). Triggering at 95% leaves no room. The exact thresholds depend on your max_tokens configuration and average response length.
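The headroom argument can be folded directly into the trigger by counting the reserved response tokens as already spent. In this sketch `compression_action` is a hypothetical helper, and the default thresholds are one choice within the ranges recommended above.

```python
def compression_action(
    used_tokens: int,
    max_tokens: int,
    response_reserve: int = 2_000,
    light_at: float = 0.70,
    heavy_at: float = 0.85,
) -> str:
    """Pick a compression level, treating the response reserve as already used."""
    utilization = (used_tokens + response_reserve) / max_tokens
    if utilization >= heavy_at:
        return "summarize"
    if utilization >= light_at:
        return "sliding_window"
    return "none"


if __name__ == "__main__":
    for used in (10_000, 21_000, 26_000):
        print(used, compression_action(used, max_tokens=32_000))
```

Including the reserve means an agent that produces long responses starts compressing earlier than one that answers in a sentence, without changing the thresholds themselves.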

Q: How do you prevent the summarization model from losing critical information?

A: Several techniques working together. (1) Structured summarization prompt: explicitly tell the model what categories of information to preserve - decisions, errors, facts, user preferences, task state. This dramatically reduces information loss compared to generic summarization. (2) Store raw messages before compression and make them accessible for recovery if needed. (3) Compare token counts before and after summarization - if the summary is more than 80% of the original size, the summarization did not help; try again with a stronger compression directive. (4) For truly critical information (user's primary goal, key constraints), inject it into the permanent system prompt where it will never be compressed out. (5) Track what has been summarized and maintain a "key facts" section in the system prompt that gets updated as important facts are discovered - this acts as a permanent record independent of the summary chain.
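Techniques (2) and (3) can be combined in a small wrapper. This is a sketch under assumptions: the function names and the two directive strings are hypothetical, the summarizer is any callable taking the text and a directive, and the size comparison uses the rough 4-characters-per-token estimate.

```python
from typing import Callable

# summarize(text, directive) -> summary; a real one would call a cheap model.
SummarizeFn = Callable[[str, str], str]


def summary_is_effective(original: str, summary: str, max_ratio: float = 0.80) -> bool:
    """True if the summary is at most max_ratio of the original's estimated size."""

    def estimate(text: str) -> int:
        return max(len(text) // 4, 1)  # avoid division by zero on tiny inputs

    return estimate(summary) / estimate(original) <= max_ratio


def compress_with_checks(
    messages: list[dict],
    summarize: SummarizeFn,
    archive: list[dict],
    max_ratio: float = 0.80,
) -> str:
    """Archive raw messages, summarize, and retry once if the result is too long."""
    archive.extend(messages)  # keep the originals recoverable
    original = "\n".join(str(m.get("content", "")) for m in messages)

    summary = summarize(original, "Summarize densely, preserving decisions and facts.")
    if not summary_is_effective(original, summary, max_ratio):
        # Second attempt with a stronger compression directive.
        summary = summarize(original, "Summarize in at most five short bullet points.")
    return summary
```

The archive here is an in-memory list; in a real system it would more likely be a database or object store keyed by session.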

© 2026 EngineersOfAI. All rights reserved.