Working with 128K+ Context Windows in Production

From Capability to Deployment

The academic and engineering work covered in previous lessons - RoPE scaling, YaRN, FlashAttention, lost-in-middle mitigation - has produced a generation of models with genuinely usable long-context capabilities. GPT-4o handles 128K reliably. Claude 3.5 Sonnet at 200K performs synthesis tasks that would have been impossible a year ago. Gemini 1.5 Pro at 1M tokens processes complete codebases.

The question now is engineering: how do you build systems that actually exploit these capabilities efficiently? This lesson is the practitioner's guide - model selection, cost management, prompt structure, multi-turn conversation handling, and integration patterns. Everything you need to deploy long-context applications at production quality.

Model Comparison - Choosing the Right Long-Context Model

Capability and Cost Summary (2024)

Model	Max Context	Input Cost ($/M)	Output Cost ($/M)	Notes
GPT-4o	128K	$5.00	$15.00	Fastest frontier model; cache at $2.50
GPT-4o-mini	128K	$0.15	$0.60	Best cost-efficiency; weaker synthesis
Claude 3.5 Sonnet	200K	$3.00	$15.00	Best at complex synthesis; cache at $0.30
Claude 3 Haiku	200K	$0.25	$1.25	Fast and cheap; weaker at long context
Gemini 1.5 Pro	1M	$3.50	$10.50	Best for massive contexts (100K+)
Gemini 1.5 Flash	1M	$0.075	$0.30	Cheapest at scale
Llama-3.1-70B	128K	Self-hosted	-	No API cost; hardware required
Llama-3.1-8B	128K	Self-hosted	-	Lowest quality at long contexts

Prices approximate as of late 2024; verify current pricing before building cost models.
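To make the table concrete: the input-side cost of one call is the prompt size in millions of tokens times the per-million price. The sketch below hard-codes the approximate input prices from the table; the dictionary and function names are illustrative and not part of any provider SDK.

```python
# Approximate input prices in USD per million tokens, copied from the table above.
# These are the lesson's late-2024 estimates; verify before relying on them.
INPUT_PRICE_PER_M = {
    "GPT-4o": 5.00,
    "GPT-4o-mini": 0.15,
    "Claude 3.5 Sonnet": 3.00,
    "Claude 3 Haiku": 0.25,
    "Gemini 1.5 Pro": 3.50,
    "Gemini 1.5 Flash": 0.075,
}


def input_cost_per_call(model: str, context_tokens: int) -> float:
    """Input-side cost in USD for one call with `context_tokens` of prompt."""
    return context_tokens / 1_000_000 * INPUT_PRICE_PER_M[model]


for model in INPUT_PRICE_PER_M:
    cost = input_cost_per_call(model, 100_000)
    print(f"{model:<20} ${cost:.4f} per 100K-token call")
```

Output tokens are billed separately and are usually a small share of the bill when prompts are this long.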

Effective Performance at Different Context Lengths

The advertised context window and the effective usable context window are different: quality degrades as inputs approach the advertised maximum. Based on RULER and similar benchmarks:

def context_length_performance_guide():
    """
    Estimated effective performance (normalized to perfect = 100)
    for various models at different context lengths.

    Source: Interpolated from public RULER, LongBench, and NIAH benchmarks.
    These are approximations - always benchmark on your specific task.
    """
    models = {
        "GPT-4o (128K)": {
            8_000: 97, 16_000: 95, 32_000: 92, 64_000: 88, 128_000: 82
        },
        "Claude-3.5-Sonnet (200K)": {
            8_000: 98, 16_000: 97, 32_000: 95, 64_000: 92, 128_000: 88, 200_000: 82
        },
        "Gemini-1.5-Pro (1M)": {
            8_000: 96, 32_000: 94, 128_000: 90, 512_000: 83, 1_000_000: 75
        },
        "Llama-3.1-70B (128K)": {
            8_000: 93, 16_000: 90, 32_000: 85, 64_000: 77, 128_000: 68
        },
        "Llama-3.1-8B (128K)": {
            8_000: 88, 16_000: 83, 32_000: 75, 64_000: 63, 128_000: 52
        },
    }

    print("Estimated effective performance by context length:")
    print(f"{'Model':<35} | {'8K':>5} | {'32K':>5} | {'128K':>5}")
    print("-" * 55)
    for model, scores in models.items():
        s8k = scores.get(8_000, "-")
        s32k = scores.get(32_000, "-")
        s128k = scores.get(128_000, "-")
        print(f"{model:<35} | {str(s8k):>5} | {str(s32k):>5} | {str(s128k):>5}")

context_length_performance_guide()
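One way to turn these estimates into a planning number is to ask how long a context can get before the estimated score drops below a quality bar. The sketch below linearly interpolates between the listed points; it assumes scores fall monotonically with length, and the helper name and the threshold of 90 are illustrative choices, not benchmark results.

```python
from typing import Optional


def max_context_at_quality(scores: dict, threshold: float) -> Optional[int]:
    """
    Largest context length (tokens) at which the estimated score stays at or
    above `threshold`, interpolating linearly between listed points.
    Returns None if even the shortest listed context is below the threshold.
    """
    points = sorted(scores.items())
    if points[0][1] < threshold:
        return None
    for (len_a, score_a), (len_b, score_b) in zip(points, points[1:]):
        if score_b < threshold:
            # The threshold is crossed between len_a and len_b: interpolate.
            fraction = (score_a - threshold) / (score_a - score_b)
            return int(len_a + fraction * (len_b - len_a))
    return points[-1][0]  # never drops below the threshold in the listed range


# Estimated GPT-4o scores from the guide above
gpt4o = {8_000: 97, 16_000: 95, 32_000: 92, 64_000: 88, 128_000: 82}
print(max_context_at_quality(gpt4o, threshold=90))  # 48000
```

Because the inputs are approximations, treat the result as a starting point for your own benchmark rather than a limit.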

Model Selection Decision Tree

Prompt Structure for Long-Context Applications

Prompt structure significantly affects model performance at long contexts. The positioning and framing of components within a 100K+ token prompt matter.

The Recommended Structure

[1. System/Role Instructions] - SHORT, SPECIFIC
[2. Query Definition] - What you're asking for
[3. Critical Information] - Information that MUST be attended to
[4. Supporting Documents] - Bulk of the context
[5. Critical Information (repeated)] - Mirror of item 3 (primacy + recency exploit)
[6. Final Query] - The actual question (again, for recency benefit)

def build_long_context_prompt(
    role_instruction: str,
    query: str,
    critical_info: str,
    documents: list[dict],  # [{"title": str, "content": str, "score": float}]
    max_total_tokens: int = 120_000,
    critical_info_at_end: bool = True,
) -> str:
    """
    Build a production-quality long-context prompt.

    Applies primacy + recency benefits by placing critical information
    at both the beginning and end of the context.
    """
    # Build the fixed prompt sections first so their size counts against the budget
    system_section = f"# Instructions\n{role_instruction}\n\n"
    query_section = f"# Task\n{query}\n\n"
    critical_section = f"# Key Information\n{critical_info}\n\n" if critical_info else ""

    # Rough token estimate: ~4 characters per token (use a real tokenizer in production)
    current_tokens = (
        len(system_section + query_section + critical_section) // 4
        + (len(critical_section) // 4 if critical_info_at_end else 0)  # reserve space for end repeat
        + 1000  # buffer for final query
    )

    # Select documents in relevance order until the budget is spent, so that
    # truncation drops the least relevant documents rather than whichever
    # high-ranked document was placed at the end of the reordered list
    selected = []
    for doc in sorted(documents, key=lambda d: -d.get("score", 0)):
        doc_tokens = (len(doc.get("title", "")) + len(doc["content"])) // 4 + 10
        if current_tokens + doc_tokens > max_total_tokens:
            break
        selected.append(doc)
        current_tokens += doc_tokens

    # Boundary placement: alternate the selected documents to front and back,
    # which leaves the weakest ones in the middle
    n = len(selected)
    ordered = [None] * n
    front, back = 0, n - 1
    for i, doc in enumerate(selected):
        if i % 2 == 0:
            ordered[front] = doc
            front += 1
        else:
            ordered[back] = doc
            back -= 1

    doc_parts = [
        f"### Document {i+1}: {doc.get('title', f'Source {i+1}')}\n{doc['content']}\n\n"
        for i, doc in enumerate(ordered)
    ]
    context_section = "# Context Documents\n" + "".join(doc_parts)

    # Final section: repeat critical info + query for recency benefit
    end_section = ""
    if critical_info_at_end and critical_info:
        end_section += f"\n# Reminder: Key Information\n{critical_info}\n\n"
    end_section += f"# Your Task (answer below)\n{query}\n\n## Answer:"

    return system_section + query_section + critical_section + context_section + end_section


# Example usage
prompt = build_long_context_prompt(
    role_instruction=(
        "You are a legal document analyst. Extract and analyze contract terms precisely. "
        "Cite specific document sections. Note any ambiguities or missing clauses."
    ),
    query=(
        "Identify all indemnification clauses across the provided contracts. "
        "For each clause: (1) which party indemnifies which, (2) what events trigger indemnification, "
        "(3) any caps on liability."
    ),
    critical_info=(
        "Focus particularly on events that occur after the effective date (January 15, 2024). "
        "All three contracts must be analyzed - missing any one will result in an incomplete answer."
    ),
    documents=retrieved_contracts,  # Your retrieved contract documents
    max_total_tokens=100_000,
)

What Goes Where in a Long Prompt

Section	Position	Why
System instructions	First	Primacy effect; sets model behavior early
Query definition	Before context	Model knows what to look for while reading
Critical constraints	Before AND after context	Benefits from both primacy and recency
Best-ranked documents	Start and end of document list	Boundary placement for lost-in-middle mitigation
Lower-ranked documents	Middle of document list	Less critical; middle position is acceptable
Final query	Last	Recency effect; model answers what's freshest
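The boundary placement in the table can be seen in isolation with a small helper. This is a standalone sketch of the same alternating front/back placement used in `build_long_context_prompt`; the function name `boundary_order` is introduced here for illustration.

```python
def boundary_order(items_by_relevance: list) -> list:
    """
    Reorder items (already sorted best-first) so the strongest land at the
    start and end of the list and the weakest end up in the middle.
    """
    n = len(items_by_relevance)
    ordered = [None] * n
    front, back = 0, n - 1
    for i, item in enumerate(items_by_relevance):
        if i % 2 == 0:
            ordered[front] = item   # ranks 1, 3, 5, ... fill from the front
            front += 1
        else:
            ordered[back] = item    # ranks 2, 4, 6, ... fill from the back
            back -= 1
    return ordered


print(boundary_order(["rank1", "rank2", "rank3", "rank4", "rank5"]))
# ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```

The two best documents sit at the two boundaries, and the lowest-ranked one lands in the middle, where recall is weakest.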

Cost Management at Scale

Estimating and Controlling Costs

class LongContextCostTracker:
    """
    Track and estimate costs for long-context LLM applications.
    Includes caching optimization and budget alerts.
    """

    # Approximate prices per million tokens (verify current pricing)
    PRICES = {
        "gpt-4o": {"input": 5.0, "output": 15.0, "cached_input": 2.5},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60, "cached_input": 0.075},
        "claude-3-5-sonnet-20241022": {"input": 3.0, "output": 15.0, "cached_input": 0.30},
        "claude-3-haiku-20240307": {"input": 0.25, "output": 1.25, "cached_input": 0.03},
        "gemini-1.5-pro": {"input": 3.5, "output": 10.5, "cached_input": 0.875},
        "gemini-1.5-flash": {"input": 0.075, "output": 0.30, "cached_input": 0.01875},
    }

    def __init__(
        self,
        model: str,
        daily_budget_usd: float = 100.0,
        alert_threshold: float = 0.8,
    ):
        self.model = model
        self.daily_budget = daily_budget_usd
        self.alert_threshold = alert_threshold
        self.prices = self.PRICES.get(model, {"input": 5.0, "output": 15.0, "cached_input": 2.5})
        self.daily_spend = 0.0
        self.total_queries = 0
        self.cache_hits = 0

    def estimate_cost(
        self,
        input_tokens: int,
        output_tokens: int,
        cached_input_tokens: int = 0,
    ) -> float:
        """Estimate cost of a single API call."""
        fresh_input = input_tokens - cached_input_tokens
        cost = (
            fresh_input / 1e6 * self.prices["input"]
            + cached_input_tokens / 1e6 * self.prices["cached_input"]
            + output_tokens / 1e6 * self.prices["output"]
        )
        return cost

    def record_call(
        self,
        input_tokens: int,
        output_tokens: int,
        cached_input_tokens: int = 0,
    ) -> dict:
        """Record an API call and return cost breakdown."""
        cost = self.estimate_cost(input_tokens, output_tokens, cached_input_tokens)
        self.daily_spend += cost
        self.total_queries += 1
        if cached_input_tokens > 0:
            self.cache_hits += 1

        if self.daily_spend > self.daily_budget * self.alert_threshold:
            print(f"⚠ Budget alert: ${self.daily_spend:.2f} spent of ${self.daily_budget:.2f} budget")

        return {
            "cost_usd": round(cost, 6),
            "daily_spend_usd": round(self.daily_spend, 4),
            "budget_remaining_usd": round(self.daily_budget - self.daily_spend, 4),
            "cache_hit_rate": self.cache_hits / self.total_queries if self.total_queries > 0 else 0,
        }

    def project_daily_cost(self, queries_remaining: int) -> float:
        """Project remaining daily cost from the average cost per query so far."""
        avg_cost = self.daily_spend / max(1, self.total_queries)
        return avg_cost * queries_remaining

    def print_summary(self):
        """Print cost tracking summary."""
        print(f"\nCost Summary ({self.model})")
        print(f"  Daily spend: ${self.daily_spend:.4f} / ${self.daily_budget:.2f}")
        print(f"  Total queries: {self.total_queries}")
        print(f"  Cache hit rate: {self.cache_hits/max(1, self.total_queries):.1%}")
        print(f"  Avg cost/query: ${self.daily_spend/max(1, self.total_queries):.6f}")


def optimize_token_usage(
    prompt: str,
    tokenizer,
    target_tokens: int,
    strategy: str = "end",  # "end" = truncate from end, "middle" = remove middle
) -> str:
    """
    Truncate a prompt to target_tokens while preserving the most important parts.

    strategy "end": remove tokens from the end (loses last documents)
    strategy "middle": remove tokens from the middle (preserves boundaries)
    """
    tokens = tokenizer.encode(prompt)
    if len(tokens) <= target_tokens:
        return prompt

    if strategy == "end":
        return tokenizer.decode(tokens[:target_tokens])

    if strategy == "middle":
        # Keep the beginning and end, remove the middle.
        # Split the budget so head + tail == target_tokens even when it is odd.
        head = (target_tokens + 1) // 2
        tail = target_tokens - head
        kept = tokens[:head] + (tokens[-tail:] if tail > 0 else [])
        print(f"Middle truncation: removed {len(tokens) - len(kept):,} tokens from the middle")
        return tokenizer.decode(kept)

    raise ValueError(f"Unknown truncation strategy: {strategy!r}")

Caching Strategy

import hashlib

class PromptCacheManager:
    """
    Manage prompt caching to reduce API costs for repeated long contexts.

    Anthropic and OpenAI both offer prompt caching for prefix segments.
    This class helps structure prompts to maximize cache reuse.
    """

    def __init__(self, cache_prefix_tokens: int = 50_000):
        """
        cache_prefix_tokens: Minimum tokens for a cache-eligible prefix.
        Caching is only beneficial for very long, repeated prefixes.
        """
        self.cache_prefix_tokens = cache_prefix_tokens
        self._cache_hits: dict[str, int] = {}

    def structure_for_caching(
        self,
        static_context: str,
        dynamic_query: str,
    ) -> tuple[str, str]:
        """
        Split prompt into cacheable prefix (static) and dynamic suffix.

        The static_context is the long document/context that doesn't change.
        The dynamic_query changes with each user question.

        Returns (prefix_to_cache, dynamic_suffix)
        """
        # Track cache reuse for cost estimation
        context_hash = hashlib.md5(static_context.encode()).hexdigest()[:8]
        self._cache_hits[context_hash] = self._cache_hits.get(context_hash, 0) + 1

        hit_count = self._cache_hits[context_hash]
        if hit_count == 1:
            print(f"New context [hash:{context_hash}]: first call, no cache saving")
        else:
            print(f"Cache hit [hash:{context_hash}]: call #{hit_count}, prefix reused")

        return static_context, dynamic_query

    def estimate_caching_savings(
        self,
        static_tokens: int,
        n_queries: int,
        model: str = "claude-3-5-sonnet-20241022",
        tracker: LongContextCostTracker | None = None,
    ) -> dict:
        """
        Estimate cost savings from caching a static context.
        """
        prices = LongContextCostTracker.PRICES.get(model, {"input": 5.0, "cached_input": 2.5})

        # Without caching: pay full price n_queries times
        cost_without_cache = n_queries * (static_tokens / 1e6) * prices["input"]

        # With caching: pay full price once, cached price n-1 times
        cost_with_cache = (
            (static_tokens / 1e6) * prices["input"]  # First call: full price
            + (n_queries - 1) * (static_tokens / 1e6) * prices["cached_input"]  # Subsequent: cached
        )

        savings = cost_without_cache - cost_with_cache

        return {
            "without_cache_usd": round(cost_without_cache, 4),
            "with_cache_usd": round(cost_with_cache, 4),
            "savings_usd": round(savings, 4),
            "savings_pct": round(savings / cost_without_cache * 100, 1),
        }

# Example: 100K token document analyzed by 20 users
mgr = PromptCacheManager()
savings = mgr.estimate_caching_savings(
    static_tokens=100_000,
    n_queries=20,
    model="claude-3-5-sonnet-20241022",
)
print(f"Caching savings for 100K document × 20 queries:")
print(f"  Without caching: ${savings['without_cache_usd']:.2f}")
print(f"  With caching:    ${savings['with_cache_usd']:.2f}")
print(f"  Savings:         ${savings['savings_usd']:.2f} ({savings['savings_pct']}%)")
# Savings: $5.13 (85.5% at Claude 3.5 Sonnet pricing: $6.00 without caching vs $0.87 with)
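The estimate above prices the first call at the regular input rate. Anthropic bills cache writes at a premium over the base input price (1.25× at the time of writing; treat that multiplier as an assumption and check current pricing), and cached prefixes expire after a period of inactivity, which this sketch ignores. The function names and defaults below are illustrative.

```python
from typing import Optional


def cached_vs_uncached_cost(
    static_tokens: int,
    n_queries: int,
    input_price: float = 3.00,        # USD per million tokens (table value)
    cached_read_price: float = 0.30,  # USD per million cached tokens (table value)
    write_multiplier: float = 1.25,   # ASSUMPTION: cache-write premium over input price
) -> dict:
    """Input cost of the static prefix over n_queries, with and without caching."""
    m_tokens = static_tokens / 1_000_000
    uncached = n_queries * m_tokens * input_price
    cached = (
        m_tokens * input_price * write_multiplier          # first call writes the cache
        + (n_queries - 1) * m_tokens * cached_read_price   # later calls read it
    )
    return {"uncached_usd": round(uncached, 4), "cached_usd": round(cached, 4)}


def break_even_queries(static_tokens: int, max_queries: int = 1000, **prices) -> Optional[int]:
    """Smallest query count at which caching is cheaper, or None if never."""
    for n in range(1, max_queries + 1):
        costs = cached_vs_uncached_cost(static_tokens, n, **prices)
        if costs["cached_usd"] < costs["uncached_usd"]:
            return n
    return None


print(cached_vs_uncached_cost(100_000, 20))  # {'uncached_usd': 6.0, 'cached_usd': 0.945}
print(break_even_queries(100_000))           # 2
```

Even with the write premium, caching pays for itself on the second query under these prices, provided the second query arrives before the cache entry expires.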

Multi-Turn Conversation Management

Long-context models create a new conversation management challenge: as conversations grow, the context window fills. You need a strategy for what happens when the conversation approaches the context limit.

Strategy 1: Sliding Window

from collections import deque

class SlidingWindowConversation:
    """
    Keep only the N most recent conversation turns in context.
    Simple, predictable, but loses long-term conversation history.
    """

    def __init__(
        self,
        max_turns: int = 10,
        system_prompt: str = "",
        tokenizer = None,
        max_tokens: int = 100_000,
    ):
        self.max_turns = max_turns
        self.system_prompt = system_prompt
        self.tokenizer = tokenizer
        self.max_tokens = max_tokens
        self.turns = deque()  # deque of {"role": str, "content": str}

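    def add_turn(self, role: str, content: str):
        """Add a message, dropping the oldest user/assistant pair while over either limit."""
        self.turns.append({"role": role, "content": content})
        # max_turns counts user + assistant pairs, so allow two messages per turn.
        # max_tokens is enforced as well; the newest messages are always kept.
        while len(self.turns) > 2 and (
            len(self.turns) > self.max_turns * 2
            or self.token_count() > self.max_tokens
        ):
            self.turns.popleft()
            self.turns.popleft()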

    def get_messages(self) -> list[dict]:
        """Return messages for API call."""
        messages = []
        if self.system_prompt:
            messages.append({"role": "system", "content": self.system_prompt})
        messages.extend(list(self.turns))
        return messages

    def token_count(self) -> int:
        """Estimate token count of current conversation."""
        if self.tokenizer is None:
            return sum(len(m["content"]) // 4 for m in self.get_messages())
        return sum(len(self.tokenizer.encode(m["content"])) for m in self.get_messages())

Strategy 2: Hierarchical Summarization

class SummarizingConversation:
    """
    Summarize old turns when context gets too long.
    Preserves long-term context at the cost of some detail loss.
    """

    def __init__(
        self,
        llm_client,
        model: str,
        system_prompt: str = "",
        max_tokens_before_compress: int = 80_000,
        keep_last_n_turns: int = 5,  # Always keep recent turns verbatim
    ):
        self.client = llm_client
        self.model = model
        self.system_prompt = system_prompt
        self.max_tokens = max_tokens_before_compress
        self.keep_last_n = keep_last_n_turns
        self.summary = ""  # Running summary of old conversation
        self.recent_turns = []  # Recent turns kept verbatim

    def estimate_tokens(self, text: str) -> int:
        return len(text) // 4  # Rough approximation

    def add_turn(self, role: str, content: str):
        """Add a turn and compress if needed."""
        self.recent_turns.append({"role": role, "content": content})
        total_tokens = self.estimate_tokens(
            self.summary + " ".join(t["content"] for t in self.recent_turns)
        )

        if total_tokens > self.max_tokens:
            self._compress_old_turns()

    def _compress_old_turns(self):
        """Summarize all but the most recent turns."""
        if len(self.recent_turns) <= self.keep_last_n * 2:
            return  # Don't compress if too few turns

        # Separate old turns (to compress) from recent turns (to keep)
        n_to_compress = len(self.recent_turns) - self.keep_last_n * 2
        old_turns = self.recent_turns[:n_to_compress]
        self.recent_turns = self.recent_turns[n_to_compress:]

        # Summarize the old turns
        conversation_text = "\n".join(
            f"{t['role'].upper()}: {t['content']}"
            for t in old_turns
        )
        if self.summary:
            to_summarize = f"Previous summary:\n{self.summary}\n\nNew conversation:\n{conversation_text}"
        else:
            to_summarize = conversation_text

        summary_prompt = (
            f"Summarize the following conversation, preserving all key facts, "
            f"decisions, and user preferences mentioned:\n\n{to_summarize}"
        )

        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": summary_prompt}],
            max_tokens=1000,
        )
        self.summary = response.choices[0].message.content
        print(f"Compressed {n_to_compress} turns into {self.estimate_tokens(self.summary)} token summary")

    def get_messages(self) -> list[dict]:
        """Build messages list for API call."""
        messages = []
        if self.system_prompt:
            messages.append({"role": "system", "content": self.system_prompt})
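        if self.summary:
            # Carry the summary as system-level context so it is not mistaken
            # for something the assistant said in the live conversation
            messages.append({
                "role": "system",
                "content": f"[Summary of the earlier conversation: {self.summary}]"
            })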
        messages.extend(self.recent_turns)
        return messages

Strategy 3: External Memory

class ExternalMemoryConversation:
    """
    Store conversation history in a vector database.
    At each turn, retrieve the most relevant past turns.

    Best for: very long conversations (100+ turns), factual accuracy,
    needs to recall specific details from much earlier.
    """

    def __init__(
        self,
        vector_store,
        llm_client,
        model: str,
        system_prompt: str = "",
        n_retrieved_turns: int = 5,
        keep_recent_n_turns: int = 3,
    ):
        self.vector_store = vector_store
        self.client = llm_client
        self.model = model
        self.system_prompt = system_prompt
        self.n_retrieved = n_retrieved_turns
        self.keep_recent = keep_recent_n_turns
        self.all_turns = []  # Full history for storage
        self.turn_id = 0

    def add_turn(self, role: str, content: str):
        """Add turn to both memory and vector store."""
        turn = {"id": self.turn_id, "role": role, "content": content}
        self.all_turns.append(turn)
        self.vector_store.upsert(
            id=f"turn_{self.turn_id}",
            text=content,
            metadata={"role": role, "turn_id": self.turn_id},
        )
        self.turn_id += 1

    def get_messages(self, current_query: str) -> list[dict]:
        """Build context by retrieving relevant past turns."""
        # Retrieve relevant past turns
        if len(self.all_turns) > self.keep_recent * 2:
            retrieved = self.vector_store.search(
                query=current_query,
                top_k=self.n_retrieved,
                filter={"turn_id": {"$lt": self.turn_id - self.keep_recent * 2}},
            )
            retrieved_turns = sorted(retrieved, key=lambda x: x["turn_id"])
        else:
            retrieved_turns = []

        # Build message list: system + retrieved context + recent turns
        messages = []
        if self.system_prompt:
            messages.append({"role": "system", "content": self.system_prompt})

        if retrieved_turns:
            retrieved_text = "\n".join(
                f"[Turn {t['turn_id']}] {t['role'].upper()}: {t['content']}"
                for t in retrieved_turns
            )
            messages.append({
                "role": "system",
                "content": f"Relevant earlier conversation turns:\n{retrieved_text}"
            })

        # Add recent turns verbatim
        recent = self.all_turns[-(self.keep_recent * 2):]
        messages.extend({"role": t["role"], "content": t["content"]} for t in recent)

        return messages
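The `vector_store` object above is assumed to expose `upsert(id, text, metadata)` and `search(query, top_k, filter)`, with results that carry `turn_id`, `role`, and `content`. That interface is not a specific product's API. For local testing, a toy in-memory stand-in could look like the sketch below; it uses bag-of-words cosine similarity in place of learned embeddings, and the class name is hypothetical.

```python
import math
import re
from collections import Counter
from typing import Optional


class InMemoryTurnStore:
    """Toy stand-in for a vector database (hypothetical interface, for tests only)."""

    def __init__(self):
        self._records = {}  # id -> {"vector": Counter, "text": str, "metadata": dict}

    @staticmethod
    def _embed(text: str) -> Counter:
        # Bag-of-words "embedding"; a real system would call an embedding model.
        return Counter(re.findall(r"[a-z0-9]+", text.lower()))

    @staticmethod
    def _cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
            sum(v * v for v in b.values())
        )
        return dot / norm if norm else 0.0

    def upsert(self, id: str, text: str, metadata: dict) -> None:
        self._records[id] = {"vector": self._embed(text), "text": text, "metadata": metadata}

    def search(self, query: str, top_k: int = 5, filter: Optional[dict] = None) -> list:
        # Only the {"turn_id": {"$lt": N}} filter shape used above is supported.
        max_turn = (filter or {}).get("turn_id", {}).get("$lt")
        q = self._embed(query)
        scored = []
        for rec in self._records.values():
            if max_turn is not None and rec["metadata"]["turn_id"] >= max_turn:
                continue
            scored.append((self._cosine(q, rec["vector"]), rec))
        scored.sort(key=lambda pair: -pair[0])
        return [
            {
                "turn_id": rec["metadata"]["turn_id"],
                "role": rec["metadata"]["role"],
                "content": rec["text"],
                "score": score,
            }
            for score, rec in scored[:top_k]
        ]


store = InMemoryTurnStore()
store.upsert(id="turn_0", text="My budget is 5000 dollars", metadata={"role": "user", "turn_id": 0})
store.upsert(id="turn_1", text="Noted, a budget of 5000 dollars.", metadata={"role": "assistant", "turn_id": 1})
store.upsert(id="turn_2", text="I prefer window seats", metadata={"role": "user", "turn_id": 2})
hits = store.search(query="what was my budget", top_k=1, filter={"turn_id": {"$lt": 2}})
print(hits[0]["content"])  # My budget is 5000 dollars
```

Swapping in a real vector database means adapting its client to this shape, since result formats and filter syntax differ between products.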

Production Patterns Summary

Pattern 1: Document Analysis with Caching

Best for: analyzing the same document with many different queries (legal review, code review, document audit).

Architecture:
1. Load document into long context (static prefix)
2. Enable prompt caching for the document
3. Query multiple times, changing only the dynamic query suffix
Cost: roughly 10-50% of the full input price per query after the first call, depending on the provider's cached-token rate
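The "enable prompt caching" step can be sketched for Anthropic's Messages API. The `cache_control` block with type `ephemeral` follows Anthropic's prompt caching documentation as of late 2024; the helper name, the instruction text, and the `max_tokens` value are illustrative assumptions. The function only builds keyword arguments, so it runs without credentials.

```python
def build_cached_request(
    document: str,
    question: str,
    model: str = "claude-3-5-sonnet-20241022",
) -> dict:
    """Build kwargs for a Messages API call with the document marked cacheable."""
    return {
        "model": model,
        "max_tokens": 1024,  # illustrative value
        "system": [
            {"type": "text", "text": "You are a careful document analyst."},
            {
                "type": "text",
                "text": document,                        # long static prefix
                "cache_control": {"type": "ephemeral"},  # cache everything up to here
            },
        ],
        # Only this part changes between queries, so the prefix stays identical.
        "messages": [{"role": "user", "content": question}],
    }


first = build_cached_request("...full contract text...", "List the termination clauses.")
second = build_cached_request("...full contract text...", "Who bears indemnification risk?")
print(first["system"] == second["system"])  # True
```

With the official SDK the result would be passed as `client.messages.create(**first)`. Caching only applies when the prefix is byte-for-byte identical across calls, so anything that varies per query belongs after it.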

Pattern 2: Multi-Document Synthesis

Best for: research synthesis, comparative analysis, due diligence.

Architecture:
1. Retrieve top-N documents (RAG or full corpus if small enough)
2. Apply boundary placement (best docs at front and back)
3. Use long-context model for synthesis
4. Cache the document corpus if queried by multiple users

Pattern 3: Hierarchical Document Processing

Best for: very large documents (100+ pages) where important information may be anywhere.

Architecture:
1. Split into chunks (e.g., 4K tokens each)
2. Process each chunk to extract relevant passages (parallel)
3. Combine extracts (now much shorter total)
4. Final synthesis pass on combined extracts
Cost: chunk_cost × n_chunks + synthesis_cost
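The split, extract, combine, and synthesize flow above can be sketched as a small map-reduce. The LLM calls are passed in as plain callables (`extract_fn` and `synthesize_fn` are hypothetical names), so the example runs with stubs; chunking uses a rough 4-characters-per-token estimate and may cut sentences at chunk boundaries.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def split_into_chunks(text: str, chunk_tokens: int = 4_000, chars_per_token: int = 4) -> list:
    """Split text into pieces of roughly chunk_tokens tokens (character estimate)."""
    size = chunk_tokens * chars_per_token
    return [text[i:i + size] for i in range(0, len(text), size)]


def hierarchical_process(
    document: str,
    query: str,
    extract_fn: Callable[[str, str], str],     # (chunk, query) -> passages, "" if none
    synthesize_fn: Callable[[str, str], str],  # (combined extracts, query) -> answer
    chunk_tokens: int = 4_000,
    max_workers: int = 8,
) -> str:
    chunks = split_into_chunks(document, chunk_tokens)
    # Map step: extract from each chunk in parallel (LLM calls are I/O-bound).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        extracts = list(pool.map(lambda chunk: extract_fn(chunk, query), chunks))
    # Reduce step: keep non-empty extracts, labelled by chunk for traceability.
    combined = "\n\n".join(
        f"[chunk {i + 1}] {text}" for i, text in enumerate(extracts) if text.strip()
    )
    return synthesize_fn(combined, query)


# Stub "models": keyword filter for extraction, pass-through for synthesis.
def keyword_extract(chunk: str, query: str) -> str:
    return "\n".join(line for line in chunk.splitlines() if "indemnif" in line.lower())


lines = ["filler line"] * 50 + ["Party A shall indemnify Party B."] + ["filler line"] * 50
result = hierarchical_process(
    "\n".join(lines), "Find indemnification clauses",
    extract_fn=keyword_extract,
    synthesize_fn=lambda combined, query: combined,
    chunk_tokens=50,
)
print(result)  # [chunk 4] Party A shall indemnify Party B.
```

In production, overlapping chunks or splitting on section boundaries reduces the risk of cutting a relevant clause in half.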

Pattern 4: Long-Form Generation with Self-Consistency

Best for: generating long coherent documents (reports, articles, code) where consistency matters.

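Architecture:
1. Generate a detailed outline (short context)
2. For each section, include the outline plus the previously written sections as context (or summaries of them once they grow long)
3. The shared outline keeps sections consistent; summarizing older sections prevents the total input from growing as O(n²) in the number of sections
4. Final review pass of the complete document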
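A minimal sketch of this outline-driven loop, with the LLM call abstracted as a `generate_fn` callable (a hypothetical name). Capping the prior text at `max_prior_chars` is an assumption added here to keep each call bounded; the pattern as described only says to include the outline and previous sections.

```python
from typing import Callable


def generate_long_document(
    topic: str,
    generate_fn: Callable[[str], str],  # prompt -> text; wraps your LLM call
    max_prior_chars: int = 8_000,       # ASSUMPTION: cap on verbatim prior text per call
) -> str:
    # Step 1: outline from a short prompt.
    outline = generate_fn(f"Write a numbered outline, one section title per line, for: {topic}")
    titles = [line.strip() for line in outline.splitlines() if line.strip()]

    # Step 2: write each section with the outline plus recent prior text as context.
    sections = []
    for title in titles:
        prior = "\n\n".join(sections)[-max_prior_chars:]
        prompt = (
            f"Outline:\n{outline}\n\n"
            f"Previously written (most recent part):\n{prior}\n\n"
            f"Write the section titled: {title}"
        )
        sections.append(generate_fn(prompt))

    # Step 3: one review pass over the assembled draft.
    draft = "\n\n".join(sections)
    return generate_fn(f"Review for consistency and return the final document:\n\n{draft}")


# Stub model so the control flow can be checked without an API key.
calls = []

def fake_generate(prompt: str) -> str:
    calls.append(prompt)
    if prompt.startswith("Write a numbered outline"):
        return "1. Intro\n2. Methods"
    if prompt.startswith("Review"):
        return "FINAL"
    return "section text"


print(generate_long_document("long-context cost control", fake_generate))  # FINAL
print(len(calls))  # 4
```

The outline appears in every section prompt, which is what carries global consistency even when older sections are truncated.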

Common Mistakes

:::danger Benchmarking with Needle-in-a-Haystack only

NIAH is a minimal sanity check, not a capability assessment. A model that passes NIAH may still fail at multi-hop reasoning across a 128K context (RULER), at synthesis tasks requiring information from multiple positions, or at your specific production workload. Always benchmark on tasks representative of your actual use case.

:::

:::warning Not accounting for prompt caching in cost models

If you're making repeated queries against the same long context (common in document analysis), prompt caching reduces costs by roughly 50-90% for the cached portion, depending on the provider. Cost models that assume full input pricing for every query will dramatically overestimate real production costs - sometimes by close to 10×. Include caching in your financial analysis before deciding whether long-context is economically viable.

:::

:::danger Ignoring the effective context length vs advertised context length gap

A model advertised as "128K context" may reliably use information anywhere in 128K for simple retrieval but degrade significantly at 64K for complex synthesis tasks. The effective context length for your specific task may be substantially shorter than the advertised maximum. Benchmark at your actual expected context length on tasks representative of your workload - not at short test contexts or simple NIAH tests.

:::
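One way to locate the effective context length for a workload is to run the same task at increasing context lengths and record where quality first falls below an acceptable bar. In the sketch below, `run_task` stands for your own evaluation (a hypothetical callable that builds a prompt of the given length, calls the model, and returns a score between 0 and 1); the 0.85 bar is an arbitrary example.

```python
from typing import Callable


def sweep_context_lengths(
    run_task: Callable[[int], float],  # context length in tokens -> score in [0, 1]
    lengths: list,
    min_acceptable: float = 0.85,      # illustrative quality bar
) -> dict:
    """Score the task at each length; report the longest length before the first failure."""
    results = {n: run_task(n) for n in sorted(lengths)}
    effective = 0
    for n in sorted(results):
        if results[n] < min_acceptable:
            break  # stop at the first length that misses the bar
        effective = n
    return {"scores": results, "effective_context_tokens": effective}


# Stub evaluation: pretends quality collapses past 32K tokens.
def fake_task(n_tokens: int) -> float:
    return 0.95 if n_tokens <= 32_000 else 0.70


report = sweep_context_lengths(fake_task, [8_000, 32_000, 64_000, 128_000])
print(report["effective_context_tokens"])  # 32000
```

Running several trials per length and averaging the scores makes the result less sensitive to a single lucky or unlucky sample.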

:::tip Use structured output format for long-context extraction

For extraction tasks over long contexts, explicitly structure the output format. Instead of asking for "a summary of the risk factors," ask for a structured response: "For each risk factor found, output: Risk Name | Location in Document | Severity (High/Med/Low) | Mitigation Mentioned (Yes/No)." Structured output forces the model to be explicit about what it found vs. what it's inferring, and makes verification easier.

:::
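A structured format also makes the response machine-checkable. The sketch below parses the pipe-delimited rows from the example format and separates well-formed rows from lines that fail validation; the `RiskRow` type and function name are illustrative.

```python
from dataclasses import dataclass


@dataclass
class RiskRow:
    name: str
    location: str
    severity: str
    mitigation_mentioned: bool


def parse_risk_rows(model_output: str):
    """Parse 'Name | Location | Severity | Yes/No' rows; return (rows, rejected lines)."""
    rows, rejected = [], []
    for line in model_output.splitlines():
        if "|" not in line:
            continue  # skip prose or blank lines around the rows
        parts = [p.strip() for p in line.split("|")]
        if (
            len(parts) == 4
            and all(parts)
            and parts[2].lower() in {"high", "med", "low"}
            and parts[3].lower() in {"yes", "no"}
        ):
            rows.append(
                RiskRow(parts[0], parts[1], parts[2].capitalize(), parts[3].lower() == "yes")
            )
        else:
            rejected.append(line)  # surface for retry or human review
    return rows, rejected


sample = (
    "Here are the risks:\n"
    "Currency exposure | Section 4.2 | High | No\n"
    "Supplier concentration | p. 17 | medium | Yes\n"
)
rows, rejected = parse_risk_rows(sample)
print(len(rows), len(rejected))  # 1 1
```

Rejected lines are a useful signal: a high rejection rate usually means the format instruction is being lost somewhere in the long prompt.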

:::warning Monitor time-to-first-token for long-context applications

At 128K input tokens, the prefill phase (computing KV cache for all input tokens) can take 10-30 seconds. For interactive applications, this is often unacceptable. Consider streaming responses so users see partial output before the full answer, or pre-warming the KV cache for known contexts (using prompt caching APIs).

:::
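Time-to-first-token can be measured around any streaming response by timing the arrival of the first non-empty chunk. The sketch below works on a generic iterator of text chunks; `fake_stream` is a stand-in that simulates a prefill pause. With a real SDK, wrap the request itself in a generator so the timer starts before the request is sent.

```python
import time
from typing import Iterable, Iterator


def measure_ttft(stream: Iterable[str]):
    """Consume a stream of text chunks; return (ttft_seconds, total_seconds, full_text)."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in stream:
        if ttft is None and chunk:
            ttft = time.perf_counter() - start  # first visible output
        parts.append(chunk)
    total = time.perf_counter() - start
    return (ttft if ttft is not None else total), total, "".join(parts)


def fake_stream(prefill_seconds: float = 0.2) -> Iterator[str]:
    """Stand-in for a streaming API response: a prefill pause, then tokens."""
    time.sleep(prefill_seconds)
    for word in ["Long", " context", " answer"]:
        yield word


ttft, total, text = measure_ttft(fake_stream(0.05))
print(f"TTFT {ttft:.2f}s, total {total:.2f}s, text={text!r}")
```

Logging TTFT separately from total latency shows whether a slow response is dominated by prefill (long input) or by decoding (long output).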

Interview Q&A

Q: How would you architect a production system that analyzes 200-page legal contracts using a 128K context LLM?

A: Multi-layer approach. First, a preprocessing pipeline: OCR if needed, then tokenization to check if the full contract fits in context (typical 200-page contract ≈ 50-80K tokens - usually fits). For contracts that fit, use full-context analysis with strategic prompt structure: place the most important contract sections (indemnification, termination, limitation of liability) at both the beginning and end of the context to exploit primacy and recency effects. For contracts that don't fit, use hierarchical processing: extract clauses from each section in parallel, then synthesize across extracted clauses. Enable prompt caching when the same contract is analyzed multiple times (multiple reviewers, multiple question types). Monitor time-to-first-token and implement streaming for the UI. Track cost per contract and alert when per-contract cost exceeds threshold.

Q: What is prompt caching, and how does it change the economics of long-context applications?

A: Prompt caching stores the KV representations of a static prompt prefix so subsequent calls with the same prefix can skip the expensive prefill computation. Providers like Anthropic charge approximately 10% of the regular price for cached tokens, and time-to-first-token is reduced. The economics change dramatically for use cases where the same long context is queried repeatedly. For example, with Claude 3.5 Sonnet at $3/M input tokens: analyzing a 100K-token document with 20 different queries costs $6.00 without caching (20 × $0.30). With caching: $0.30 for the first call + 19 × $0.03 = $0.87 total. That's roughly 7× cheaper. The break-even point - when caching saves more than the cache creation overhead - is typically the second query.

Q: Your 128K context model is answering questions about a document correctly when the relevant section is in the first 30K tokens, but failing when it's in the middle 60K tokens. What's wrong and how do you fix it?

A: This is the lost-in-middle problem from Liu et al. 2023. The model's attention mechanism disproportionately attends to tokens at the beginning and end of the context, missing middle content. Fixes, in order of effort: (1) Restructure the prompt to put the most critical information at both the beginning and end by explicitly repeating key constraints or the most relevant passage there. (2) Add explicit instruction: "The answer may appear in the middle of the document - read all sections carefully." This provides a 5-10% improvement. (3) Use hierarchical processing: extract relevant sentences from each 10K-token chunk, then do a final synthesis pass over only the extracts. (4) Switch to a model with stronger long-context performance at that length (Claude 3.5 Sonnet has better middle-position recall than GPT-4o at 100K contexts based on RULER benchmarks).

Q: How do you handle a long multi-turn conversation that's approaching the context limit?

A: Three main strategies. Sliding window: drop the oldest turns, keeping only the N most recent. Simple but loses long-term context - the model forgets what was said early. Hierarchical summarization: periodically compress old turns into a running summary using the LLM itself. The summary is prepended as context for subsequent turns. Preserves key facts at the cost of some detail loss. External memory: store all turns in a vector database; at each new turn, retrieve the most semantically relevant past turns. Works best for very long conversations where specific details from much earlier may be relevant. The right choice depends on conversation type: for short-session, topically focused conversations, sliding window is sufficient; for multi-session, fact-heavy conversations, external memory is more appropriate.

Q: Compare GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro for a use case requiring 200K token contexts.

A: GPT-4o is limited to 128K tokens - not an option for 200K. Between Claude 3.5 Sonnet (200K) and Gemini 1.5 Pro (1M): Sonnet has better quality for complex synthesis and reasoning tasks at that context length based on RULER benchmarks, while Gemini 1.5 Pro costs approximately the same per token but handles larger contexts if you later need to grow beyond 200K. For a strict 200K requirement, Claude 3.5 Sonnet is the higher-quality choice; for a workload that might scale to 500K-1M tokens, Gemini 1.5 Pro provides headroom. Cost-wise at 200K context (input only): Claude at $3/M = $0.60/call; Gemini Pro at $3.5/M = $0.70/call - similar. At high volume, consider Gemini 1.5 Flash ($0.075/M = $0.015/call) if quality is acceptable, representing a 40× cost reduction.

