Case Studies: Production LLM Systems
What Real Systems Look Like
There is a significant gap between the LLM tutorial and the production system. Tutorials show you how to call the API. They rarely show you how to handle 1.8 million users generating hundreds of millions of API calls per day, how to keep perceived completion latency under 150ms when a single inference call can take 800ms, or why the model you chose for your MVP is the wrong model by the time of your third production incident.
The case studies in this lesson are not hypothetical. They are reconstructed from public engineering blog posts, conference talks, and technical interviews with engineers who built these systems. Where numbers are cited, they come from public sources. Where specific internal details are not available, the analysis is based on first-principles reasoning about the constraints any engineering team would face building these systems.
Five systems, five different problem domains, five different sets of tradeoffs. After each case study, the key patterns are extracted - these patterns appear across multiple systems and represent the closest thing to "best practices" that this field has developed.
Case Study 1: GitHub Copilot
Context
GitHub Copilot launched as a technical preview in June 2021, powered by OpenAI Codex, and became generally available in June 2022. By early 2024 it had a reported 1.8 million paid subscribers. It runs as an IDE extension that watches every keystroke and generates code completions in real time. The engineering constraints are severe: the completion must appear fast enough to feel instantaneous while the user is still typing - or it becomes a distraction rather than a tool.
Architecture
Context Assembly
The key engineering insight in Copilot is how context is assembled. The model receives a fixed-size context window, and filling it intelligently is the difference between good suggestions and irrelevant ones.
Context elements (roughly in priority order):
- The file currently being edited (prefix up to cursor + suffix after cursor)
- Recently edited files in the current session
- Files open in other editor tabs
- Files with similar names (e.g., the .test.ts file when editing a .ts file)
- File path structure (the directory tree as text, limited)
- Language-specific signals: imports, function signatures, class definitions
Supplying both the prefix and the suffix of the current file is what makes this a "fill in the middle" (FIM) task: the model receives the code before AND after the cursor position and must predict what belongs in the gap. FIM-trained models (Codex, Code Llama, DeepSeek Coder) generally outperform purely left-to-right models for mid-function completion, because the suffix tells the model where the code is headed.
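To make the prefix/suffix assembly concrete, here is a minimal sketch of building a FIM prompt under a fixed budget. It is not Copilot's implementation: the sentinel strings, the character-based budget, and the 75/25 prefix/suffix split are all assumptions for illustration. Real FIM models define their own special tokens and budgets are counted in tokens, not characters.

```python
from dataclasses import dataclass

# Hypothetical sentinel strings; each FIM-trained model defines its own.
FIM_PREFIX = "<FIM_PREFIX>"
FIM_SUFFIX = "<FIM_SUFFIX>"
FIM_MIDDLE = "<FIM_MIDDLE>"


@dataclass
class FimPrompt:
    prefix: str
    suffix: str

    def render(self) -> str:
        # Prefix-suffix-middle ordering: the model generates after FIM_MIDDLE.
        return f"{FIM_PREFIX}{self.prefix}{FIM_SUFFIX}{self.suffix}{FIM_MIDDLE}"


def build_fim_prompt(
    prefix: str,
    suffix: str,
    budget_chars: int = 6000,    # assumed budget, in characters
    prefix_share: float = 0.75,  # assumed split between prefix and suffix
) -> FimPrompt:
    """Trim prefix and suffix to a budget, keeping the text closest to the cursor."""
    prefix_budget = int(budget_chars * prefix_share)
    suffix_budget = budget_chars - prefix_budget
    # Keep the END of the prefix and the START of the suffix.
    trimmed_prefix = prefix[-prefix_budget:] if prefix_budget > 0 else ""
    trimmed_suffix = suffix[:suffix_budget]
    return FimPrompt(prefix=trimmed_prefix, suffix=trimmed_suffix)
```

The design choice worth noting is the trimming direction: text nearest the cursor is the most predictive, so the prefix is cut from its start and the suffix from its end.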
Latency Architecture
150ms perceived latency is the target for inline suggestions. Achieving this requires several techniques working together.
Debouncing: don't send a request for every keystroke. Wait until the user has stopped typing for 75ms. If a new keystroke arrives before the request returns, cancel the old request and start fresh.
Speculative completions: pre-generate completions for likely cursor positions before the user arrives there. If the user accepts a completion, the extension can immediately start generating the next line speculatively.
Client-side cache: cache completions keyed by (file content hash, cursor position). If the user undoes a change and retypes the same code, the cached completion appears instantly.
Streaming: start displaying the completion as tokens arrive, not after the full completion is ready. With a latency-optimized model the first token can arrive in roughly 200ms; by the time the user has read the first 10 tokens, the rest have arrived.
```python
import asyncio
import hashlib
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Optional


@dataclass
class CompletionRequest:
    prefix: str  # code before cursor
    suffix: str  # code after cursor (FIM)
    language: str
    file_path: str
    cursor_line: int
    cursor_col: int


@dataclass
class CopilotCompletionEngine:
    """
    Simplified model of Copilot's debounce + cache + request management.
    """

    debounce_ms: float = 75.0
    cache: dict[str, str] = field(default_factory=dict)
    pending_request: Optional[asyncio.Task] = None

    def _cache_key(self, req: CompletionRequest) -> str:
        content = f"{req.prefix}|||{req.suffix}|||{req.language}"
        return hashlib.sha256(content.encode()).hexdigest()[:16]

    async def get_completion(
        self,
        req: CompletionRequest,
        llm_fn: Callable[[CompletionRequest], Awaitable[Optional[str]]],
    ) -> Optional[str]:
        # Check cache first
        key = self._cache_key(req)
        if key in self.cache:
            return self.cache[key]

        # Cancel any pending request (a newer keystroke supersedes it)
        if self.pending_request and not self.pending_request.done():
            self.pending_request.cancel()

        # Debounce: wait for typing to stop before calling the model
        async def _delayed_call() -> Optional[str]:
            await asyncio.sleep(self.debounce_ms / 1000)
            result = await llm_fn(req)
            if result:
                self.cache[key] = result
            return result

        self.pending_request = asyncio.create_task(_delayed_call())
        try:
            return await self.pending_request
        except asyncio.CancelledError:
            return None  # new keystroke arrived, request was cancelled
```
Scale and Model Selection
At 1.8M subscribers generating an average of ~100 suggestions per developer per hour (estimated from public talks), Copilot processes on the order of hundreds of millions of completions per day. At this scale, model selection is a cost and latency decision, not just a quality decision.
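The volume estimate is simple arithmetic, sketched below. The subscriber count and suggestions-per-hour figure come from the paragraph above; the number of active coding hours per day is an assumption, as is the simplification that every subscriber is active every day.

```python
def daily_completions(
    subscribers: int,
    suggestions_per_hour: float,
    active_hours_per_day: float,  # assumed; not a published figure
) -> float:
    """Back-of-envelope daily completion volume."""
    return subscribers * suggestions_per_hour * active_hours_per_day


for hours in (1, 2, 4):
    total = daily_completions(1_800_000, 100, hours)
    print(f"{hours}h/day active -> {total / 1e6:.0f}M completions/day")
```

Under these assumptions, anywhere from one to four active hours per day lands in the hundreds of millions of completions per day.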
GitHub has been transparent about using different models for different contexts:
- Inline completions (the primary product): latency-optimized model. GPT-4 was too slow for inline use - the first token arrives 400–600ms after the request, over the acceptable threshold. GitHub uses a smaller, faster model for inline completions.
- Copilot Chat (conversational): quality-optimized model. GPT-4 is acceptable here because the user expects to wait 2–5 seconds for a chat response.
This model tiering by interaction type is a pattern that appears in every case study in this lesson.
Lessons Learned
Lesson 1: Latency beats quality for inline suggestions. A 95% quality completion that appears in 120ms is more useful than a 99% quality completion that appears in 400ms. Users will dismiss the 400ms suggestion before it fully renders.
Lesson 2: Context assembly quality is the primary quality lever. The model is fixed. You cannot make it smarter in production. But you can give it better context. Most of Copilot's quality improvements over time have come from better context assembly, not model size increases.
Lesson 3: Speculative execution hides latency. By pre-generating likely completions and caching at the client, the perceived latency is often 0ms - the completion was already there. This is the single most impactful latency optimization.
Case Study 2: Notion AI
Context
Notion AI launched in February 2023, integrated directly into the Notion workspace. Unlike Copilot (one task: code completion), Notion AI supports 15+ different task types: summarize, improve writing, fix spelling and grammar, continue writing, find action items, explain this, translate, and more. Each task has different quality requirements, different prompt structures, and different optimal models.
Multi-Task Architecture
Separation of Concerns: One Prompt Per Task
A naive implementation would use one general-purpose system prompt and put the task instruction in the user message. Notion's engineering team found that task-specific prompts dramatically outperform general-purpose prompts for formatting tasks. The "improve writing" task requires a specific tone and style profile. The "find action items" task requires structured JSON output. The "translate" task requires language-specific behavior.
```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterator

from openai import OpenAI


class NotionTaskType(Enum):
    SUMMARIZE = "summarize"
    IMPROVE_WRITING = "improve_writing"
    FIX_GRAMMAR = "fix_grammar"
    CONTINUE_WRITING = "continue_writing"
    FIND_ACTION_ITEMS = "find_action_items"
    EXPLAIN = "explain"
    TRANSLATE = "translate"


@dataclass
class TaskConfig:
    system_prompt: str
    model: str
    temperature: float
    max_tokens: int
    response_format: str = "text"  # or "json"


# Only four task types are configured here; the others are omitted for brevity.
TASK_CONFIGS: dict[NotionTaskType, TaskConfig] = {
    NotionTaskType.FIX_GRAMMAR: TaskConfig(
        system_prompt="""You are an expert editor. Fix only grammar, spelling, and punctuation errors.
Do NOT change the author's voice, tone, or word choices. Do NOT rewrite sentences unless grammatically broken.
Return only the corrected text with no explanation.""",
        model="gpt-4o-mini",  # Simple task -> cheap model
        temperature=0.1,  # Low temperature for consistent edits
        max_tokens=2048,
    ),
    NotionTaskType.IMPROVE_WRITING: TaskConfig(
        system_prompt="""You are an expert writing coach. Improve the clarity, flow, and impact of this text.
Preserve the core meaning and the author's voice. Make it more engaging and professional.
Return only the improved text with no explanation.""",
        model="gpt-4o",  # Creative task -> capable model
        temperature=0.7,
        max_tokens=4096,
    ),
    NotionTaskType.FIND_ACTION_ITEMS: TaskConfig(
        # JSON mode returns a single JSON object, so the array is wrapped in one.
        system_prompt="""Extract all action items from the following text.
An action item is a task, commitment, or next step assigned to a person or unassigned.
Return a JSON object of the form {"action_items": [{"task": string, "assignee": string|null, "due": string|null}]}
Return only the JSON object, nothing else.""",
        model="gpt-4o-mini",
        temperature=0.0,  # Zero temperature for structured extraction
        max_tokens=1024,
        response_format="json",
    ),
    NotionTaskType.SUMMARIZE: TaskConfig(
        system_prompt="""Summarize the following text in 2-3 concise paragraphs.
Capture the main ideas, key decisions, and important details.
Write in a clear, professional tone. Return only the summary.""",
        model="gpt-4o-mini",
        temperature=0.3,
        max_tokens=512,
    ),
}


class NotionAI:
    def __init__(self, client: OpenAI):
        self.client = client

    def process(
        self,
        task_type: NotionTaskType,
        selected_text: str,
        page_context: str = "",
    ) -> Iterator[str]:
        """Stream the response text for one task, chunk by chunk."""
        config = TASK_CONFIGS.get(task_type)
        if config is None:
            raise ValueError(f"No config defined for task: {task_type.value}")

        user_message = selected_text
        if page_context and task_type in [
            NotionTaskType.CONTINUE_WRITING,
            NotionTaskType.SUMMARIZE,
        ]:
            user_message = f"Page context:\n{page_context}\n\nSelected text:\n{selected_text}"

        kwargs = {
            "model": config.model,
            "messages": [
                {"role": "system", "content": config.system_prompt},
                {"role": "user", "content": user_message},
            ],
            "max_tokens": config.max_tokens,
            "temperature": config.temperature,
            "stream": True,
        }
        if config.response_format == "json":
            kwargs["response_format"] = {"type": "json_object"}

        # With stream=True the SDK returns an iterator of chunks, not a string.
        stream = self.client.chat.completions.create(**kwargs)
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content
```
Rate Limiting: Credit System
Notion AI uses a workspace-level credit system rather than per-user rate limits. Free workspaces get 20 AI responses per month total. Paid workspaces get unlimited access. This prevents a single heavy user from making the product feel limited to their whole team.
The credit system creates interesting engineering complexity: when a workspace is out of credits, you need to surface a clear error immediately (before calling the LLM), and you need to handle race conditions when two users hit the last credit simultaneously.
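The race on the last credit comes down to making "check the balance" and "decrement the balance" a single atomic step. The sketch below illustrates the idea with an in-memory ledger and a lock; `CreditLedger`, `run_ai_task`, and the refund-on-failure policy are hypothetical names and choices, not Notion's implementation. In a real deployment the same guarantee would more likely come from a conditional update or atomic decrement in the database.

```python
import threading
from typing import Callable


class CreditLedger:
    """In-memory stand-in for a workspace credit store (hypothetical)."""

    def __init__(self) -> None:
        self._balances: dict[str, int] = {}
        self._lock = threading.Lock()

    def set_balance(self, workspace_id: str, credits: int) -> None:
        with self._lock:
            self._balances[workspace_id] = credits

    def try_reserve(self, workspace_id: str) -> bool:
        """Atomically check and decrement; True if a credit was reserved."""
        with self._lock:
            remaining = self._balances.get(workspace_id, 0)
            if remaining <= 0:
                return False
            self._balances[workspace_id] = remaining - 1
            return True

    def refund(self, workspace_id: str) -> None:
        """Return a credit, e.g. when the LLM call fails after reservation."""
        with self._lock:
            self._balances[workspace_id] = self._balances.get(workspace_id, 0) + 1


def run_ai_task(ledger: CreditLedger, workspace_id: str, call_llm: Callable[[], str]) -> str:
    if not ledger.try_reserve(workspace_id):
        # Fail fast, before spending any tokens.
        raise PermissionError("Workspace is out of AI credits")
    try:
        return call_llm()
    except Exception:
        ledger.refund(workspace_id)
        raise
```

Reserving before the call and refunding on failure means two simultaneous requests can never both spend the last credit, and a failed LLM call does not cost the workspace anything.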
Lessons Learned
Lesson 1: Task-specific prompts and models beat a single general-purpose approach. A smaller model tuned or prompted specifically for grammar correction can outperform GPT-4 with a generic editing prompt, and can cost on the order of 50× less per call. Invest in task decomposition upfront.
Lesson 2: Streaming is a product requirement, not an optimization. Users who see text appearing token by token perceive the response as faster and more intelligent than users who see a loading spinner followed by a block of text, even when the actual time-to-complete is identical. Implement streaming from day one.
Lesson 3: Model selection is a continuous optimization, not a one-time decision. Notion has changed their model selection multiple times - as new cheaper models became available, they re-evaluated which tasks could be downgraded. Treat model selection as an A/B testable parameter.
Case Study 3: Customer Support Bot
Context
This case study represents a pattern deployed by dozens of companies: an LLM-powered customer support assistant that handles Tier 1 inquiries. The architecture here is a composite based on multiple public descriptions from companies including Intercom, Zendesk, and several startups.
The key constraints for customer support: users expect near-human quality, mistakes have real business cost (wrong billing information, incorrect refund policy), and the system must know when to hand off to a human agent.
Architecture
Intent Classification and Routing
Before calling a powerful and expensive model, classify the user's intent with a cheap, fast classifier. This determines which handler processes the request.
```python
from enum import Enum

from openai import OpenAI


class SupportIntent(Enum):
    FAQ = "faq"  # Can answer from knowledge base
    ACCOUNT_ACTION = "account_action"  # Needs tool calls (account data)
    COMPLAINT = "complaint"  # High-stakes, route to human
    BILLING = "billing"  # Needs billing system access
    UNKNOWN = "unknown"  # Escalate


INTENT_PROMPT = """Classify this customer support message into one category:
- FAQ: general questions about product features, pricing, policies
- ACCOUNT_ACTION: requests that require looking up or modifying account data
- BILLING: questions or disputes about charges, invoices, refunds
- COMPLAINT: expressions of dissatisfaction, frustration, or formal complaints
- UNKNOWN: unclear or cannot be classified
Message: {message}
Respond with only the category name (FAQ, ACCOUNT_ACTION, BILLING, COMPLAINT, or UNKNOWN)."""


def classify_intent(message: str, client: OpenAI) -> SupportIntent:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=20,
        temperature=0,
        messages=[
            {
                "role": "user",
                # Truncate long messages; the first 500 characters are used for classification
                "content": INTENT_PROMPT.format(message=message[:500]),
            }
        ],
    )
    # content can be None, so guard before normalizing
    raw = (response.choices[0].message.content or "").strip().upper()
    try:
        return SupportIntent[raw]
    except KeyError:
        # Anything unexpected is treated as UNKNOWN and escalated
        return SupportIntent.UNKNOWN
```
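Once an intent is available, routing can be a plain lookup table. The following is a sketch only: the handler functions are hypothetical stubs standing in for real knowledge-base, account, and billing handlers, and the enum is repeated so the example runs on its own. The design choice it illustrates is that any intent without an explicit bot handler falls through to a human.

```python
from enum import Enum
from typing import Callable


class SupportIntent(Enum):  # mirrors the enum used by the classifier
    FAQ = "faq"
    ACCOUNT_ACTION = "account_action"
    COMPLAINT = "complaint"
    BILLING = "billing"
    UNKNOWN = "unknown"


Handler = Callable[[str], str]


# Hypothetical handlers; each returns a label here instead of a real reply.
def answer_from_kb(message: str) -> str:
    return "kb_answer"


def handle_with_account_tools(message: str) -> str:
    return "account_tools"


def handle_billing(message: str) -> str:
    return "billing_tools"


def escalate_to_human(message: str) -> str:
    return "human"


# COMPLAINT and UNKNOWN are deliberately absent: they always escalate.
ROUTES: dict[SupportIntent, Handler] = {
    SupportIntent.FAQ: answer_from_kb,
    SupportIntent.ACCOUNT_ACTION: handle_with_account_tools,
    SupportIntent.BILLING: handle_billing,
}


def route(intent: SupportIntent, message: str) -> str:
    handler = ROUTES.get(intent, escalate_to_human)
    return handler(message)
```

Defaulting to escalation means that adding a new intent category cannot silently route users to a handler that was never designed for it.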
Human Handoff with Context Transfer
The handoff to a human agent is the most important interaction in a support bot. A botched handoff - where the human agent cannot see the conversation history, the user has to repeat themselves, and the context is lost - is worse than never having had the bot at all.
```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConversationContext:
    conversation_id: str
    user_id: str
    messages: list[dict]
    intent_history: list[str]
    entities_extracted: dict  # account_id, order_id, etc. from tool calls
    escalation_reason: str
    bot_confidence: float
    suggested_resolution: Optional[str]


def prepare_agent_handoff(ctx: ConversationContext) -> dict:
    """
    Prepare a structured handoff package for the human agent.
    This appears in the agent's UI when they receive the conversation.
    """
    # Generate a summary for the human agent using LLM
    summary_prompt = f"""Summarize this customer conversation for a human support agent.
Include: what the customer needs, what the bot tried, why it escalated, and suggested next steps.
Keep it under 100 words.
Conversation: {json.dumps(ctx.messages[-6:], indent=2)}"""
    # (In production, call LLM here)
    # summary = llm.complete(summary_prompt)
    return {
        "conversation_id": ctx.conversation_id,
        "user_id": ctx.user_id,
        "full_transcript": ctx.messages,
        "escalation_reason": ctx.escalation_reason,
        "bot_confidence": ctx.bot_confidence,
        "entities": ctx.entities_extracted,  # pre-populated account data
        "suggested_resolution": ctx.suggested_resolution,
        # "summary": summary,
        "priority": "high" if ctx.bot_confidence < 0.3 else "normal",
    }
```
CSAT as Ground Truth for Evaluation
Post-conversation customer satisfaction (CSAT) surveys are the most honest evaluation signal available. Correlate CSAT scores with:
- Which intent category the conversation was classified as
- Whether the bot handled it or escalated
- Which response model was used
- Which prompt version generated the final response
```python
import pandas as pd


def analyze_csat_by_segment(
    conversations_df: pd.DataFrame,
) -> pd.DataFrame:
    """
    Group CSAT scores by key dimensions to identify improvement opportunities.
    """
    return (
        conversations_df.groupby(["intent_category", "handled_by", "model_used"])
        .agg(
            avg_csat=("csat_score", "mean"),
            response_count=("conversation_id", "count"),
            escalation_rate=("was_escalated", "mean"),
        )
        .reset_index()
        .sort_values("avg_csat", ascending=True)  # worst first
    )
```
Lessons Learned
Lesson 1: The intent classification step determines 80% of the outcome. A correctly classified intent routes to the right handler. A misclassified intent produces a bad response regardless of how good the handler is. Invest heavily in intent classifier accuracy, especially at the boundaries between categories.
Lesson 2: Design the escalation path before the bot path. Most teams design the happy path (bot handles everything) and retrofit escalation. The better approach is to design escalation first: what context does the human agent need? How does the conversation state transfer? Then build the bot path as a pre-filter that reduces escalation volume.
Lesson 3: CSAT correlation drives product decisions. When you can correlate CSAT scores with specific conversation patterns, you stop guessing about what to improve. "Billing questions handled by the bot have 2.1 CSAT vs 4.3 CSAT when handled by humans" is actionable - disable bot handling for billing, or invest in improving it.
Case Study 4: Enterprise Document Search (RAG at Scale)
Context
An enterprise software company with 5,000 employees needed a document search system that could answer questions across 200,000 internal documents: engineering specs, HR policies, sales playbooks, legal agreements, historical decisions. The challenge: multi-tenancy (different departments can see different documents), freshness (documents update frequently), and quality (wrong answers in an enterprise context have real consequences).
Indexing Pipeline
Query Pipeline
Multi-Tenant Access Control
The hardest engineering problem in enterprise RAG is not retrieval quality - it is access control. An employee in the marketing department must not receive answers grounded in confidential engineering roadmap documents. Access control must be enforced at query time, not at indexing time.
```python
from dataclasses import dataclass


@dataclass
class UserContext:
    user_id: str
    department: str
    access_groups: list[str]  # e.g., ["engineering", "all-staff", "managers"]
    clearance_level: int  # 0 = public, 1 = internal, 2 = confidential, 3 = restricted


def build_access_filter(user: UserContext) -> dict:
    """
    Build a Pinecone-style metadata filter that enforces access control.
    (Other vector stores, such as Weaviate, use a different filter syntax.)
    Documents are indexed with their required access_groups.
    A document is accessible if ANY of its required groups matches ANY of the
    user's groups (or "all-staff") AND its clearance level is within the user's.
    """
    return {
        "$and": [
            {"access_groups": {"$in": [*user.access_groups, "all-staff"]}},
            {"clearance_level": {"$lte": user.clearance_level}},
        ]
    }


async def rag_query(
    user_query: str,
    user: UserContext,
    vector_store,
    embedding_model,
    llm_client,  # an async client (e.g. AsyncOpenAI), since calls are awaited
) -> dict:
    # Step 1: Embed query
    query_embedding = await embedding_model.embed(user_query)

    # Step 2: Retrieve with access filter
    access_filter = build_access_filter(user)
    raw_results = await vector_store.query(
        vector=query_embedding,
        filter=access_filter,
        top_k=20,
        include_metadata=True,
    )

    # Step 3: Rerank (cross-encoder, no access filter needed - already applied)
    # rerank() is assumed to be defined elsewhere and to return objects
    # exposing .text and .metadata
    reranked = await rerank(user_query, raw_results, top_k=5)

    # Step 4: Generate with citations
    context = "\n\n".join(
        f"[Source {i+1}: {r.metadata['doc_title']}]\n{r.text}"
        for i, r in enumerate(reranked)
    )
    response = await llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """Answer the question using only the provided sources.
For every claim, cite the source number in brackets like [1] or [2].
If the sources do not contain enough information to answer, say so explicitly.""",
            },
            {
                "role": "user",
                "content": f"Sources:\n{context}\n\nQuestion: {user_query}",
            },
        ],
    )
    return {
        "answer": response.choices[0].message.content,
        "sources": [
            {
                "title": r.metadata["doc_title"],
                "url": r.metadata["doc_url"],
                "chunk": r.text[:200],
            }
            for r in reranked
        ],
    }
```
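Because the access rule is the part that must never be wrong, it can help to mirror it in plain Python so it can be unit-tested and used as a second check on retrieved chunks. The sketch below is one possible way to do that, not a feature of any vector database: `is_accessible` is a hypothetical helper, and treating a missing clearance level as inaccessible is an assumed fail-closed policy.

```python
from dataclasses import dataclass


@dataclass
class UserContext:  # same shape as the UserContext used for the query filter
    user_id: str
    department: str
    access_groups: list[str]
    clearance_level: int


def is_accessible(doc_metadata: dict, user: UserContext) -> bool:
    """Plain-Python mirror of the metadata filter, for tests and audits."""
    doc_groups = set(doc_metadata.get("access_groups", []))
    group_ok = "all-staff" in doc_groups or bool(doc_groups & set(user.access_groups))
    # Missing clearance metadata is treated as inaccessible (fail closed).
    clearance_ok = doc_metadata.get("clearance_level", float("inf")) <= user.clearance_level
    return group_ok and clearance_ok


def drop_unauthorized(chunks_metadata: list[dict], user: UserContext) -> list[dict]:
    """Defense in depth: re-check retrieved chunks before they reach the LLM."""
    return [m for m in chunks_metadata if is_accessible(m, user)]
```

Running this check on the retrieved chunks costs almost nothing and catches cases where a filter was misconfigured or a document was indexed with incomplete metadata.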
Freshness: Incremental Re-indexing
Enterprise documents change. An HR policy updated yesterday that the RAG system answers based on the old version is worse than no RAG system at all. Incremental re-indexing must be a first-class requirement.
```python
import asyncio
import hashlib
from datetime import datetime, timezone


class IncrementalIndexer:
    """
    Tracks document versions and re-indexes only changed documents.
    """

    def __init__(self, vector_store, embedding_model, doc_registry):
        self.store = vector_store
        self.embedder = embedding_model
        self.registry = doc_registry  # DB table: doc_id, content_hash, indexed_at

    async def sync_document(self, doc_id: str, content: str, metadata: dict):
        content_hash = hashlib.sha256(content.encode()).hexdigest()
        existing = await self.registry.get(doc_id)
        if existing and existing["content_hash"] == content_hash:
            return  # no change, skip

        # Delete old chunks
        if existing:
            await self.store.delete(filter={"doc_id": {"$eq": doc_id}})

        # Index new chunks (embedding calls run concurrently)
        chunks = self._chunk(content)
        embeddings = await asyncio.gather(
            *[self.embedder.embed(chunk) for chunk in chunks]
        )
        vectors = [
            {
                "id": f"{doc_id}-chunk-{i}",
                "values": emb,
                "metadata": {
                    **metadata,
                    "doc_id": doc_id,
                    "chunk_index": i,
                    "chunk_text": chunk,
                    # timezone-aware timestamp; datetime.utcnow() is deprecated
                    "indexed_at": datetime.now(timezone.utc).isoformat(),
                },
            }
            for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
        ]
        await self.store.upsert(vectors)
        # Record the new hash only after the upsert succeeds
        await self.registry.set(doc_id, content_hash)

    def _chunk(self, text: str, size: int = 512, overlap: int = 50) -> list[str]:
        """Split text into word windows of `size` with `overlap` shared words."""
        words = text.split()
        chunks = []
        for i in range(0, len(words), size - overlap):
            chunks.append(" ".join(words[i:i + size]))
            if i + size >= len(words):
                break  # final window reached; avoid a redundant tail chunk
        return chunks
```
Lessons Learned
Lesson 1: Retrieval quality is 80% of the problem. The LLM cannot synthesize a good answer from irrelevant chunks. Before optimizing prompts, ensure your retrieval pipeline (hybrid search + reranking) returns relevant, high-quality chunks. Most production RAG failures are retrieval failures, not generation failures.
Lesson 2: Query expansion improves recall on enterprise vocabulary. Internal documents use domain-specific terminology. A user asking "what is the vacation policy" may not match a document titled "PTO accrual and carry-forward guidelines." Generate 2-3 alternative phrasings of each query before retrieval - cheap (gpt-4o-mini) and significantly improves recall.
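A minimal sketch of query expansion follows. The lesson above does not say how results from the alternative phrasings should be merged; this example uses reciprocal rank fusion (RRF) as one reasonable choice, which is an assumption on my part. `expand_fn` and `search_fn` are hypothetical callables standing in for the cheap-model paraphraser and the retriever.

```python
from typing import Callable


def expanded_search(
    query: str,
    expand_fn: Callable[[str], list[str]],  # hypothetical: returns paraphrases
    search_fn: Callable[[str], list[str]],  # hypothetical: returns ranked doc IDs
    top_k: int = 5,
    rrf_k: int = 60,  # conventional RRF damping constant
) -> list[str]:
    """Search with the original query plus paraphrases; merge by reciprocal rank fusion."""
    queries = [query] + [q for q in expand_fn(query) if q != query]
    scores: dict[str, float] = {}
    for q in queries:
        for rank, doc_id in enumerate(search_fn(q), start=1):
            # A document ranked highly by several phrasings accumulates more score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    ranked = sorted(scores, key=lambda d: scores[d], reverse=True)
    return ranked[:top_k]
```

In the vacation-policy example, the original phrasing might miss the PTO document entirely, while two paraphrases both rank it first, so it ends up at the top of the merged list.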
Lesson 3: Access control at query time, not index time. Applying access filters at indexing (separate indices per department) creates maintenance nightmares when permissions change. Apply filters as metadata predicates at query time - the vector database excludes unauthorized chunks from the search, so they are never returned to the application or the LLM.
Case Study 5: Code Review Agent
Context
A developer tools company built an automated code review agent that analyzes pull requests, identifies issues, suggests improvements, and posts structured comments. Unlike Copilot (synchronous, latency-critical), code review is asynchronous - the developer submits a PR and checks comments when ready. The acceptable latency window is minutes, not milliseconds. This changes the optimization target from latency to cost and thoroughness.
Agentic Architecture
Tool Definitions
```python
import asyncio
import json
import subprocess

from openai import AsyncOpenAI

CODE_REVIEW_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_codebase",
            "description": "Search the codebase for related files, functions, or patterns.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search query (function name, pattern, or description)",
                    },
                    "file_type": {
                        "type": "string",
                        "enum": ["py", "ts", "js", "go", "java", "all"],
                        "description": "Filter by file extension",
                    },
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "run_linter",
            "description": "Run the linter on a specific file and return issues found.",
            "parameters": {
                "type": "object",
                "properties": {
                    "file_path": {
                        "type": "string",
                        "description": "Path to the file to lint",
                    },
                },
                "required": ["file_path"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_function_definition",
            "description": "Get the full definition of a function or class in the codebase.",
            "parameters": {
                "type": "object",
                "properties": {
                    "symbol": {
                        "type": "string",
                        "description": "Function or class name to look up",
                    },
                    "file_path": {
                        "type": "string",
                        "description": "Optional: specific file to search in",
                    },
                },
                "required": ["symbol"],
            },
        },
    },
]


async def run_code_review_agent(
    pr_diff: str,
    repo_context: dict,
    client: AsyncOpenAI,  # async client, so the event loop is not blocked
    max_iterations: int = 5,
) -> list[dict]:
    """
    Agentic code review: plan -> parallel tool calls -> synthesize comments.
    """
    messages = [
        {
            "role": "system",
            "content": """You are an expert code reviewer. Analyze this pull request and identify:
1. Correctness issues (bugs, logic errors, off-by-one errors)
2. Security vulnerabilities (SQL injection, XSS, insecure dependencies)
3. Performance problems (N+1 queries, unnecessary re-renders, memory leaks)
4. Code quality issues (missing error handling, unclear naming, missing tests)
Use the available tools to look up context before commenting.
Be specific: cite line numbers and explain why each issue matters.
Be constructive: suggest the fix, not just the problem.""",
        },
        {
            "role": "user",
            "content": f"""Review this pull request:
Repository context:
{json.dumps(repo_context, indent=2)}
PR Diff:
{pr_diff}
Use tools to look up any context you need, then provide structured review comments.""",
        },
    ]
    final_text = ""
    for _ in range(max_iterations):
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=CODE_REVIEW_TOOLS,
            tool_choice="auto",
            temperature=0.2,  # Low temperature for consistent reviews
            max_tokens=4096,
        )
        msg = response.choices[0].message
        messages.append(msg)

        # No more tool calls -> agent is done
        if not msg.tool_calls:
            final_text = msg.content or ""
            break

        # Execute independent tool calls concurrently
        tool_results = await asyncio.gather(
            *[execute_tool(tc) for tc in msg.tool_calls]
        )
        for tool_call, result in zip(msg.tool_calls, tool_results):
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result),
            })

    # Extract final structured comments; parse_review_comments() is assumed to
    # be defined elsewhere. final_text stays empty if the iteration budget ran
    # out while the model was still requesting tools.
    return parse_review_comments(final_text)


async def execute_tool(tool_call) -> dict:
    """Execute a single tool call. Real implementation connects to actual tools."""
    name = tool_call.function.name
    args = json.loads(tool_call.function.arguments)
    if name == "run_linter":
        # Run the linter in a worker thread so the event loop is not blocked
        result = await asyncio.to_thread(
            subprocess.run,
            ["pylint", args["file_path"], "--output-format=json"],
            capture_output=True,
            text=True,
        )
        return {"issues": json.loads(result.stdout or "[]")}
    elif name == "search_codebase":
        # List matching files with ripgrep (-l cannot be combined with --json)
        result = await asyncio.to_thread(
            subprocess.run,
            ["rg", "--files-with-matches", "--", args["query"]],
            capture_output=True,
            text=True,
        )
        return {"files": result.stdout.splitlines()[:10]}
    # get_function_definition is declared above but not implemented in this sketch
    return {"error": f"Unknown tool: {name}"}
```
Cost Analysis
At GPT-4o rates, a thorough code review costs roughly $0.08 per PR (2,000–8,000 input tokens for diff + context, 1,000–3,000 output tokens for comments). At 100 PRs/day, that is about $8/day, well within budget for most engineering organizations.
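The per-PR figure can be sanity-checked with a few lines of arithmetic. The token ranges below come from the paragraph above; the prices per million tokens are assumptions for illustration and should be replaced with the provider's current price list. The estimate also treats the review as a single model call, whereas an agent loop resends the growing context on every iteration.

```python
def review_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Cost of one model call, given prices in USD per million tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


# Assumed prices (USD per million tokens); not taken from the case study.
INPUT_PRICE = 5.00
OUTPUT_PRICE = 15.00

low = review_cost_usd(2_000, 1_000, INPUT_PRICE, OUTPUT_PRICE)
high = review_cost_usd(8_000, 3_000, INPUT_PRICE, OUTPUT_PRICE)
prs_per_day = 100

print(f"per PR:  ${low:.3f} to ${high:.3f}")
print(f"per day: ${low * prs_per_day:.2f} to ${high * prs_per_day:.2f}")
```

Under these assumed prices the upper end of the token range lands close to the per-PR figure quoted above; the lower end is roughly a third of it.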
Cost optimization strategies:
- Cheap model first pass: use GPT-4o-mini to triage the diff. Only send PRs above a complexity threshold to GPT-4o.
- Context budget: limit codebase context to the 10 most relevant files. Searching the entire codebase increases cost without proportional quality gain.
- Comment deduplication: if the agent produces 15 similar comments about the same pattern, collapse them into one. Fewer, better comments are more useful.
- Low-temperature caching: code review for identical diffs can be cached. When developers re-push the same change, skip the agent call.
Determinism: Temperature and Caching
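Code reviews benefit from low temperature (0.1–0.3): low enough that the same diff produces largely consistent comments, but with enough variation to avoid formulaic output. Low temperature reduces variance; it does not make output strictly deterministic, which is why exact repeatability comes from caching rather than from sampling settings. At temperature 0, the agent can become too rigid and miss issues that require more exploratory reasoning.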
Caching strategy: hash the git diff (not the PR ID). If the same diff appears again (e.g., a force push that changes nothing), return the cached review. This is correct behavior: the code is the same, so the review should be the same.
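A sketch of the diff-keyed cache is shown below. `ReviewCache` and `review_fn` are hypothetical names. One addition beyond the text: the cache key also includes the prompt version and model name, on the assumption that a prompt or model change should invalidate old reviews even when the diff is unchanged.

```python
import hashlib
from typing import Callable


class ReviewCache:
    """In-memory cache of reviews keyed by diff content (hypothetical sketch)."""

    def __init__(self) -> None:
        self._store: dict[str, list[dict]] = {}

    @staticmethod
    def key(diff: str, prompt_version: str, model: str) -> str:
        h = hashlib.sha256()
        for part in (prompt_version, model, diff):
            h.update(part.encode("utf-8"))
            h.update(b"\x00")  # separator so field boundaries stay unambiguous
        return h.hexdigest()

    def get_or_review(
        self,
        diff: str,
        prompt_version: str,
        model: str,
        review_fn: Callable[[str], list[dict]],
    ) -> list[dict]:
        k = self.key(diff, prompt_version, model)
        if k not in self._store:
            self._store[k] = review_fn(diff)  # only runs on a cache miss
        return self._store[k]
```

Keying on content rather than PR ID means a force push that changes nothing is a cache hit, while a one-character change to the diff is a miss.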
Lessons Learned
Lesson 1: Agentic code review quality scales with tool quality, not model quality. The model's review is only as good as the context it receives. If search_codebase returns irrelevant files, the review will miss important patterns. Invest in tool implementation quality.
Lesson 2: Parallelizing tool calls is critical for latency. A sequential tool execution strategy (search → lint → docs → synthesize) takes 3-4 minutes. Parallel tool execution (search + lint + docs simultaneously) reduces wall time to 45-60 seconds. Always parallelize independent tool calls.
Lesson 3: Comment quality matters more than comment quantity. A review with 3 specific, actionable comments is more valuable than one with 15 generic observations. Implement a post-processing step that scores and filters comments by specificity and actionability before posting.
Patterns Across All Case Studies
Pattern 1: Model Tiering
Every production system uses at least two model tiers. The cheap, fast tier (GPT-4o-mini, Claude 3 Haiku) handles classification, routing, and simple tasks. The capable tier (GPT-4o, Claude 3.5 Sonnet) handles generation, reasoning, and complex analysis.
| Tier | Models | Use Cases | Approx. input cost per 1K tokens (USD) |
|---|---|---|---|
| Fast/cheap | GPT-4o-mini, Claude 3 Haiku | classification, routing, extraction | ~$0.00025 |
| Capable | GPT-4o, Claude 3.5 Sonnet | generation, reasoning, analysis | ~$0.003 |
| Specialized | Codex for code, embedding models | code completion, search | varies |
Pattern 2: Separation of Concerns
The worst LLM architectures try to do everything in one prompt. The best architectures separate concerns:
Input → Classification → Task-specific handler → Output validation → User
Each step has a clear responsibility. Each can be optimized, monitored, and replaced independently.
Pattern 3: Human-in-the-Loop at the Boundaries
Every case study has human escalation or review paths. The pattern is not "LLM does everything" - it is "LLM handles the easy cases so humans can focus on the hard cases."
The escalation trigger design matters:
- Copilot: no escalation (the IDE is the safety net - bad completion is just not accepted)
- Notion AI: the user accepts or rejects the suggestion - always human-in-the-loop
- Support bot: explicit confidence thresholds, CSAT feedback loop
- Enterprise RAG: groundedness check routes low-confidence answers to human review
- Code review: low-severity issues posted automatically, high-severity issues flagged for human review
Pattern 4: Offline Evaluation Before Production Changes
Every team that has run a serious LLM system for more than three months has built an offline evaluation pipeline:
- Maintain a labeled dataset of (input, expected_output) pairs
- On every prompt change or model update, run the new version against the dataset
- Compare aggregate quality scores (and per-example regressions) against the current production version
- Block deployment if quality regresses on any segment of the dataset
This is the single most important practice for maintaining production quality over time.
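The deployment gate in the last step can be sketched as a per-segment comparison. The record shape (`segment`, `score`), the function names, and the tolerance value are all assumptions for illustration; how scores are produced (exact match, rubric, LLM judge) is out of scope here.

```python
from collections import defaultdict


def segment_means(results: list[dict]) -> dict[str, float]:
    """Average score per segment from records like {"segment": ..., "score": ...}."""
    totals: dict[str, list[float]] = defaultdict(lambda: [0.0, 0.0])
    for r in results:
        totals[r["segment"]][0] += r["score"]
        totals[r["segment"]][1] += 1
    return {seg: total / count for seg, (total, count) in totals.items()}


def regressed_segments(
    baseline: list[dict],
    candidate: list[dict],
    tolerance: float = 0.01,  # assumed noise margin
) -> list[str]:
    """Segments where the candidate scores worse than production by more than tolerance."""
    base = segment_means(baseline)
    cand = segment_means(candidate)
    # A segment missing from the candidate run counts as a regression.
    return sorted(seg for seg, b in base.items() if cand.get(seg, 0.0) < b - tolerance)


def can_deploy(baseline: list[dict], candidate: list[dict]) -> bool:
    return not regressed_segments(baseline, candidate)
```

Comparing per segment rather than on the overall mean is the point: a prompt change that improves FAQ answers while degrading billing answers can look like a net win in aggregate.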
Pattern 5: Instrument Everything, Sample for Review
Log every LLM request: prompt, response, model, tokens, latency, cost, user_id, feature_id, prompt_version. Store these logs in a queryable store. Sample 5–10% for automated quality evaluation. Review 20 random examples per week manually.
The combination of full logging and sampled review catches both aggregate trends (automated metrics) and novel failure modes (manual review).
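A minimal sketch of the log record and the sampling decision follows. The field list mirrors the one above; `request_id` is an added field, and hashing it to decide sampling is an assumption, chosen so that the same request always gets the same decision regardless of which service evaluates it.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class LLMCallLog:
    request_id: str  # added for illustration; not in the field list above
    prompt: str
    response: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    user_id: str
    feature_id: str
    prompt_version: str


def sampled_for_eval(request_id: str, rate: float = 0.05) -> bool:
    """Deterministic sampling: the same request_id always gets the same decision."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)
```

Hash-based sampling avoids storing a separate "was sampled" flag and makes the evaluation set reproducible from the logs alone.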
Common Mistakes
:::danger Building the "do everything in one prompt" architecture
When a feature grows from one use case to five, the temptation is to add conditions and special cases to a single growing system prompt. The result is a prompt that tries to be a product manager, a customer service agent, a technical writer, and a sales assistant simultaneously - and does none of them well. Decompose into task-specific handlers early.
:::

:::danger Ignoring the escalation path until after launch
The escalation path to a human agent is not a fallback to design later. It is a core product experience. Users who need a human agent and cannot find a clear path are the most frustrated users you have. Design the escalation experience first, then build the bot automation as a layer on top of it.
:::

:::warning Treating evaluation as a launch-time activity
"We evaluated the system before launch and it performed well." LLM systems degrade. Models change (providers update models silently). Prompts drift. Input distributions shift. Evaluation is a continuous activity, not a gate you pass once. Build the evaluation pipeline before you build the product.
:::

:::danger Skipping access control in multi-tenant RAG
It is tempting to build a single shared index and trust that the LLM "won't reveal" restricted documents. This is not access control - it is wishful thinking. The LLM has no concept of access permissions. It will reveal whatever is in its context window. Apply metadata filters at query time to enforce access control at the retrieval layer, before any document content reaches the LLM.
:::

:::warning Optimizing for latency when the task is async
Code review, document indexing, and batch analysis are async tasks where users do not feel latency directly. Over-optimizing for latency in these contexts wastes engineering effort and sacrifices quality. Prioritize thoroughness and cost efficiency for async tasks, latency for synchronous user-facing interactions.
:::
Interview Questions
Q: Compare GitHub Copilot's architecture with a customer support bot. What are the fundamental differences in their design constraints?
A: Three fundamental differences.
- Latency: Copilot has a 150ms target - users feel anything slower as an interruption. A support bot has a 3–5 second acceptable window. This changes model selection (Copilot uses a smaller, faster model for inline; support uses a capable model for quality), caching (Copilot uses aggressive client-side speculative caching; support caches at the server level), and streaming (Copilot streams tokens as they arrive; support can also stream but the latency window is more forgiving).
- Error tolerance: Copilot errors are invisible - a bad suggestion is dismissed, the user types on. Support bot errors are highly visible - a wrong policy statement or a failed handoff is a support ticket. Copilot optimizes for volume of good suggestions; support optimizes for reliability.
- Interaction model: Copilot is one-shot (predict the next token given context). Support is multi-turn (maintain conversation state, handle topic changes, escalate on confidence thresholds). Support requires session management, intent tracking across turns, and a defined escalation path. Copilot has none of these.
Q: How would you design the access control system for an enterprise RAG application serving 5,000 employees across 20 departments?
A: Four-layer approach.
- First, document metadata at index time: when a document is indexed, store its access_groups (list of group IDs that can see it) and clearance_level (0-3) as vector metadata. This happens once at indexing and does not need to change unless document permissions change.
- Second, query-time filter: every search includes a metadata filter - only return chunks where the user's groups intersect with the chunk's access_groups, AND the clearance_level does not exceed the user's clearance. Never retrieve an unauthorized chunk and then filter in post-processing - that approach relies on the LLM not revealing the content, which is not a security guarantee.
- Third, citation provenance: every source cited in a response is verified against the access filter. If the response cites a document that the user should not see (a bug in the filter logic), catch it before returning.
- Fourth, audit logging: log every document retrieval with user_id, document_id, and the access filter applied. Quarterly access reviews cross-check these logs against current permission assignments.
The critical constraint is that access control must be applied before any document content reaches the LLM context window - not after.
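The filter, citation check, and audit log can be sketched in memory. This is an illustration, not any particular vector store's API: `Chunk`, `User`, `filtered_search`, and `verify_citations` are hypothetical names, and in a real system the predicate would be passed as the query's metadata filter so that unauthorized chunks are never returned by the store at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    access_groups: frozenset  # group IDs allowed to read the source document
    clearance_level: int      # 0-3, stored as metadata at index time


@dataclass(frozen=True)
class User:
    user_id: str
    groups: frozenset
    clearance: int


def is_authorized(user: User, chunk: Chunk) -> bool:
    """Groups must intersect AND the chunk's level must not exceed the user's."""
    shares_group = bool(user.groups & chunk.access_groups)
    return shares_group and chunk.clearance_level <= user.clearance


def filtered_search(user: User, index, audit_log: list) -> list:
    """Apply the access predicate as part of the search, before ranking or prompting."""
    visible = [chunk for chunk in index if is_authorized(user, chunk)]
    for chunk in visible:
        audit_log.append({
            "user_id": user.user_id,
            "document_id": chunk.doc_id,
            "filter": {"groups": sorted(user.groups), "max_clearance": user.clearance},
        })
    return visible


def verify_citations(user: User, cited_chunks) -> None:
    """Defense in depth: fail closed if a response cites an unauthorized document."""
    leaked = [c.doc_id for c in cited_chunks if not is_authorized(user, c)]
    if leaked:
        raise PermissionError(f"response cites unauthorized documents: {leaked}")
```

The citation check should never fire if the query-time filter is correct; it exists to catch bugs in the filter logic before a response leaves the system.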
Q: Walk through how you would design the cost control system for an agentic code review bot processing 500 PRs per day.
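A: At 500 PRs per day, the baseline is about $25/day, an average of $0.05 per PR. The main lever is tiered routing: a cheap first pass (about $0.005 per PR) screens every PR, and only complex PRs go to GPT-4o. That cuts daily spend from the $25 baseline to roughly $8-12/day.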
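The arithmetic behind tiered routing can be sketched as follows. The inputs are assumptions rather than confirmed figures: the $25/day baseline is read as every one of the 500 PRs receiving a full GPT-4o review (about $0.05 each), and $0.005 is read as the per-PR cost of a cheap triage pass. The fraction of PRs classed as complex is a free parameter.

```python
def daily_cost(prs_per_day: int, triage_cost: float, review_cost: float,
               complex_fraction: float) -> float:
    """Every PR pays for triage; only the complex fraction pays for a full review."""
    if not 0.0 <= complex_fraction <= 1.0:
        raise ValueError("complex_fraction must be between 0 and 1")
    per_pr = triage_cost + complex_fraction * review_cost
    return prs_per_day * per_pr


# Assumed inputs (see the note above): 500 PRs/day, $0.005 triage, $0.05 full review.
baseline = daily_cost(500, triage_cost=0.0, review_cost=0.05, complex_fraction=1.0)
tiered = daily_cost(500, triage_cost=0.005, review_cost=0.05, complex_fraction=0.3)
print(f"baseline ${baseline:.2f}/day, tiered ${tiered:.2f}/day")
```

Under these assumptions the daily cost is driven almost entirely by the complex fraction, which is why the accuracy of the triage step matters more than its price.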
Q: The Notion AI "fix grammar" task uses gpt-4o-mini while "improve writing" uses gpt-4o. How would you decide which model to use for a new task?
A: Model selection is a quality vs cost tradeoff that should be decided empirically, not by intuition. The decision framework has three steps.
- First, characterize the task: is it deterministic (grammar correction has one right answer per input) or creative (writing improvement has many acceptable outputs)? Deterministic tasks work well with cheaper models at low temperature. Creative tasks often require capable models to produce high-quality variation.
- Second, run a pilot experiment: generate 100 representative inputs for the task. Call both the cheap model and the capable model on each input. Have human raters (or an LLM judge calibrated against human ratings) score quality for each output. Compute cost per call for each model.
- Third, compute quality-adjusted cost: if the cheap model achieves 85% of the quality of the capable model at 20% of the cost, that is usually an excellent tradeoff. If the cheap model achieves 60% of the quality, it may produce too many user-visible failures. The threshold depends on the task's visibility and failure cost. For grammar correction, a 15% quality shortfall might mean occasional missed errors - acceptable. For "generate a legal contract summary," a 15% quality shortfall might mean incorrect legal interpretations - not acceptable.
The decision is never "which model is better" - it is always "which model is good enough for this specific task at what cost."
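The third step can be sketched as a comparison of two ratios. This assumes the pilot produces a numeric quality score per output on a shared scale; the function names and the per-task `min_quality_ratio` parameter are hypothetical.

```python
from statistics import mean


def compare_models(cheap_scores, capable_scores, cheap_cost, capable_cost):
    """Return (quality_ratio, cost_ratio) of the cheap model relative to the capable one."""
    quality_ratio = mean(cheap_scores) / mean(capable_scores)
    cost_ratio = cheap_cost / capable_cost
    return quality_ratio, cost_ratio


def pick_model(cheap_scores, capable_scores, cheap_cost, capable_cost,
               min_quality_ratio):
    """Choose the cheap model only if it clears the task-specific quality bar."""
    quality_ratio, _ = compare_models(
        cheap_scores, capable_scores, cheap_cost, capable_cost
    )
    return "cheap" if quality_ratio >= min_quality_ratio else "capable"
```

The quality bar is the input that changes per task: a grammar fixer might accept a ratio around 0.85, while a legal summarizer would set it much closer to 1.0.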
Q: What are the top three patterns you would apply when designing any new LLM-powered product feature from scratch?
A:
- First, decompose into a pipeline, not a single prompt. Every feature that starts as "one big prompt" ends up as a pipeline once you understand the failure modes. Start with the pipeline: input validation, classification or routing, task-specific handler, output validation, fallback. This structure makes each component independently testable and replaceable.
- Second, build the evaluation pipeline before the product. Assemble 50-100 labeled examples from user interviews, internal testing, or synthetic data before writing a line of application code. Every product decision - model selection, prompt changes, model updates - will be validated against this dataset. Teams that skip this step spend months debugging quality issues that a 100-example evaluation dataset would have caught in minutes.
- Third, design the failure path before the success path. What happens when the LLM returns a malformed response? What happens when it is confidently wrong? What happens when the provider is down? The failure paths determine how gracefully the product degrades and how often users escalate to humans. Design these paths first - they constrain the success path architecture more than the success path does.
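The pipeline shape in the first point can be sketched with plain functions. Everything here is a stand-in: `classify`, the handlers, and `validate_output` are hypothetical callables that would wrap real model calls, and the fallback message is illustrative.

```python
from typing import Callable, Dict

FALLBACK = "Sorry, I can't help with that right now."


def run_pipeline(
    user_input: str,
    classify: Callable[[str], str],
    handlers: Dict[str, Callable[[str], str]],
    validate_output: Callable[[str], bool],
    fallback: str = FALLBACK,
) -> str:
    # 1. Input validation: reject empty input before spending any tokens.
    text = user_input.strip()
    if not text:
        return fallback
    try:
        # 2. Classification / routing to a task-specific handler.
        handler = handlers.get(classify(text))
        if handler is None:
            return fallback
        # 3. Task-specific handler (in a real system, this calls the LLM).
        output = handler(text)
    except Exception:
        # Failure path: provider outage, timeout, or malformed response.
        return fallback
    # 4. Output validation, with the fallback as the final stage.
    return output if validate_output(output) else fallback
```

Because each stage is an injected function, each can be tested or swapped independently, and the failure path is exercised by passing a handler that raises.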
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Prompt Routing demo on the EngineersOfAI Playground - no code required.
:::
