Five detailed production LLM architectures - GitHub Copilot, Notion AI, customer support bots, enterprise RAG, and code review agents - with real architecture decisions, scale numbers, and lessons learned.

Case Studies: Production LLM Systems

What Real Systems Look Like

There is a significant gap between the LLM tutorial and the production system. Tutorials show you how to call the API. They rarely show you how to handle 1.8 million users generating hundreds of millions of API calls per day, how to keep a completion latency under 150ms when each inference call takes 800ms, or why the model you chose for your MVP is the wrong model for your third production incident.

The case studies in this lesson are not hypothetical. They are reconstructed from public engineering blog posts, conference talks, and technical interviews with engineers who built these systems. Where numbers are cited, they come from public sources. Where specific internal details are not available, the analysis is based on first-principles reasoning about the constraints any engineering team would face building these systems.

Five systems, five different problem domains, five different sets of tradeoffs. After each case study, the key patterns are extracted - these patterns appear across multiple systems and represent the closest thing to "best practices" that this field has developed.

Case Study 1: GitHub Copilot

Context

GitHub Copilot launched as a technical preview in June 2021, powered by OpenAI Codex, and became generally available in June 2022. By early 2024, Microsoft reported 1.8 million paid subscribers. It runs as an IDE extension that watches every keystroke and generates code completions in real time. The engineering constraints are severe: the completion must appear fast enough to feel instantaneous while the user is still typing - or it becomes a distraction rather than a tool.

Architecture

Context Assembly

The key engineering insight in Copilot is how context is assembled. The model receives a fixed-size context window, and filling it intelligently is the difference between good suggestions and irrelevant ones.

Context elements (roughly in priority order):

The file currently being edited (prefix up to cursor + suffix after cursor)
Recently edited files in the current session
Files open in other editor tabs
Files with similar names (e.g., the .test.ts file when editing a .ts file)
File path structure (the directory tree as text, limited)
Language-specific signals: imports, function signatures, class definitions

Sending both the prefix and the suffix of the current file is called "fill in the middle" (FIM): the model receives the code before AND after the cursor position and must predict what belongs in the gap. FIM-trained models (Codex, Code Llama, DeepSeek Coder) significantly outperform left-to-right models for mid-function completion.
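To make the prefix/suffix idea concrete, here is a minimal sketch of FIM prompt assembly. The sentinel strings, the character-based budget, and the 75/25 split are illustrative assumptions: each FIM-trained model defines its own special tokens and ordering, and nothing here is Copilot's actual prompt format.

```python
from dataclasses import dataclass

# Hypothetical sentinel strings; real FIM models each define their own tokens.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


@dataclass
class FimBudget:
    max_chars: int = 2000        # assumed budget, counted in characters for simplicity
    prefix_share: float = 0.75   # assumed split: most of the budget goes to the prefix


def build_fim_prompt(prefix: str, suffix: str, budget: FimBudget = FimBudget()) -> str:
    """Keep the text nearest the cursor: the END of the prefix and the START of the suffix."""
    prefix_budget = int(budget.max_chars * budget.prefix_share)
    suffix_budget = budget.max_chars - prefix_budget
    # Guard against prefix[-0:], which would return the whole string.
    kept_prefix = prefix[-prefix_budget:] if prefix_budget > 0 else ""
    kept_suffix = suffix[:suffix_budget]
    # Prefix-suffix-middle ordering: the model generates the gap after the middle sentinel.
    return f"{FIM_PREFIX}{kept_prefix}{FIM_SUFFIX}{kept_suffix}{FIM_MIDDLE}"
```

Truncating from the far ends keeps the code closest to the cursor, which is usually the most relevant context when the window is too small for the whole file.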

Latency Architecture

150ms perceived latency is the target for inline suggestions. Achieving this requires several techniques working together.

Debouncing: don't send a request for every keystroke. Wait until the user has stopped typing for 75ms. If a new keystroke arrives before the request returns, cancel the old request and start fresh.

Speculative completions: pre-generate completions for likely cursor positions before the user arrives there. If the user accepts a completion, the extension can immediately start generating the next line speculatively.

Client-side cache: cache completions keyed by (file content hash, cursor position). If the user undoes a change and retypes the same code, the cached completion appears instantly.

Streaming: start displaying the completion as tokens arrive, not after the full completion is ready. With a latency-optimized model, the first token can arrive in roughly 200ms; by the time the user reads the first 10 tokens, the rest have arrived.

import asyncio
import hashlib
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CompletionRequest:
    prefix: str      # code before cursor
    suffix: str      # code after cursor (FIM)
    language: str
    file_path: str
    cursor_line: int
    cursor_col: int


@dataclass
class CopilotCompletionEngine:
    """
    Simplified model of Copilot's debounce + cache + request management.
    """
    debounce_ms: float = 75.0
    cache: dict = field(default_factory=dict)
    pending_request: Optional[asyncio.Task] = None

    def _cache_key(self, req: CompletionRequest) -> str:
        content = f"{req.prefix}|||{req.suffix}|||{req.language}"
        return hashlib.sha256(content.encode()).hexdigest()[:16]

    async def get_completion(
        self,
        req: CompletionRequest,
        llm_fn,
    ) -> Optional[str]:
        # Check cache first
        key = self._cache_key(req)
        if key in self.cache:
            return self.cache[key]

        # Cancel any pending request
        if self.pending_request and not self.pending_request.done():
            self.pending_request.cancel()

        # Debounce: wait for typing to stop
        async def _delayed_call():
            await asyncio.sleep(self.debounce_ms / 1000)
            result = await llm_fn(req)
            if result:
                self.cache[key] = result
            return result

        self.pending_request = asyncio.create_task(_delayed_call())
        try:
            return await self.pending_request
        except asyncio.CancelledError:
            return None  # new keystroke arrived, request was cancelled

Scale and Model Selection

At 1.8M subscribers generating an average of ~100 suggestions per developer per hour (estimated from public talks), Copilot processes on the order of hundreds of millions of completions per day. At this scale, model selection is a cost and latency decision, not just a quality decision.
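The "hundreds of millions" figure can be checked with a back-of-envelope calculation. The subscriber count and the ~100 suggestions per hour come from the text above; the active hours per day and the share of subscribers active on a given day are assumptions chosen only to bracket the estimate.

```python
def completions_per_day(
    subscribers: int,
    suggestions_per_hour: float,
    active_hours_per_day: float,   # assumption, not from the source
    daily_active_share: float,     # assumption, not from the source
) -> float:
    """Rough daily completion volume under the stated assumptions."""
    return subscribers * daily_active_share * suggestions_per_hour * active_hours_per_day


# Conservative case: half of subscribers active, 2 coding hours each.
low = completions_per_day(1_800_000, 100, 2, 0.5)
# Aggressive case: every subscriber active, 5 coding hours each.
high = completions_per_day(1_800_000, 100, 5, 1.0)

print(f"{low:,.0f} to {high:,.0f} completions/day")
print(f"{low / 86_400:,.0f} to {high / 86_400:,.0f} requests/second on average")
```

Both ends of the range land in the hundreds of millions per day, and traffic concentrates in working hours, so peak load will sit well above the daily average.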

GitHub has been transparent about using different models for different contexts:

Inline completions (the primary product): latency-optimized model. GPT-4 was too slow for inline use - the first token arrives 400–600ms after the request, over the acceptable threshold. GitHub uses a smaller, faster model for inline completions.
Copilot Chat (conversational): quality-optimized model. GPT-4 is acceptable here because the user expects to wait 2–5 seconds for a chat response.

This model tiering by interaction type is a pattern that appears in every case study in this lesson.

Lessons Learned

Lesson 1: Latency beats quality for inline suggestions. A 95% quality completion that appears in 120ms is more useful than a 99% quality completion that appears in 400ms. Users will dismiss the 400ms suggestion before it fully renders.

Lesson 2: Context assembly quality is the primary quality lever. The model is fixed. You cannot make it smarter in production. But you can give it better context. Most of Copilot's quality improvements over time have come from better context assembly, not model size increases.

Lesson 3: Speculative execution hides latency. By pre-generating likely completions and caching at the client, the perceived latency is often 0ms - the completion was already there. This is the single most impactful latency optimization.

Case Study 2: Notion AI

Context

Notion AI launched in February 2023, integrated directly into the Notion workspace. Unlike Copilot (one task: code completion), Notion AI supports 15+ different task types: summarize, improve writing, fix spelling and grammar, continue writing, find action items, explain this, translate, and more. Each task has different quality requirements, different prompt structures, and different optimal models.

Multi-Task Architecture

Separation of Concerns: One Prompt Per Task

A naive implementation would use one general-purpose system prompt and put the task instruction in the user message. Notion's engineering team found that task-specific prompts dramatically outperform general-purpose prompts for formatting tasks. The "improve writing" task requires a specific tone and style profile. The "find action items" task requires structured JSON output. The "translate" task requires language-specific behavior.

from dataclasses import dataclass
from enum import Enum
from openai import OpenAI


class NotionTaskType(Enum):
    SUMMARIZE = "summarize"
    IMPROVE_WRITING = "improve_writing"
    FIX_GRAMMAR = "fix_grammar"
    CONTINUE_WRITING = "continue_writing"
    FIND_ACTION_ITEMS = "find_action_items"
    EXPLAIN = "explain"
    TRANSLATE = "translate"


@dataclass
class TaskConfig:
    system_prompt: str
    model: str
    temperature: float
    max_tokens: int
    response_format: str = "text"  # or "json"


TASK_CONFIGS: dict[NotionTaskType, TaskConfig] = {
    NotionTaskType.FIX_GRAMMAR: TaskConfig(
        system_prompt="""You are an expert editor. Fix only grammar, spelling, and punctuation errors.
Do NOT change the author's voice, tone, or word choices. Do NOT rewrite sentences unless grammatically broken.
Return only the corrected text with no explanation.""",
        model="gpt-4o-mini",    # Simple task -> cheap model
        temperature=0.1,        # Low temperature for deterministic edits
        max_tokens=2048,
    ),
    NotionTaskType.IMPROVE_WRITING: TaskConfig(
        system_prompt="""You are an expert writing coach. Improve the clarity, flow, and impact of this text.
Preserve the core meaning and the author's voice. Make it more engaging and professional.
Return only the improved text with no explanation.""",
        model="gpt-4o",         # Creative task -> capable model
        temperature=0.7,
        max_tokens=4096,
    ),
    NotionTaskType.FIND_ACTION_ITEMS: TaskConfig(
        system_prompt="""Extract all action items from the following text.
An action item is a task, commitment, or next step assigned to a person or unassigned.
Return a JSON object of the form {"action_items": [{"task": string, "assignee": string|null, "due": string|null}]}
(JSON mode returns a top-level object, so the array is wrapped in a key.)
Return only the JSON object, nothing else.""",
        model="gpt-4o-mini",
        temperature=0.0,        # Zero temperature for structured extraction
        max_tokens=1024,
        response_format="json",
    ),
    NotionTaskType.SUMMARIZE: TaskConfig(
        system_prompt="""Summarize the following text in 2-3 concise paragraphs.
Capture the main ideas, key decisions, and important details.
Write in a clear, professional tone. Return only the summary.""",
        model="gpt-4o-mini",
        temperature=0.3,
        max_tokens=512,
    ),
}


class NotionAI:
    def __init__(self, client: OpenAI):
        self.client = client

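    def process(
        self,
        task_type: NotionTaskType,
        selected_text: str,
        page_context: str = "",
    ):
        """Return the SDK's streaming iterator of chunks (stream=True), not a str."""
        # Only some task types are registered above; fail with a clear error for the rest.
        if task_type not in TASK_CONFIGS:
            raise ValueError(f"No TaskConfig registered for {task_type}")
        config = TASK_CONFIGS[task_type]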

        user_message = selected_text
        if page_context and task_type in [
            NotionTaskType.CONTINUE_WRITING,
            NotionTaskType.SUMMARIZE,
        ]:
            user_message = f"Page context:\n{page_context}\n\nSelected text:\n{selected_text}"

        kwargs = {
            "model": config.model,
            "messages": [
                {"role": "system", "content": config.system_prompt},
                {"role": "user", "content": user_message},
            ],
            "max_tokens": config.max_tokens,
            "temperature": config.temperature,
            "stream": True,
        }

        if config.response_format == "json":
            kwargs["response_format"] = {"type": "json_object"}

        return self.client.chat.completions.create(**kwargs)

Rate Limiting: Credit System

Notion AI uses a workspace-level credit system rather than per-user rate limits. Free workspaces get 20 AI responses per month total. Paid workspaces get unlimited access. This prevents a single heavy user from making the product feel limited to their whole team.

The credit system creates interesting engineering complexity: when a workspace is out of credits, you need to surface a clear error immediately (before calling the LLM), and you need to handle race conditions when two users hit the last credit simultaneously.
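One way to handle both requirements is to reserve a credit atomically before the LLM call and refund it if the call fails. The sketch below is an in-memory illustration under stated assumptions: the class and function names are hypothetical, the refund-on-failure policy is an assumption, and a real deployment would replace the process-local lock with an atomic database operation.

```python
import threading


class WorkspaceCredits:
    """In-memory stand-in for a credit ledger (hypothetical; not Notion's implementation)."""

    def __init__(self, initial: dict[str, int]):
        self._remaining = dict(initial)
        self._lock = threading.Lock()

    def try_reserve(self, workspace_id: str) -> bool:
        # Check and decrement under one lock so two requests cannot both take the last credit.
        with self._lock:
            if self._remaining.get(workspace_id, 0) <= 0:
                return False
            self._remaining[workspace_id] -= 1
            return True

    def refund(self, workspace_id: str) -> None:
        with self._lock:
            self._remaining[workspace_id] = self._remaining.get(workspace_id, 0) + 1


def run_ai_task(credits: WorkspaceCredits, workspace_id: str, call_llm):
    """Reserve first, then call the model; call_llm is any zero-argument callable."""
    if not credits.try_reserve(workspace_id):
        # Fail fast: no LLM call is made when the workspace has no credits.
        raise PermissionError("Workspace is out of AI credits")
    try:
        return call_llm()
    except Exception:
        credits.refund(workspace_id)  # assumed policy: failed calls do not consume a credit
        raise
```

Reserving before the call, rather than charging after it, is what closes the race: the second of two simultaneous requests sees zero credits and is rejected without reaching the model.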

Lessons Learned

Lesson 1: Task-specific prompts and models beat a single general-purpose approach. A fine-tuned smaller model for grammar correction outperforms GPT-4 with a generic editing prompt, and costs 50× less per call. Invest in task decomposition upfront.

Lesson 2: Streaming is a product requirement, not an optimization. Users who see text appearing token by token perceive the response as faster and more intelligent than users who see a loading spinner followed by a block of text, even when the actual time-to-complete is identical. Implement streaming from day one.

Lesson 3: Model selection is a continuous optimization, not a one-time decision. Notion has changed their model selection multiple times - as new cheaper models became available, they re-evaluated which tasks could be downgraded. Treat model selection as an A/B testable parameter.

Case Study 3: Customer Support Bot

Context

This case study represents a pattern deployed by dozens of companies: an LLM-powered customer support assistant that handles Tier 1 inquiries. The architecture here is a composite based on multiple public descriptions from companies including Intercom, Zendesk, and several startups.

The key constraints for customer support: users expect near-human quality, mistakes have real business cost (wrong billing information, incorrect refund policy), and the system must know when to hand off to a human agent.

Architecture

Intent Classification and Routing

Before calling a powerful and expensive model, classify the user's intent with a cheap, fast classifier. This determines which handler processes the request.

from enum import Enum
from openai import OpenAI


class SupportIntent(Enum):
    FAQ = "faq"                          # Can answer from knowledge base
    ACCOUNT_ACTION = "account_action"    # Needs tool calls (account data)
    COMPLAINT = "complaint"              # High-stakes, route to human
    BILLING = "billing"                  # Needs billing system access
    UNKNOWN = "unknown"                  # Escalate


INTENT_PROMPT = """Classify this customer support message into one category:
- FAQ: general questions about product features, pricing, policies
- ACCOUNT_ACTION: requests that require looking up or modifying account data
- BILLING: questions or disputes about charges, invoices, refunds
- COMPLAINT: expressions of dissatisfaction, frustration, or formal complaints
- UNKNOWN: unclear or cannot be classified

Message: {message}

Respond with only the category name (FAQ, ACCOUNT_ACTION, BILLING, COMPLAINT, or UNKNOWN)."""


def classify_intent(message: str, client: OpenAI) -> SupportIntent:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=20,
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": INTENT_PROMPT.format(message=message[:500]),
            }
        ],
    )
    raw = response.choices[0].message.content.strip().upper()
    try:
        return SupportIntent[raw]
    except KeyError:
        return SupportIntent.UNKNOWN

Human Handoff with Context Transfer

The handoff to a human agent is the most important interaction in a support bot. A botched handoff - where the human agent cannot see the conversation history, the user has to repeat themselves, and the context is lost - is worse than never having had the bot at all.

from dataclasses import dataclass
from typing import Optional
import json


@dataclass
class ConversationContext:
    conversation_id: str
    user_id: str
    messages: list[dict]
    intent_history: list[str]
    entities_extracted: dict   # account_id, order_id, etc. from tool calls
    escalation_reason: str
    bot_confidence: float
    suggested_resolution: Optional[str]


def prepare_agent_handoff(ctx: ConversationContext) -> dict:
    """
    Prepare a structured handoff package for the human agent.
    This appears in the agent's UI when they receive the conversation.
    """
    # Generate a summary for the human agent using LLM
    summary_prompt = f"""Summarize this customer conversation for a human support agent.
Include: what the customer needs, what the bot tried, why it escalated, and suggested next steps.
Keep it under 100 words.

Conversation: {json.dumps(ctx.messages[-6:], indent=2)}"""

    # (In production, call LLM here)
    # summary = llm.complete(summary_prompt)

    return {
        "conversation_id": ctx.conversation_id,
        "user_id": ctx.user_id,
        "full_transcript": ctx.messages,
        "escalation_reason": ctx.escalation_reason,
        "bot_confidence": ctx.bot_confidence,
        "entities": ctx.entities_extracted,  # pre-populated account data
        "suggested_resolution": ctx.suggested_resolution,
        # "summary": summary,
        "priority": "high" if ctx.bot_confidence < 0.3 else "normal",
    }

CSAT as Ground Truth for Evaluation

Post-conversation customer satisfaction (CSAT) surveys are the most honest evaluation signal available. Correlate CSAT scores with:

Which intent category the conversation was classified as
Whether the bot handled it or escalated
Which response model was used
Which prompt version generated the final response

import pandas as pd
import numpy as np


def analyze_csat_by_segment(
    conversations_df: pd.DataFrame,
) -> pd.DataFrame:
    """
    Group CSAT scores by key dimensions to identify improvement opportunities.
    """
    return (
        conversations_df.groupby(["intent_category", "handled_by", "model_used"])
        .agg(
            avg_csat=("csat_score", "mean"),
            response_count=("conversation_id", "count"),
            escalation_rate=("was_escalated", "mean"),
        )
        .reset_index()
        .sort_values("avg_csat", ascending=True)  # worst first
    )

Lessons Learned

Lesson 1: The intent classification step determines 80% of the outcome. A correctly classified intent routes to the right handler. A misclassified intent produces a bad response regardless of how good the handler is. Invest heavily in intent classifier accuracy, especially at the boundaries between categories.

Lesson 2: Design the escalation path before the bot path. Most teams design the happy path (bot handles everything) and retrofit escalation. The better approach is to design escalation first: what context does the human agent need? How does the conversation state transfer? Then build the bot path as a pre-filter that reduces escalation volume.

Lesson 3: CSAT correlation drives product decisions. When you can correlate CSAT scores with specific conversation patterns, you stop guessing about what to improve. "Billing questions handled by the bot have 2.1 CSAT vs 4.3 CSAT when handled by humans" is actionable - disable bot handling for billing, or invest in improving it.

Case Study 4: Enterprise Document Search (RAG at Scale)

Context

An enterprise software company with 5,000 employees needed a document search system that could answer questions across 200,000 internal documents: engineering specs, HR policies, sales playbooks, legal agreements, historical decisions. The challenge: multi-tenancy (different departments can see different documents), freshness (documents update frequently), and quality (wrong answers in an enterprise context have real consequences).

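Indexing Pipeline

Each document is hashed, split into overlapping chunks, embedded, and upserted into the vector store together with its access metadata (groups and clearance level). Documents whose hash has not changed are skipped.

Query Pipeline

The query is embedded, the top 20 chunks are retrieved under the user's access filter, a reranker narrows them to 5, and the LLM generates an answer with source citations.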

Multi-Tenant Access Control

The hardest engineering problem in enterprise RAG is not retrieval quality - it is access control. An employee in the marketing department must not receive answers grounded in confidential engineering roadmap documents. Access control must be enforced at query time, not at indexing time.

from dataclasses import dataclass
from typing import Optional


@dataclass
class UserContext:
    user_id: str
    department: str
    access_groups: list[str]  # e.g., ["engineering", "all-staff", "managers"]
    clearance_level: int       # 0 = public, 1 = internal, 2 = confidential, 3 = restricted


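def build_access_filter(user: UserContext) -> dict:
    """
    Build a Pinecone-style metadata filter that enforces access control.
    (Other stores such as Weaviate use a different filter syntax; adapt accordingly.)
    Documents are indexed with their required access_groups.
    A document is accessible if ANY of its required groups matches ANY of the user's groups
    (or it is tagged "all-staff") AND its clearance level does not exceed the user's.
    """
    return {
        # Explicit $and makes the group check and the clearance check both mandatory.
        "$and": [
            {
                "$or": [
                    {"access_groups": {"$in": user.access_groups}},
                    {"access_groups": {"$eq": "all-staff"}},
                ]
            },
            {"clearance_level": {"$lte": user.clearance_level}},
        ]
    }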


async def rag_query(
    user_query: str,
    user: UserContext,
    vector_store,
    embedding_model,
    llm_client,
) -> dict:
    # Step 1: Embed query
    query_embedding = await embedding_model.embed(user_query)

    # Step 2: Retrieve with access filter
    access_filter = build_access_filter(user)
    raw_results = await vector_store.query(
        vector=query_embedding,
        filter=access_filter,
        top_k=20,
        include_metadata=True,
    )

    # Step 3: Rerank (cross-encoder, no access filter needed - already applied)
    reranked = await rerank(user_query, raw_results, top_k=5)

    # Step 4: Generate with citations
    context = "\n\n".join(
        f"[Source {i+1}: {r.metadata['doc_title']}]\n{r.text}"
        for i, r in enumerate(reranked)
    )

    response = await llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """Answer the question using only the provided sources.
For every claim, cite the source number in brackets like [1] or [2].
If the sources do not contain enough information to answer, say so explicitly.""",
            },
            {
                "role": "user",
                "content": f"Sources:\n{context}\n\nQuestion: {user_query}",
            },
        ],
    )

    return {
        "answer": response.choices[0].message.content,
        "sources": [
            {
                "title": r.metadata["doc_title"],
                "url": r.metadata["doc_url"],
                "chunk": r.text[:200],
            }
            for r in reranked
        ],
    }

Freshness: Incremental Re-indexing

Enterprise documents change. A RAG system that answers from the old version of an HR policy that was updated yesterday is worse than no RAG system at all. Incremental re-indexing must be a first-class requirement.

import hashlib
import asyncio
from datetime import datetime


class IncrementalIndexer:
    """
    Tracks document versions and re-indexes only changed documents.
    """

    def __init__(self, vector_store, embedding_model, doc_registry):
        self.store = vector_store
        self.embedder = embedding_model
        self.registry = doc_registry  # DB table: doc_id, content_hash, indexed_at

    async def sync_document(self, doc_id: str, content: str, metadata: dict):
        content_hash = hashlib.sha256(content.encode()).hexdigest()
        existing = await self.registry.get(doc_id)

        if existing and existing["content_hash"] == content_hash:
            return  # no change, skip

        # Delete old chunks
        if existing:
            await self.store.delete(filter={"doc_id": {"$eq": doc_id}})

        # Index new chunks
        chunks = self._chunk(content)
        embeddings = await asyncio.gather(
            *[self.embedder.embed(chunk) for chunk in chunks]
        )

        vectors = [
            {
                "id": f"{doc_id}-chunk-{i}",
                "values": emb,
                "metadata": {
                    **metadata,
                    "doc_id": doc_id,
                    "chunk_index": i,
                    "chunk_text": chunk,
                    "indexed_at": datetime.utcnow().isoformat(),
                },
            }
            for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
        ]

        await self.store.upsert(vectors)
        await self.registry.set(doc_id, content_hash)

    def _chunk(self, text: str, size: int = 512, overlap: int = 50) -> list[str]:
        words = text.split()
        chunks = []
        for i in range(0, len(words), size - overlap):
            chunk = " ".join(words[i:i + size])
            if chunk:
                chunks.append(chunk)
        return chunks

Lessons Learned

Lesson 1: Retrieval quality is 80% of the problem. The LLM cannot synthesize a good answer from irrelevant chunks. Before optimizing prompts, ensure your retrieval pipeline (hybrid search + reranking) returns relevant, high-quality chunks. Most production RAG failures are retrieval failures, not generation failures.

Lesson 2: Query expansion improves recall on enterprise vocabulary. Internal documents use domain-specific terminology. A user asking "what is the vacation policy" may not match a document titled "PTO accrual and carry-forward guidelines." Generate 2-3 alternative phrasings of each query before retrieval - cheap (gpt-4o-mini) and significantly improves recall.

Lesson 3: Access control at query time, not index time. Applying access filters at indexing (separate indices per department) creates maintenance nightmares when permissions change. Apply filters as metadata predicates at query time - the vector database filters out unauthorized chunks before retrieval.
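The query expansion described in Lesson 2 can be sketched as follows. The function names are hypothetical, `paraphrase_fn` stands in for the cheap-model call that generates alternative phrasings, and merging the result lists with reciprocal rank fusion is a choice made for this sketch, not something the lesson prescribes.

```python
from typing import Callable


def expanded_retrieve(
    query: str,
    paraphrase_fn: Callable[[str], list[str]],  # hypothetical: returns 2-3 rephrasings
    search_fn: Callable[[str], list[str]],      # hypothetical: returns ranked chunk ids
    top_k: int = 5,
    rrf_k: int = 60,                            # common RRF smoothing constant
) -> list[str]:
    """Retrieve with the original query plus paraphrases, merged by reciprocal rank fusion."""
    queries = [query] + [q for q in paraphrase_fn(query) if q != query]
    scores: dict[str, float] = {}
    for q in queries:
        for rank, chunk_id in enumerate(search_fn(q)):
            # A chunk earns more score the higher it ranks and the more queries find it.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank + 1)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)[:top_k]
```

In the vacation-policy example, the original phrasing may never surface the PTO document, but a paraphrase such as "PTO policy" does, and fusion lets that document rise to the top. The access filter still has to be applied inside `search_fn` for every expanded query.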

Case Study 5: Code Review Agent

Context

A developer tools company built an automated code review agent that analyzes pull requests, identifies issues, suggests improvements, and posts structured comments. Unlike Copilot (synchronous, latency-critical), code review is asynchronous - the developer submits a PR and checks comments when ready. The acceptable latency window is minutes, not milliseconds. This changes the optimization target from latency to cost and thoroughness.

Agentic Architecture

Tool Definitions

from openai import OpenAI
import json
import subprocess

CODE_REVIEW_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_codebase",
            "description": "Search the codebase for related files, functions, or patterns.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search query (function name, pattern, or description)",
                    },
                    "file_type": {
                        "type": "string",
                        "enum": ["py", "ts", "js", "go", "java", "all"],
                        "description": "Filter by file extension",
                    },
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "run_linter",
            "description": "Run the linter on a specific file and return issues found.",
            "parameters": {
                "type": "object",
                "properties": {
                    "file_path": {
                        "type": "string",
                        "description": "Path to the file to lint",
                    },
                },
                "required": ["file_path"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_function_definition",
            "description": "Get the full definition of a function or class in the codebase.",
            "parameters": {
                "type": "object",
                "properties": {
                    "symbol": {
                        "type": "string",
                        "description": "Function or class name to look up",
                    },
                    "file_path": {
                        "type": "string",
                        "description": "Optional: specific file to search in",
                    },
                },
                "required": ["symbol"],
            },
        },
    },
]


async def run_code_review_agent(
    pr_diff: str,
    repo_context: dict,
    client: OpenAI,
    max_iterations: int = 5,
) -> list[dict]:
    """
    Agentic code review: plan -> parallel tool calls -> synthesize comments.
    """
    messages = [
        {
            "role": "system",
            "content": """You are an expert code reviewer. Analyze this pull request and identify:
1. Correctness issues (bugs, logic errors, off-by-one errors)
2. Security vulnerabilities (SQL injection, XSS, insecure dependencies)
3. Performance problems (N+1 queries, unnecessary re-renders, memory leaks)
4. Code quality issues (missing error handling, unclear naming, missing tests)

Use the available tools to look up context before commenting.
Be specific: cite line numbers and explain why each issue matters.
Be constructive: suggest the fix, not just the problem.""",
        },
        {
            "role": "user",
            "content": f"""Review this pull request:

Repository context:
{json.dumps(repo_context, indent=2)}

PR Diff:
{pr_diff}

Use tools to look up any context you need, then provide structured review comments.""",
        },
    ]

    for iteration in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=CODE_REVIEW_TOOLS,
            tool_choice="auto",
            temperature=0.2,  # Low temperature for deterministic review
            max_tokens=4096,
        )

        msg = response.choices[0].message
        messages.append(msg)

        # No more tool calls -> agent is done
        if not msg.tool_calls:
            break

        # Execute tool calls (can parallelize)
        import asyncio
        tool_results = await asyncio.gather(
            *[execute_tool(tc) for tc in msg.tool_calls]
        )

        for tool_call, result in zip(msg.tool_calls, tool_results):
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result),
            })

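    # Extract final structured comments. If the loop hit max_iterations while the
    # model was still requesting tools, the last message is a tool result (a dict),
    # so use the last assistant message that actually has text content.
    final = next(
        (m for m in reversed(messages) if not isinstance(m, dict) and m.content),
        None,
    )
    return parse_review_comments(final.content) if final else []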


async def execute_tool(tool_call) -> dict:
    """Execute a single tool call. Real implementation connects to actual tools."""
    import asyncio  # local import, matching the inline import used in the agent loop

    name = tool_call.function.name
    args = json.loads(tool_call.function.arguments)

    if name == "run_linter":
        # subprocess.run blocks, so run it in a worker thread; otherwise
        # asyncio.gather in the agent loop would execute tools one at a time.
        result = await asyncio.to_thread(
            subprocess.run,
            ["pylint", args["file_path"], "--output-format=json"],
            capture_output=True,
            text=True,
        )
        return {"issues": json.loads(result.stdout or "[]")}

    elif name == "search_codebase":
        # List matching files with ripgrep. -l cannot be combined with --json,
        # and "--" stops a query that starts with "-" from being read as a flag.
        result = await asyncio.to_thread(
            subprocess.run,
            ["rg", "-l", "--", args["query"]],
            capture_output=True,
            text=True,
        )
        return {"files": result.stdout.splitlines()[:10]}

    return {"error": f"Unknown tool: {name}"}

Cost Analysis

At GPT-4o rates, a thorough code review costs $0.03–$0.08 per PR (2,000–8,000 input tokens for diff + context, 1,000–3,000 output tokens for comments). At 100 PRs/day, that is $3–$8/day, well within budget for most engineering organizations.
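The arithmetic behind that range can be reproduced with a small calculator. The per-million-token prices below are assumptions picked for illustration (they roughly reproduce the range above); substitute current pricing before relying on the numbers.

```python
def review_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Cost of one model call, given prices in USD per million tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


# Assumed prices in USD per million tokens; not taken from the source.
INPUT_PRICE = 5.00
OUTPUT_PRICE = 15.00

low = review_cost_usd(2_000, 1_000, INPUT_PRICE, OUTPUT_PRICE)
high = review_cost_usd(8_000, 3_000, INPUT_PRICE, OUTPUT_PRICE)

prs_per_day = 100
print(f"per PR:  ${low:.3f} to ${high:.3f}")
print(f"per day: ${low * prs_per_day:.2f} to ${high * prs_per_day:.2f}")
```

Note that this counts a single model call. An agent loop resends the growing message history on every iteration, so a review that takes several tool-calling rounds will bill more input tokens than the diff size alone suggests.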

Cost optimization strategies:

Cheap model first pass: use GPT-4o-mini to triage the diff. Only send PRs above a complexity threshold to GPT-4o.
Context budget: limit codebase context to the 10 most relevant files. Searching the entire codebase increases cost without proportional quality gain.
Comment deduplication: if the agent produces 15 similar comments about the same pattern, collapse them into one. Fewer, better comments are more useful.
Low-temperature caching: code review for identical diffs can be cached. When developers re-push the same change, skip the agent call.

Determinism: Temperature and Caching

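Code reviews benefit from low temperature (0.1–0.3): low enough that the same diff produces largely consistent comments, but with enough variation to avoid formulaic output. Low temperature reduces sampling variation but does not guarantee identical output across calls, which is one reason to pair it with caching. At temperature 0, the agent can become too rigid - it misses issues that require creative reasoning.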

Caching strategy: hash the git diff (not the PR ID). If the same diff appears again (e.g., a force push that changes nothing), return the cached review. This is correct behavior: the code is the same, so the review should be the same.
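The caching strategy above can be sketched in a few lines. This is a minimal illustration, not the implementation of any particular product: the names `diff_cache_key`, `review_with_cache`, and the in-memory `_review_cache` dictionary are assumptions made for the example, and `run_review` stands in for whatever function invokes the agent.

```python
import hashlib
from typing import Callable

# Hypothetical in-memory cache keyed by diff hash; a real deployment would
# more likely use a shared store with an expiry policy.
_review_cache: dict[str, list[str]] = {}


def diff_cache_key(diff_text: str) -> str:
    """Hash the diff content itself, so the key does not depend on PR ID or commit SHA."""
    return hashlib.sha256(diff_text.encode("utf-8")).hexdigest()


def review_with_cache(diff_text: str, run_review: Callable[[str], list[str]]) -> list[str]:
    """Return the cached review for an identical diff; otherwise run the agent once and store it."""
    key = diff_cache_key(diff_text)
    if key not in _review_cache:
        _review_cache[key] = run_review(diff_text)
    return _review_cache[key]
```

Keying on the diff text rather than the PR ID means a force push with no code changes maps to the same key, while any real change to the diff produces a new key and a fresh review.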

Lessons Learned

Lesson 1: Agentic code review quality scales with tool quality, not model quality. The model's review is only as good as the context it receives. If search_codebase returns irrelevant files, the review will miss important patterns. Invest in tool implementation quality.

Lesson 2: Parallelizing tool calls is critical for latency. A sequential tool execution strategy (search → lint → docs → synthesize) takes 3-4 minutes. Parallel tool execution (search + lint + docs simultaneously) reduces wall time to 45-60 seconds. Always parallelize independent tool calls.

Lesson 3: Comment quality matters more than comment quantity. A review with 3 specific, actionable comments is more valuable than one with 15 generic observations. Implement a post-processing step that scores and filters comments by specificity and actionability before posting.
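The latency difference described in Lesson 2 can be reproduced with a small sketch. The tool names and durations here are invented stand-ins (`fake_tool` just sleeps); the point is only that awaiting tools one by one costs the sum of their durations, while `asyncio.gather` costs roughly the slowest one.

```python
import asyncio
import time


async def fake_tool(name: str, seconds: float) -> str:
    """Stand-in for a real tool call; sleeps instead of doing work."""
    await asyncio.sleep(seconds)
    return name


async def run_sequential(tools: list[tuple[str, float]]) -> list[str]:
    # Each tool waits for the previous one: wall time is the sum of durations.
    return [await fake_tool(name, seconds) for name, seconds in tools]


async def run_parallel(tools: list[tuple[str, float]]) -> list[str]:
    # gather() runs the coroutines concurrently and preserves input order.
    return list(await asyncio.gather(*(fake_tool(n, s) for n, s in tools)))


def timed(coro) -> tuple[list[str], float]:
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    tools = [("search", 0.2), ("lint", 0.2), ("docs", 0.2)]
    _, t_seq = timed(run_sequential(tools))
    _, t_par = timed(run_parallel(tools))
    print(f"sequential: {t_seq:.2f}s, parallel: {t_par:.2f}s")
```

This only helps when the tool calls are independent and genuinely non-blocking; a tool that blocks the event loop serializes the work again.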

Patterns Across All Case Studies

Pattern 1: Model Tiering

Every production system uses at least two model tiers. The cheap, fast tier (GPT-4o-mini, Claude 3 Haiku) handles classification, routing, and simple tasks. The capable tier (GPT-4o, Claude 3.5 Sonnet) handles generation, reasoning, and complex analysis.

| Tier | Models | Use Cases | Cost per 1K input tokens |
| --- | --- | --- | --- |
| Fast/cheap | GPT-4o-mini, Claude 3 Haiku | classification, routing, extraction | $0.00015–$0.00025 |
| Capable | GPT-4o, Claude 3.5 Sonnet | generation, reasoning, analysis | $0.0025–$0.003 |
| Specialized | Codex for code, embedding models | code completion, search | varies |

Pattern 2: Separation of Concerns

The worst LLM architectures try to do everything in one prompt. The best architectures separate concerns:

Input → Classification → Task-specific handler → Output validation → User

Each step has a clear responsibility. Each can be optimized, monitored, and replaced independently.
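A minimal sketch of that pipeline shape follows. Everything in it is a placeholder chosen for illustration: `classify` uses a string check where a production system would call a cheap model, the `HANDLERS` entries are stubs where task-specific prompts would live, and `validate` is a trivial length check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    task: str
    output: str
    valid: bool


def classify(user_input: str) -> str:
    """Placeholder classifier; in production this would be a cheap-model call."""
    return "summarize" if user_input.lower().startswith("summarize") else "answer"


# One handler per task, each free to use its own prompt and model (stubbed here).
HANDLERS: dict[str, Callable[[str], str]] = {
    "summarize": lambda text: f"[summary] {text[:40]}",
    "answer": lambda text: f"[answer] {text[:40]}",
}


def validate(output: str) -> bool:
    """Placeholder output check: non-empty and within a length limit."""
    return 0 < len(output) <= 2000


def run_pipeline(user_input: str) -> Result:
    """Input -> classification -> task-specific handler -> output validation."""
    task = classify(user_input)
    output = HANDLERS[task](user_input)
    return Result(task=task, output=output, valid=validate(output))
```

Because each stage is a plain function, swapping the classifier or one handler does not touch the others, and each stage can be tested and monitored on its own.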

Pattern 3: Human-in-the-Loop at the Boundaries

Every case study has human escalation or review paths. The pattern is not "LLM does everything" - it is "LLM handles the easy cases so humans can focus on the hard cases."

The escalation trigger design matters:

Copilot: no escalation (the IDE is the safety net - bad completion is just not accepted)
Notion AI: the user accepts or rejects the suggestion - always human-in-the-loop
Support bot: explicit confidence thresholds, CSAT feedback loop
Enterprise RAG: groundedness check routes low-confidence answers to human review
Code review: low-severity issues posted automatically, high-severity issues flagged for human review

Pattern 4: Offline Evaluation Before Production Changes

Every team that has run a serious LLM system for more than three months has built an offline evaluation pipeline:

Maintain a labeled dataset of (input, expected_output) pairs
On every prompt change or model update, run the new version against the dataset
Compare aggregate quality scores (and per-example regressions) against the current production version
Block deployment if quality regresses on any segment of the dataset

This is the single most important practice for maintaining production quality over time.
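The deployment gate in the last step could look something like the sketch below. It assumes each evaluation result has already been scored and tagged with a segment; the record shape (`{"segment": ..., "score": ...}`), the function names, and the 0.02 tolerance are all assumptions for the example.

```python
from collections import defaultdict
from statistics import mean


def segment_scores(results: list[dict]) -> dict[str, float]:
    """Average score per segment. Each result is {"segment": str, "score": float}."""
    by_segment: dict[str, list[float]] = defaultdict(list)
    for r in results:
        by_segment[r["segment"]].append(r["score"])
    return {seg: mean(scores) for seg, scores in by_segment.items()}


def regressed_segments(
    baseline: list[dict], candidate: list[dict], tolerance: float = 0.02
) -> list[str]:
    """Segments where the candidate scores below baseline by more than the tolerance."""
    base = segment_scores(baseline)
    cand = segment_scores(candidate)
    # A segment missing from the candidate run counts as a regression.
    return sorted(seg for seg in base if cand.get(seg, 0.0) < base[seg] - tolerance)


def can_deploy(baseline: list[dict], candidate: list[dict], tolerance: float = 0.02) -> bool:
    """Block deployment if any segment regressed."""
    return not regressed_segments(baseline, candidate, tolerance)
```

Checking per segment rather than only the overall average matters because an aggregate improvement can hide a sharp drop in one category of inputs.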

Pattern 5: Instrument Everything, Sample for Review

Log every LLM request: prompt, response, model, tokens, latency, cost, user_id, feature_id, prompt_version. Store these logs in a queryable store. Sample 5–10% for automated quality evaluation. Review 20 random examples per week manually.

The combination of full logging and sampled review catches both aggregate trends (automated metrics) and novel failure modes (manual review).
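As a rough sketch, the log record and the sampling decision might be modeled as follows. The field names follow the list above, with `request_id` and the token split added as assumptions; `sampled_for_eval` is a hypothetical helper that samples deterministically by hashing the request ID, so the same request always gets the same decision.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class LLMRequestLog:
    request_id: str
    prompt: str
    response: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    user_id: str
    feature_id: str
    prompt_version: str

    def to_json(self) -> str:
        """Serialize as one JSON line for a queryable log store."""
        return json.dumps(asdict(self))


def sampled_for_eval(request_id: str, rate: float = 0.05) -> bool:
    """Map the request ID to a stable value in [0, 1) and compare it to the rate."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Hash-based sampling avoids storing a separate "sampled" flag and keeps the decision reproducible when logs are reprocessed.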

Common Mistakes

:::danger Building the "do everything in one prompt" architecture
When a feature grows from one use case to five, the temptation is to add conditions and special cases to a single growing system prompt. The result is a prompt that tries to be a product manager, a customer service agent, a technical writer, and a sales assistant simultaneously - and does none of them well. Decompose into task-specific handlers early.
:::

:::danger Ignoring the escalation path until after launch
The escalation path to a human agent is not a fallback to design later. It is a core product experience. Users who need a human agent and cannot find a clear path are the most frustrated users you have. Design the escalation experience first, then build the bot automation as a layer on top of it.
:::

:::warning Treating evaluation as a launch-time activity
"We evaluated the system before launch and it performed well." LLM systems degrade. Models change (providers update models silently). Prompts drift. Input distributions shift. Evaluation is a continuous activity, not a gate you pass once. Build the evaluation pipeline before you build the product.
:::

:::danger Skipping access control in multi-tenant RAG
It is tempting to build a single shared index and trust that the LLM "won't reveal" restricted documents. This is not access control - it is wishful thinking. The LLM has no concept of access permissions. It will reveal whatever is in its context window. Apply metadata filters at query time to enforce access control at the retrieval layer, before any document content reaches the LLM.
:::

:::warning Optimizing for latency when the task is async
Code review, document indexing, and batch analysis are async tasks where users do not feel latency directly. Over-optimizing for latency in these contexts wastes engineering effort and sacrifices quality. Prioritize thoroughness and cost efficiency for async tasks, latency for synchronous user-facing interactions.
:::

Interview Questions

Q: Compare GitHub Copilot's architecture with a customer support bot. What are the fundamental differences in their design constraints?

A: Three fundamental differences. Latency: Copilot has a 150ms target - users feel anything slower as an interruption. A support bot has a 3–5 second acceptable window. This changes model selection (Copilot uses a smaller, faster model for inline; support uses a capable model for quality), caching (Copilot uses aggressive client-side speculative caching; support caches at the server level), and streaming (Copilot streams tokens as they arrive; support can also stream but the latency window is more forgiving). Error tolerance: Copilot errors are invisible - a bad suggestion is dismissed, the user types on. Support bot errors are highly visible - a wrong policy statement or a failed handoff is a support ticket. Copilot optimizes for volume of good suggestions; support optimizes for reliability. Interaction model: Copilot is one-shot (predict the next token given context). Support is multi-turn (maintain conversation state, handle topic changes, escalate on confidence thresholds). Support requires session management, intent tracking across turns, and a defined escalation path. Copilot has none of these.

Q: How would you design the access control system for an enterprise RAG application serving 5,000 employees across 20 departments?

A: Four-layer approach. First, document metadata at index time: when a document is indexed, store its access_groups (list of group IDs that can see it) and clearance_level (0-3) as vector metadata. This happens once at indexing and does not need to change unless document permissions change. Second, query-time filter: every search includes a metadata filter - only return chunks where the user's groups intersect with the chunk's access_groups, AND the clearance_level does not exceed the user's clearance. Never retrieve an unauthorized chunk and then filter in post-processing - that approach relies on the LLM not revealing the content, which is not a security guarantee. Third, citation provenance: every source cited in a response is verified against the access filter. If the response cites a document that the user should not see (a bug in the filter logic), catch it before returning. Fourth, audit logging: log every document retrieval with user_id, document_id, and the access filter applied. Quarterly access reviews cross-check these logs against current permission assignments. The critical constraint is that access control must be applied before any document content reaches the LLM context window - not after.
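The query-time filter in the second layer can be sketched with a toy in-memory index. This is an illustration under stated assumptions: `User`, `Chunk`, `is_authorized`, and `search` are names invented for the example, and the keyword-overlap ranking stands in for vector similarity. In a real vector store the same predicate would be expressed as a metadata filter inside the query itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    user_id: str
    groups: frozenset[str]
    clearance: int  # 0-3


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    access_groups: frozenset[str]
    clearance_level: int  # 0-3


def is_authorized(user: User, chunk: Chunk) -> bool:
    """Groups must intersect AND the chunk's clearance must not exceed the user's."""
    return bool(user.groups & chunk.access_groups) and chunk.clearance_level <= user.clearance


def search(index: list[Chunk], user: User, query: str, top_k: int = 5) -> list[Chunk]:
    """Filter by access first, then rank only the authorized chunks (toy keyword score)."""
    terms = set(query.lower().split())
    allowed = [c for c in index if is_authorized(user, c)]
    ranked = sorted(
        allowed,
        key=lambda c: len(terms & set(c.text.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]
```

The ordering is the point of the sketch: unauthorized chunks are excluded before ranking, so nothing the user cannot see is ever a candidate for the prompt.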

Q: Walk through how you would design the cost control system for an agentic code review bot processing 500 PRs per day.

A: At 500 PRs per day with an average cost of $0.05 per review, the baseline is $25/day ($750/month). Cost control at four levels. First, complexity triage: before calling GPT-4o, classify the PR diff with GPT-4o-mini. PRs with fewer than 30 lines of changes and no architectural modifications get a cheap mini-model review ($0.005). Only complex PRs go to GPT-4o ($0.05). If 60% of PRs are simple, daily cost drops from $25 to about $11.50 (300 × $0.005 + 200 × $0.05), before the small cost of the triage calls themselves. Second, context budget: limit codebase search to the 5 most relevant files, not unlimited searching. More context increases cost without proportional quality gain. Cap the tool call round-trips at 5 per PR. Third, diff caching: hash the git diff. If the developer force-pushes without code changes, return the cached review. This is a common pattern for PRs that fail CI checks and are re-pushed. Estimated 15-20% cache hit rate. Fourth, batch off-peak: code reviews do not need to run in real time. Queue them and process off-peak when API costs may be lower (some providers offer batch pricing). Report back within 10 minutes - acceptable for async code review. With all four optimizations, target: $8-12/day versus the $25 baseline.
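The arithmetic behind the triage and caching steps is easy to check with a small function. This is a back-of-the-envelope sketch using the per-review costs quoted in the answer; `daily_cost` is a name made up for the example, and it deliberately ignores the cost of the triage call itself and any batch discount.

```python
def daily_cost(
    prs_per_day: int,
    simple_fraction: float,
    simple_cost: float,
    complex_cost: float,
    cache_hit_rate: float = 0.0,
) -> float:
    """Expected daily spend in USD after triage routing and diff caching."""
    reviewed = prs_per_day * (1.0 - cache_hit_rate)  # cache hits cost nothing
    simple = reviewed * simple_fraction * simple_cost
    complex_ = reviewed * (1.0 - simple_fraction) * complex_cost
    return simple + complex_


# Every PR goes to the capable model.
baseline = daily_cost(500, 0.0, 0.005, 0.05)
# 60% of PRs are routed to the cheap model.
with_triage = daily_cost(500, 0.6, 0.005, 0.05)
# Triage plus a 15% diff-cache hit rate.
with_cache = daily_cost(500, 0.6, 0.005, 0.05, cache_hit_rate=0.15)
```

Under these assumptions the baseline is $25/day, triage alone brings it to about $11.50/day, and adding a 15% cache hit rate brings it to just under $10/day.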

Q: The Notion AI "fix grammar" task uses gpt-4o-mini while "improve writing" uses gpt-4o. How would you decide which model to use for a new task?

A: Model selection is a quality vs cost tradeoff that should be decided empirically, not by intuition. The decision framework has three steps. First, characterize the task: is it deterministic (grammar correction has one right answer per input) or creative (writing improvement has many acceptable outputs)? Deterministic tasks work well with cheaper models at low temperature. Creative tasks often require capable models to produce high-quality variation. Second, run a pilot experiment: generate 100 representative inputs for the task. Call both the cheap model and the capable model on each input. Have human raters (or an LLM judge calibrated against human ratings) score quality for each output. Compute cost per call for each model. Third, compute quality-adjusted cost: if the cheap model achieves 85% of the quality of the capable model at 20% of the cost, that is usually an excellent tradeoff. If the cheap model achieves 60% of the quality, it may produce too many user-visible failures. The threshold depends on the task's visibility and failure cost. For grammar correction, a 15% quality shortfall might mean occasional missed errors - acceptable. For "generate a legal contract summary," a 15% quality shortfall might mean incorrect legal interpretations - not acceptable. The decision is never "which model is better" - it is always "which model is good enough for this specific task at what cost."

Q: What are the top three patterns you would apply when designing any new LLM-powered product feature from scratch?

A: First, decompose into a pipeline, not a single prompt. Every feature that starts as "one big prompt" ends up as a pipeline once you understand the failure modes. Start with the pipeline: input validation, classification or routing, task-specific handler, output validation, fallback. This structure makes each component independently testable and replaceable. Second, build the evaluation pipeline before the product. Assemble 50-100 labeled examples from user interviews, internal testing, or synthetic data before writing a line of application code. Every product decision - model selection, prompt changes, model updates - will be validated against this dataset. Teams that skip this step spend months debugging quality issues that a 100-example evaluation dataset would have caught in minutes. Third, design the failure path before the success path. What happens when the LLM returns a malformed response? What happens when it is confidently wrong? What happens when the provider is down? The failure paths determine how gracefully the product degrades and how often users escalate to humans. Design these paths first - they constrain the success path architecture more than the success path does.

