LLM Product Architecture
The Monday Morning Architecture Review
It's 9 AM on a Monday. Your startup just got its first 10,000 sign-ups over the weekend after a viral launch. The product is a writing assistant - users type a topic, and the LLM generates a structured article outline. Simple enough. Except now you're staring at the dashboard and every metric is red. P99 latency is 42 seconds. Error rate is 8%. Cost is $3,200 for the weekend alone, triple what was projected.
The CTO wants answers. Why is it slow? Why is it failing? Why does it cost so much? You open the codebase and see the architecture: a single Flask endpoint, one call to openai.chat.completions.create(), no caching, no timeouts, no retry logic, session data stored in a Python dict in memory, and every user sharing the same system prompt regardless of their subscription tier. The entire "architecture" is 80 lines of Python.
This is not a prompt engineering problem. The prompts are fine. The model is producing good outputs when it works. The problem is that you built a demo and called it a product. A demo proves that the LLM can do the task. A product proves that the system can do the task - reliably, cheaply, observably, and safely - for every user, in every condition, every time.
The path from demo to product requires thinking clearly about what kind of product you are building, because the architecture differs fundamentally across the three patterns. A chat interface has different requirements than a document processing pipeline, which has different requirements than an autonomous coding agent. Getting the architecture wrong at the beginning means rebuilding under load - the worst possible time.
This lesson builds the mental model and the code for designing LLM-powered products that work in production.
Why This Exists
Before LLM APIs, AI features in products required ML teams, training pipelines, serving infrastructure, and months of work. The barrier was so high that most companies simply didn't have AI features. LLM APIs changed that - now a senior engineer can add AI capabilities in a day.
But the ease of calling an API creates a dangerous illusion: that the architecture is simple. It isn't. You are now integrating a non-deterministic, stateful (your application has to carry conversation history across calls), expensive, slow, and potentially unsafe external service into your product. Every one of those properties requires engineering solutions.
The industry learned this the hard way in 2023-2024 as the first wave of LLM products hit production. Teams discovered that:
- Prompt context accumulates without bound in conversational apps, so the cumulative token cost of a conversation grows roughly quadratically with its length
- Model APIs have unpredictable latency spikes - P50 is fine, P99 is brutal
- Users share the same system prompt → one bad prompt change breaks everyone
- No rate limiting → one script kiddie can cost you $10,000 in API calls
- No output validation → malformed JSON from the LLM crashes downstream systems
LLM product architecture emerged as the engineering discipline that solves these problems systematically, rather than ad-hoc.
The Three LLM Product Patterns
Every LLM-powered product fits into one of three fundamental patterns. Understanding which pattern you are building determines every architectural decision that follows.
Pattern 1: Chat Interface
A conversational product where the user and the system exchange turns of text. The LLM maintains the illusion of a coherent conversation partner.
Characteristics:
- Multi-turn: each message depends on conversation history
- Stateful: history must be stored and retrieved per session
- User-driven: the system responds to user input, not a fixed script
- Open-ended: users can ask anything within the domain
Examples: ChatGPT, Claude.ai, customer support bots, coding assistants in IDEs, documentation Q&A bots
Core engineering challenge: History management. A conversation with 50 turns can consume 20,000+ tokens, and because that history is re-sent with every new message, the cost per message at GPT-4 rates keeps rising as the conversation grows.
When to use: When the task is inherently conversational, when users need to refine, clarify, and iterate, when the domain is broad enough that a scripted flow would miss most queries.
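To make the history-management cost concrete, here is a minimal sketch of how billed input tokens add up when the full history is re-sent on every turn. The 400-tokens-per-turn figure is an illustrative assumption (20,000 tokens spread over 50 turns), and the function name is hypothetical.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed when the whole history is re-sent each turn."""
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_turn  # the new message joins the history
        total += history            # the entire history is sent as input
    return total


final_history = 50 * 400
billed = cumulative_input_tokens(turns=50, tokens_per_turn=400)
print(final_history, billed)  # 20000 510000
```

Under these assumptions the final prompt is 20,000 tokens, but the conversation as a whole has been billed for about 25 times that, which is why truncation or summarization matters early.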
Pattern 2: Workflow Automation
A pipeline where the LLM handles the ambiguous, unstructured parts of an otherwise deterministic workflow.
Characteristics:
- Pipeline-shaped: input → step 1 → step 2 → ... → output
- Mixed: some steps are deterministic code, some are LLM calls
- Bounded: the scope of each LLM call is well-defined
- Often batch-compatible: many pipelines can run asynchronously
Examples: Document processing (extract key fields from invoices), content moderation, code review, report generation, data enrichment, email classification and routing
Core engineering challenge: Reliability. If step 3 of a 7-step pipeline fails, what happens? You need retries, dead-letter queues, partial result handling, and idempotency. The LLM is one component in a larger system that must be treated with the same fault tolerance as any other external dependency.
When to use: When you have a structured workflow with a few unstructured sub-tasks, when the output needs to feed into downstream systems, when you can define success criteria for each step, when batch processing is acceptable.
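The retry and dead-letter handling described above could be sketched roughly as follows. This is a simplified, in-process illustration: `run_with_retries`, `run_pipeline`, and the plain list used as a dead-letter queue are all hypothetical names, and a real system would likely use a message broker instead.

```python
import time
from typing import Any, Callable, Optional

Step = Callable[[Any], Any]


def run_with_retries(step: Step, payload: Any, max_attempts: int = 3,
                     base_delay: float = 0.5) -> Any:
    """Call one step, retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step(payload)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the pipeline handle it
            time.sleep(base_delay * 2 ** (attempt - 1))


def run_pipeline(steps: list[Step], item: Any, dead_letters: list[dict],
                 base_delay: float = 0.5) -> Optional[Any]:
    """Run steps in order; on permanent failure, park the item and return None."""
    current = item
    for index, step in enumerate(steps):
        try:
            current = run_with_retries(step, current, base_delay=base_delay)
        except Exception as exc:
            dead_letters.append({
                "item": item,            # original input, for replay
                "failed_step": index,
                "error": repr(exc),
                "partial": current,      # last successful intermediate result
            })
            return None
    return current
```

Keeping the original item and the partial result in the dead-letter record is one way to support replay from the failed step. Replay is only safe if each step is idempotent.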
Pattern 3: Autonomous Agent
A system that takes a high-level goal, breaks it into steps, executes tools, observes results, and iterates until the goal is achieved or it determines it cannot be achieved.
Characteristics:
- Open-ended: the agent decides what steps to take
- Tool-using: the agent calls external systems (web search, code execution, APIs, databases)
- Multi-step: many LLM calls, not one
- Long-running: may take seconds to minutes per task
- Non-deterministic path: the sequence of steps is not known in advance
Examples: Devin (autonomous coding), AutoGPT-style research agents, customer support agents that can query databases and process refunds, business intelligence agents that write and execute SQL
Core engineering challenge: Reliability and cost control. Agents can get stuck in loops, take wrong paths, hallucinate tool calls, and accumulate massive context windows. An agent that makes 20 LLM calls per task at $0.10 per call costs $2.00 per task - at 10,000 tasks per day that is $20,000/day. You need step budgets, loop detection, tool call validation, and cost caps.
When to use: When the task genuinely requires open-ended exploration, when the path to the answer is not known in advance, when users are willing to wait (10+ seconds), when failure is recoverable (not safety-critical).
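One way to picture step budgets, loop detection, tool call validation, and cost caps working together is the loop below. It is a sketch under stated assumptions: `plan_next` stands in for the LLM call that picks the next action, each step reports its own cost, and all names (`AgentBudget`, `AgentRun`, `run_agent`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentBudget:
    max_steps: int = 20
    max_cost_usd: float = 2.00
    max_repeats: int = 2  # identical tool calls tolerated before stopping


@dataclass
class AgentRun:
    steps: int = 0
    cost_usd: float = 0.0
    stop_reason: str = ""
    observations: list = field(default_factory=list)


def run_agent(plan_next: Callable[[list], tuple], tools: dict,
              budget: AgentBudget) -> AgentRun:
    """plan_next(observations) -> (tool_name, arg, step_cost_usd); 'finish' ends the run."""
    run = AgentRun()
    seen: dict[tuple, int] = {}
    while True:
        if run.steps >= budget.max_steps:
            run.stop_reason = "step_budget"
            return run
        tool_name, arg, step_cost_usd = plan_next(run.observations)
        run.steps += 1
        run.cost_usd += step_cost_usd
        if run.cost_usd > budget.max_cost_usd:
            run.stop_reason = "cost_cap"
            return run
        if tool_name == "finish":
            run.stop_reason = "finished"
            return run
        if tool_name not in tools:  # hallucinated tool name
            run.stop_reason = "invalid_tool"
            return run
        seen[(tool_name, arg)] = seen.get((tool_name, arg), 0) + 1
        if seen[(tool_name, arg)] > budget.max_repeats:
            run.stop_reason = "loop_detected"
            return run
        run.observations.append(tools[tool_name](arg))
```

Every exit path records a `stop_reason`, so the caller can tell a finished task apart from one that was cut off and needs a human or a fallback.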
Decision Framework: Which Pattern to Use?
:::tip Rule of Thumb Start with the simplest pattern that meets requirements. Most products that think they need an agent actually need a workflow with a few LLM steps. Agents are expensive, slow, and hard to test - only reach for them when the task is genuinely open-ended. :::
The Full Production Architecture
This is the architecture you should build toward, regardless of which pattern you choose. Start simpler, but design your abstractions to accommodate this shape.
The Orchestration Layer
The orchestration layer is the heart of any LLM application. It is the code that turns a user request into a model call and a model response into a user-facing result. Every non-trivial LLM product needs one.
What the orchestration layer does:
- Prompt assembly - builds the full prompt from system instructions, conversation history, retrieved context, and the current user message. This is where all the complexity of context management lives.
- RAG retrieval - given the user query, embeds it, searches the vector store, fetches relevant chunks, and injects them into the prompt.
- Tool dispatch - for agentic systems, parses the model's tool call request, validates it, executes the tool (database query, API call, code execution), and returns the result to the model.
- Output parsing - extracts structured data from model responses, handles JSON parsing errors, validates schemas, and routes the output to the appropriate downstream system.
- Session management - loads conversation history from the database, manages history truncation, and persists new turns.
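As an illustration of the tool dispatch responsibility listed above, the sketch below parses a tool call, validates it against a registry, and only then executes it. The JSON shape (`name` plus `arguments`), the `TOOLS` registry, and `get_order_status` are assumptions made for this example, not any provider's actual tool-call format.

```python
import json


def get_order_status(order_id: str) -> str:
    return f"order {order_id}: shipped"  # stand-in for a real lookup


# Registry: tool name -> callable plus the exact arguments it accepts.
TOOLS = {
    "get_order_status": {"fn": get_order_status, "args": {"order_id": str}},
}


def dispatch_tool_call(raw_call: str) -> dict:
    """Validate a model-produced tool call before executing anything."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return {"ok": False, "error": "tool call is not valid JSON"}
    if not isinstance(call, dict):
        return {"ok": False, "error": "tool call must be a JSON object"}
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name!r}"}
    spec = TOOLS[name]
    args = call.get("arguments")
    if not isinstance(args, dict) or set(args) != set(spec["args"]):
        return {"ok": False, "error": "missing or unexpected arguments"}
    for key, expected_type in spec["args"].items():
        if not isinstance(args[key], expected_type):
            return {"ok": False, "error": f"wrong type for argument: {key}"}
    return {"ok": True, "result": spec["fn"](**args)}
```

Returning an error dict instead of raising lets the orchestrator feed the error back to the model as an observation, so it can correct the call on the next step.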
Frameworks for the orchestration layer:
- LangChain - the most popular, has connectors for everything, but can be opaque and hard to debug
- LlamaIndex - stronger for RAG-focused applications
- Raw API calls - more verbose but more transparent; good for teams that need full control
:::note For production systems, always understand what your framework is doing under the hood. LangChain abstractions that seem simple often make 3-5 API calls where you expect 1. :::
Session Management
A session is the context for a single user's interaction with your product. For a chat interface, a session corresponds to one conversation thread.
What a session stores:
- Session ID (UUID)
- User ID (foreign key to users table)
- Conversation history (list of {role, content} turns)
- Created timestamp and last activity timestamp
- Metadata: model used, system prompt version, feature flags
Storage options:
| Option | Use Case | Tradeoffs |
|---|---|---|
| In-memory (dict) | Development only | Lost on restart; no multi-instance support |
| Redis | Active session cache | Fast; TTL-based expiry; not durable |
| PostgreSQL | Durable session store | Slower; durable; queryable for analytics |
| Hybrid: Redis + Postgres | Production | Redis for active sessions, Postgres as source of truth |
Session expiry: Always set TTL on sessions. Sessions that haven't been active for 24-48 hours should be archived or deleted. This prevents unbounded storage growth and reduces the risk of stale context affecting responses.
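The hybrid row in the table above can be sketched as a read-through cache in front of a durable store. To keep the example self-contained, plain dicts stand in for Redis and Postgres here. The class name, the injectable clock, and the method names are all hypothetical.

```python
import copy
import time
from typing import Callable, Optional


class HybridSessionStore:
    """Cache with TTL in front of a durable store (dicts stand in for Redis/Postgres)."""

    def __init__(self, ttl_seconds: float = 86400,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.cache: dict[str, tuple[float, dict]] = {}  # session_id -> (expires_at, session)
        self.durable: dict[str, dict] = {}              # source of truth

    def save(self, session_id: str, session: dict) -> None:
        self.durable[session_id] = copy.deepcopy(session)  # write durable copy first
        self.cache[session_id] = (self.clock() + self.ttl, copy.deepcopy(session))

    def load(self, session_id: str) -> Optional[dict]:
        entry = self.cache.get(session_id)
        if entry and entry[0] > self.clock():
            return entry[1]                      # cache hit
        self.cache.pop(session_id, None)         # expired or missing
        session = self.durable.get(session_id)
        if session is not None:                  # rehydrate the cache
            self.cache[session_id] = (self.clock() + self.ttl, copy.deepcopy(session))
        return copy.deepcopy(session) if session is not None else None
```

Writing the durable copy before the cache means a lost cache entry never loses data. The cost is an extra write on every turn, which is one reason the main text suggests making the durable writes asynchronous.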
Async vs Sync: Streaming and Background Jobs
Synchronous (request-response): User sends message, waits for complete response. Simple to implement, poor UX for long responses.
Streaming (SSE or WebSocket): Response tokens are sent to the client as they are generated. Much better UX - users see progress, perceived latency drops dramatically. The first token typically arrives in 200-500ms even if the full response takes 10 seconds.
Server-Sent Events (SSE) is the simplest streaming implementation for HTTP/1.1:
- One-way: server pushes, client receives
- Automatic reconnection on disconnect
- Built into browsers, no extra library needed on the client
WebSocket is better when:
- You need bidirectional streaming
- You need very low latency (under 50ms)
- The client needs to interrupt or modify in-flight requests
Background jobs are appropriate for:
- Long-running agent tasks (30+ seconds)
- Batch processing pipelines
- Tasks where the user doesn't need to wait (e.g., "we'll email you when it's done")
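For the background-job case, a minimal in-process sketch of the submit-then-poll shape might look like this. It assumes a running asyncio event loop and keeps job state in a module-level dict. `JOBS`, `submit_job`, and `get_status` are hypothetical names, and a production system would more likely use a durable task queue, since in-process tasks are lost on restart.

```python
import asyncio
import uuid
from typing import Awaitable, Callable

JOBS: dict[str, dict] = {}


async def _run_job(job_id: str, work: Callable[[], Awaitable[str]]) -> None:
    JOBS[job_id]["status"] = "running"
    try:
        JOBS[job_id]["result"] = await work()
        JOBS[job_id]["status"] = "done"
    except Exception as exc:
        JOBS[job_id]["error"] = repr(exc)
        JOBS[job_id]["status"] = "failed"


def submit_job(work: Callable[[], Awaitable[str]]) -> str:
    """Schedule work on the running event loop and return a job ID immediately."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "queued", "result": None, "error": None}
    # Keep a reference to the task so it is not garbage-collected mid-run.
    JOBS[job_id]["task"] = asyncio.create_task(_run_job(job_id, work))
    return job_id


def get_status(job_id: str) -> dict:
    """What a polling endpoint would return for this job."""
    job = JOBS[job_id]
    return {"status": job["status"], "result": job["result"], "error": job["error"]}
```

An API handler would call `submit_job` and return the ID right away. The client then polls a status endpoint backed by `get_status`, or is notified when the status changes.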
Multi-Tenancy
In a multi-tenant LLM product, different users or organizations get different behavior from the same system. This is both a product requirement and a cost management requirement.
Dimensions of multi-tenancy:
- System prompt customization - each tenant has their own system prompt defining the assistant's persona, capabilities, and restrictions
- Model selection - paid tiers get GPT-4, free tiers get GPT-3.5 or a local model
- Rate limits - each tenant has separate rate limit buckets
- Data isolation - each tenant's conversation history is partitioned; no cross-tenant data leakage
- Feature flags - certain features (image input, tool use, long context) gated by tier
from dataclasses import dataclass

@dataclass
class TenantConfig:
    tenant_id: str
    system_prompt: str
    model: str  # "gpt-4o" | "gpt-4o-mini" | "claude-3-5-sonnet"
    max_tokens_per_request: int
    requests_per_minute: int
    max_context_tokens: int
    features: set[str]  # {"rag", "tools", "images"}
:::warning Never share system prompts across tenants. A system prompt that says "You are a financial advisor for Acme Corp" should not leak to Acme Corp's competitor. Isolate system prompts in your database, keyed by tenant ID, never hardcoded. :::
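Tier-based gating from the list above could be enforced at request time along these lines. This is a sketch: the trimmed-down `TierConfig` and the `resolve_request` helper are hypothetical, and the limits are illustrative values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierConfig:
    tenant_id: str
    model: str
    max_tokens_per_request: int
    features: frozenset = frozenset()


def resolve_request(config: TierConfig, requested_max_tokens: int,
                    requested_features: set) -> dict:
    """Reject features the tier lacks and clamp max_tokens to the tier limit."""
    denied = requested_features - config.features
    if denied:
        raise PermissionError(f"features not enabled for tenant: {sorted(denied)}")
    return {
        "model": config.model,
        "max_tokens": min(requested_max_tokens, config.max_tokens_per_request),
    }


free = TierConfig("free", "gpt-4o-mini", 1000)
pro = TierConfig("pro", "gpt-4o", 4000, frozenset({"rag", "tools"}))
print(resolve_request(pro, 8000, {"rag"}))  # {'model': 'gpt-4o', 'max_tokens': 4000}
```

The model and token limit come from the tenant configuration rather than from the client request, so a client cannot upgrade itself by editing its payload.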
API Design: REST vs WebSocket vs SSE
| Scenario | Recommended | Reason |
|---|---|---|
| Non-streaming query-response | REST POST | Simple, cacheable, stateless |
| Streaming chat response | SSE (GET with event-stream) | Simple to implement, browser-native |
| Real-time bidirectional | WebSocket | Full-duplex, needed for voice or live collaboration |
| Batch pipeline | REST POST + polling / webhooks | Async-safe, no connection held open |
Code: FastAPI LLM Service with Streaming, Session Management, Multi-Tenancy
# llm_service.py
import json
import time
import uuid
from dataclasses import dataclass, field
from typing import AsyncGenerator

import redis.asyncio as redis
from fastapi import Depends, FastAPI, HTTPException, Request
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
from pydantic import BaseModel
# ─── Configuration ───────────────────────────────────────────────────────────
@dataclass
class TenantConfig:
    tenant_id: str
    system_prompt: str
    model: str
    max_tokens_per_request: int
    requests_per_minute: int
    max_context_tokens: int
    features: set = field(default_factory=set)


# Defined after TenantConfig so the class exists when this dict is built.
TENANT_CONFIGS = {
    "free": TenantConfig(
        tenant_id="free",
        system_prompt="You are a helpful assistant. Be concise.",
        model="gpt-4o-mini",
        max_tokens_per_request=1000,
        requests_per_minute=10,
        max_context_tokens=8192,
        features=set(),
    ),
    "pro": TenantConfig(
        tenant_id="pro",
        system_prompt="You are an expert assistant with deep technical knowledge.",
        model="gpt-4o",
        max_tokens_per_request=4000,
        requests_per_minute=60,
        max_context_tokens=32768,
        features={"rag", "tools"},
    ),
}
# ─── Session Manager ─────────────────────────────────────────────────────────
class SessionManager:
    """Manages conversation sessions in Redis (the active-session cache).

    The durable Postgres write described in the text is omitted here for brevity.
    """

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        self.session_ttl = 86400  # 24 hours

    async def get_or_create_session(
        self, session_id: str, user_id: str, tenant_id: str
    ) -> dict:
        key = f"session:{session_id}"
        data = await self.redis.get(key)
        if data:
            session = json.loads(data)
            # Refresh TTL on access
            await self.redis.expire(key, self.session_ttl)
            return session
        # New session
        session = {
            "session_id": session_id,
            "user_id": user_id,
            "tenant_id": tenant_id,
            "history": [],
            "created_at": time.time(),
            "last_active": time.time(),
            "metadata": {
                "message_count": 0,
                "total_tokens": 0,
            },
        }
        await self._save_session(session)
        return session

    async def append_turn(
        self, session: dict, role: str, content: str, tokens_used: int = 0
    ) -> dict:
        session["history"].append({"role": role, "content": content})
        session["last_active"] = time.time()
        session["metadata"]["message_count"] += 1
        session["metadata"]["total_tokens"] += tokens_used
        await self._save_session(session)
        return session

    async def _save_session(self, session: dict):
        key = f"session:{session['session_id']}"
        # SETEX writes the value and resets the TTL in one command
        await self.redis.setex(key, self.session_ttl, json.dumps(session))
# ─── Rate Limiter ─────────────────────────────────────────────────────────────
class RateLimiter:
    """Sliding window rate limiter using Redis sorted sets."""

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    async def check_and_increment(
        self, tenant_id: str, user_id: str, limit: int, window_seconds: int = 60
    ) -> bool:
        key = f"rate:{tenant_id}:{user_id}"
        now = time.time()
        window_start = now - window_seconds
        pipe = self.redis.pipeline()
        # Remove old entries outside the window
        pipe.zremrangebyscore(key, 0, window_start)
        # Count entries in the current window (before adding this request)
        pipe.zcard(key)
        # Add current request. Note: rejected requests are recorded too, so a
        # client that keeps hammering the endpoint stays limited.
        pipe.zadd(key, {str(uuid.uuid4()): now})
        pipe.expire(key, window_seconds)
        results = await pipe.execute()
        current_count = results[1]
        return current_count < limit
# ─── Prompt Assembler ────────────────────────────────────────────────────────
class PromptAssembler:
    """Assembles the messages array for a model call."""

    def build(
        self,
        config: TenantConfig,
        history: list[dict],
        user_message: str,
        retrieved_context: str = "",
    ) -> list[dict]:
        messages = [{"role": "system", "content": config.system_prompt}]
        # Inject RAG context if available
        if retrieved_context and "rag" in config.features:
            context_message = (
                f"Relevant context from knowledge base:\n\n{retrieved_context}\n\n"
                f"Use the above context to answer the question if relevant."
            )
            messages.append({"role": "system", "content": context_message})
        # Add conversation history (with truncation - see Module 03)
        messages.extend(self._truncate_history(history, config.max_context_tokens))
        # Add current user message
        messages.append({"role": "user", "content": user_message})
        return messages

    def _truncate_history(
        self, history: list[dict], max_tokens: int
    ) -> list[dict]:
        # Simple approximation: 1 token ≈ 4 characters, so convert the token
        # budget to characters before comparing it against len(content).
        budget_chars = max_tokens * 0.6 * 4  # reserve 40% for system prompt + new message
        truncated = []
        char_count = 0
        for turn in reversed(history):  # walk from newest to oldest
            turn_chars = len(turn["content"])
            if char_count + turn_chars > budget_chars:
                break
            truncated.append(turn)
            char_count += turn_chars
        truncated.reverse()  # restore chronological order
        return truncated
# ─── LLM Service ─────────────────────────────────────────────────────────────
class LLMService:
    def __init__(self):
        # Explicit timeout so a slow provider cannot hang the request indefinitely
        self.client = AsyncOpenAI(timeout=30.0)
        self.assembler = PromptAssembler()

    async def stream_completion(
        self,
        messages: list[dict],
        model: str,
        max_tokens: int,
    ) -> AsyncGenerator[str, None]:
        """Stream tokens as SSE-compatible chunks."""
        stream = await self.client.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
            stream=True,
        )
        async for chunk in stream:
            if not chunk.choices:
                continue  # some chunks carry no choices; skip them
            delta = chunk.choices[0].delta
            if delta.content:
                yield f"data: {json.dumps({'token': delta.content})}\n\n"
        yield "data: [DONE]\n\n"
# ─── FastAPI Application ──────────────────────────────────────────────────────
app = FastAPI()

# One shared client (with its connection pool) instead of a new one per request
_redis_client = redis.from_url("redis://localhost:6379", decode_responses=True)


class ChatRequest(BaseModel):
    message: str
    session_id: str | None = None


async def get_redis() -> redis.Redis:
    return _redis_client


async def get_current_user(request: Request) -> dict:
    """Extract user context from JWT - simplified for illustration."""
    token = request.headers.get("Authorization", "").removeprefix("Bearer ")
    # In production: validate JWT, extract user_id and tenant_id
    return {"user_id": "user-123", "tenant_id": "pro"}


@app.post("/chat")
async def chat_endpoint(
    req: ChatRequest,
    user: dict = Depends(get_current_user),
    redis_client: redis.Redis = Depends(get_redis),
):
    config = TENANT_CONFIGS.get(user["tenant_id"])
    if not config:
        raise HTTPException(status_code=400, detail="Unknown tenant")

    # Rate limiting
    limiter = RateLimiter(redis_client)
    allowed = await limiter.check_and_increment(
        user["tenant_id"], user["user_id"], config.requests_per_minute
    )
    if not allowed:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")

    # Session management
    session_manager = SessionManager(redis_client)
    session_id = req.session_id or str(uuid.uuid4())
    session = await session_manager.get_or_create_session(
        session_id, user["user_id"], user["tenant_id"]
    )

    # Prompt assembly (built before the user turn is saved, so the new
    # message is not included twice)
    llm_service = LLMService()
    messages = llm_service.assembler.build(
        config=config,
        history=session["history"],
        user_message=req.message,
    )

    # Save user turn before streaming
    await session_manager.append_turn(session, "user", req.message)

    # Collect full response for session storage
    full_response = []

    async def response_generator():
        async for chunk in llm_service.stream_completion(
            messages, config.model, config.max_tokens_per_request
        ):
            if chunk != "data: [DONE]\n\n":
                data = json.loads(chunk[6:])  # strip "data: "
                full_response.append(data.get("token", ""))
            yield chunk
        # After streaming completes, save assistant turn
        assistant_message = "".join(full_response)
        await session_manager.append_turn(session, "assistant", assistant_message)

    return StreamingResponse(
        response_generator(),
        media_type="text/event-stream",
        headers={
            "X-Session-Id": session_id,
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
        },
    )
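On the consuming side, a client has to turn the stream emitted by the service above back into text. The helper below is a minimal sketch of that step, assuming the same wire format as the service (`data: {"token": ...}` events, then `data: [DONE]`). The `iter_tokens` name is hypothetical, and the HTTP transport is deliberately left out.

```python
import json
from typing import Iterable, Iterator


def iter_tokens(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield tokens from SSE lines until the [DONE] sentinel is seen."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # blank separator lines and anything that is not a data field
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield json.loads(payload)["token"]


lines = [
    'data: {"token": "Hel"}', "",
    'data: {"token": "lo"}', "",
    "data: [DONE]", "",
]
print("".join(iter_tokens(lines)))  # Hello
```

Because the endpoint is a POST, a browser client would typically read the response body as a stream and feed its lines to logic like this, rather than using `EventSource`, which only issues GET requests.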
Production: Service Dependencies and Failure Modes
Understanding failure modes for each component is essential for building reliable systems.
| Component | Failure Mode | Impact | Mitigation |
|---|---|---|---|
| LLM API | Timeout (P99 can be 30s+) | User sees hang | Set aggressive timeouts (10-15s), show progress indicator |
| LLM API | Rate limit (429) | Feature unavailable | Exponential backoff, fallback to cheaper model |
| LLM API | Model degradation | Bad outputs | Output validation, fallback responses |
| Redis | Connection lost | Session data lost | Reconnection retry, fallback to Postgres |
| Vector DB | Slow retrieval | High latency | Cache embeddings, set retrieval timeout (2s max) |
| Postgres | Slow query | High latency | Index on session_id + user_id, async writes |
| Rate limiter | Redis failure | No rate limiting | Fail open (allow requests) or fail closed (block) |
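The "exponential backoff, fallback to cheaper model" mitigation from the table could be sketched as follows. `RateLimited` is a stand-in exception for a provider 429, `call_model` is a hypothetical callable wrapping the real API call, and the delay values are illustrative.

```python
import random
import time
from typing import Callable


class RateLimited(Exception):
    """Stand-in for a provider rate-limit (HTTP 429) error."""


def call_with_backoff(
    call_model: Callable[[str], str],
    models: list[str],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Try each model in order; back off exponentially (with jitter) on 429s."""
    for model in models:
        for attempt in range(max_attempts):
            try:
                return model, call_model(model)
            except RateLimited:
                if attempt < max_attempts - 1:  # no point sleeping after the last try
                    delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
                    sleep(delay)
    raise RateLimited("all models exhausted")
```

The function returns which model actually answered, so the fallback rate can be logged and alerted on. The injectable `sleep` is there only to make the behaviour easy to test without waiting.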
:::danger The most dangerous production failure mode is the LLM API producing malformed or semantically wrong output silently. Silent failures are worse than loud ones because they corrupt data and degrade user experience without triggering any alerts. Always validate model outputs before using them. :::
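A minimal sketch of validating a model output before it reaches downstream code is shown below, using an intent-classification response as the example. The expected shape (`intent` plus `confidence`) and the allowed intent labels are assumptions made for illustration.

```python
import json

ALLOWED_INTENTS = {"billing", "technical", "return", "general"}


def parse_intent(raw: str) -> dict:
    """Return a validated {'intent', 'confidence'} dict or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("model output must be a JSON object")

    intent = data.get("intent")
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        raise ValueError(f"unexpected intent: {intent!r}")

    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")

    return {"intent": intent, "confidence": float(confidence)}
```

Raising on invalid output turns a silent failure into a loud one: the caller can retry, fall back to a default route, or escalate to a human, and the error can be counted in monitoring.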
:::warning Unbounded session-history growth is not theoretical - it is the most common cause of unexpected cost spikes in production chat applications. A user who has been chatting for 2 hours has accumulated thousands of tokens of history. If you send that full history on every turn, you are paying again for tokens you already paid for. Implement truncation from day one. :::
Common Mistakes
:::danger Mistake: Storing conversation history in application memory If your LLM service has 3 replicas, and a user's requests are load-balanced across them, each replica has a different view of the conversation history. The user sees incoherent, context-less responses. Always store session state in Redis or the database, not in application memory. :::
:::danger Mistake: Using one model for all tasks If your product has 10 different features (summarize, classify, generate, extract, answer Q&A...) and you use GPT-4 for all of them, you are paying GPT-4 prices for tasks that GPT-3.5 or a fine-tuned small model could handle at 1/10th the cost with equivalent quality. Model tiering is one of the highest-ROI cost optimizations. :::
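Model tiering can start as nothing more than a lookup table from task type to model. The mapping below is purely illustrative: the task names and the choice of model per task are assumptions, and the right split depends on measured quality for each task.

```python
# Hypothetical task -> model routing table; tune per task after evaluation.
MODEL_BY_TASK = {
    "classify": "gpt-4o-mini",
    "extract": "gpt-4o-mini",
    "summarize": "gpt-4o-mini",
    "generate": "gpt-4o",
}
DEFAULT_MODEL = "gpt-4o"  # unknown tasks fall back to the stronger model


def pick_model(task: str) -> str:
    """Route a task to the cheapest model believed to handle it well."""
    return MODEL_BY_TASK.get(task, DEFAULT_MODEL)


print(pick_model("classify"), pick_model("write_contract"))  # gpt-4o-mini gpt-4o
```

Defaulting unknown tasks to the stronger model trades cost for safety. The opposite default is reasonable when cost control matters more than quality on unclassified traffic.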
:::warning Mistake: No request timeout on LLM API calls LLM APIs can take 30-60 seconds on P99 in high-load conditions. If you don't set a timeout, your API server threads/connections will hang, and under load you will experience a cascade failure where all threads are waiting on the LLM. Always set a timeout (10-15 seconds for non-streaming, 30 seconds for streaming) and handle the timeout gracefully. :::
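A request-level timeout with graceful handling might be sketched with `asyncio.wait_for` as below. `make_call` is a hypothetical zero-argument coroutine function wrapping the actual LLM request, and the fallback message is a placeholder.

```python
import asyncio
from typing import Awaitable, Callable


async def call_with_timeout(
    make_call: Callable[[], Awaitable[str]],
    timeout_s: float = 15.0,
    fallback: str = "The assistant is taking too long. Please try again.",
) -> str:
    """Await the LLM call, but give up (and cancel it) after timeout_s seconds."""
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the underlying call at this point
        return fallback
```

Returning a fallback keeps the worker free for other requests. Depending on the product, the timeout branch could instead retry against a faster model or enqueue the request as a background job.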
:::warning Mistake: Choosing the agent pattern when workflow automation suffices Autonomous agents are exciting but expensive, slow, and unpredictable. Many problems that feel like they need an agent - "browse the web and summarize these 5 pages," "query this database and generate a report" - are actually deterministic workflows with one or two LLM steps. Before choosing the agent pattern, try modeling the task as a fixed workflow. If the workflow covers 80%+ of cases, ship the workflow. :::
Interview Q&A
Q1: A company wants to build an LLM-powered customer support product. Walk me through the architecture decisions you would make.
Start by clarifying the product requirements: Does it need to resolve tickets autonomously, or assist human agents? How many concurrent users? What is the latency requirement? What is the acceptable cost per ticket?
For autonomous customer support, I would use Pattern 2 (workflow automation), not Pattern 3 (agent). The workflow: (1) classify intent with a small, fast model - billing issue, technical issue, return request, general inquiry; (2) route to the appropriate sub-workflow; (3) for RAG-answerable queries, retrieve relevant documentation and generate a response; (4) for action-required queries (refunds, account changes), verify user identity and call backend APIs; (5) if confidence is low or the query is out-of-scope, route to a human agent.
The architecture: API Gateway → Auth middleware → Rate limiter → Orchestration service → Intent classifier (small model) → RAG pipeline (for documentation queries) or Action executor (for account operations) → LLM for response generation → Output validation → Response. Redis for session state, Postgres for ticket history and audit logs.
Q2: How do you handle the multi-tenancy problem in an LLM product - where different customers need different model behavior?
Multi-tenancy in LLM systems has three layers: configuration, data isolation, and cost isolation.
Configuration: Store each tenant's system prompt, model selection, and feature flags in the database. Load the tenant config at the start of each request using the tenant ID from the JWT. Never hardcode system prompts.
Data isolation: Each tenant's conversation history is stored with a tenant_id partition key. Application-level access control ensures queries filter by tenant_id. For high-security tenants (regulated industries), consider separate database schemas or even separate deployments.
Cost isolation: Track token usage per tenant in your observability system. Set per-tenant budget limits enforced at the gateway layer. Alert when a tenant approaches their quota.
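Per-tenant budget enforcement can be sketched as a small tracker that the gateway consults on each request. The in-memory counters, the 80% alert threshold, and the class and method names are assumptions; a real deployment would keep the counters in a shared store.

```python
from collections import defaultdict


class TenantBudget:
    """Tracks token usage per tenant against a fixed quota."""

    def __init__(self, limits: dict[str, int], alert_ratio: float = 0.8):
        self.limits = limits
        self.alert_ratio = alert_ratio
        self.used: dict[str, int] = defaultdict(int)

    def record(self, tenant_id: str, tokens: int) -> str:
        """Return 'ok', 'alert' (near quota), or 'blocked' (would exceed quota)."""
        limit = self.limits[tenant_id]
        if self.used[tenant_id] + tokens > limit:
            return "blocked"  # usage is not recorded for blocked requests
        self.used[tenant_id] += tokens
        if self.used[tenant_id] >= self.alert_ratio * limit:
            return "alert"
        return "ok"


budget = TenantBudget({"acme": 1000})
print(budget.record("acme", 500), budget.record("acme", 300), budget.record("acme", 300))
# ok alert blocked
```

Because output length is not known before a call completes, a gateway would usually check against an estimate (for example the request's `max_tokens`) and reconcile with actual usage afterwards.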
Q3: What is the orchestration layer and why does it need to be a separate service?
The orchestration layer is the code that turns a user request into an LLM call and a model response into a product action. It handles prompt assembly, RAG retrieval, tool dispatch, and output parsing.
Whether it should be a separate service depends on scale. At low scale (under 100 req/s), you can embed orchestration in your main API service. As you scale, you want to separate it because: (1) it is the component that evolves most rapidly as you tune prompts and add features; (2) LLM calls are slow and you don't want them blocking your main API thread pool; (3) you may want to run orchestration asynchronously for non-real-time tasks; (4) observability is cleaner when all LLM traffic flows through one service.
Q4: When would you use Server-Sent Events vs WebSockets for streaming LLM responses?
SSE is almost always the right choice for streaming LLM responses. SSE is HTTP/1.1 compatible, works through proxies and CDNs, supports automatic reconnection, is natively supported by browsers with EventSource, and requires no special server configuration. The one-directional nature of SSE is not a limitation for LLM streaming - the user sends a request, the server streams the response.
WebSocket is better for voice interfaces (where the client is also streaming audio to the server), for collaborative features (multiple users in the same session), or for low-latency (sub-50ms) bidirectional communication. WebSocket requires maintaining a persistent connection, which increases server-side resource consumption and complicates load balancing.
Q5: You are seeing P99 latency of 40 seconds on your LLM product. Walk me through how you would diagnose and fix this.
First, decompose the latency. Add timing instrumentation to measure: (a) time to first LLM API call (includes auth, rate limit check, prompt assembly, RAG retrieval); (b) time to first token from LLM (prefill time - scales with input prompt length); (c) time from first to last token (decode time - scales with output length × TPOT, the time per output token); (d) post-processing time. P99 of 40s almost always indicates one of: very long prompts (bloated history or large RAG context), very long outputs (no max_tokens limit), or LLM API overload.
Fixes in priority order: (1) Reduce output length - set max_tokens appropriate to the task; (2) Reduce prompt length - implement history truncation, compress RAG context; (3) Enable streaming - moves perceived P99 from "time to complete response" to "time to first token" (~500ms), which transforms the UX even if total latency is unchanged; (4) Add a fallback model - if the primary model times out after 10s, retry with a faster model; (5) Check if you are hitting rate limits that cause queuing delays on the provider side.
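The latency decomposition described in this answer can be instrumented with a small timing helper like the one below. It is a sketch: `LatencyTrace` and the stage names are hypothetical, and in production the measurements would be exported to a tracing or metrics system rather than kept in a dict.

```python
import time
from contextlib import contextmanager
from typing import Callable


class LatencyTrace:
    """Accumulates wall-clock time per named stage of a request."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter):
        self.clock = clock
        self.stages: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = self.clock()
        try:
            yield
        finally:
            # Record the time even if the stage raised
            self.stages[name] = self.stages.get(name, 0.0) + (self.clock() - start)

    def slowest(self) -> str:
        return max(self.stages, key=self.stages.get)
```

Usage follows the four measurements above: wrap the pre-LLM work, the wait for the first token, the decode loop, and post-processing each in `with trace.stage("...")`, then log `trace.stages` per request so that P99 can be broken down by stage.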
