Large Language Model Systems

Deploying Llama-3-70B for 100,000 Daily Active Users

The decision to self-host a 70B parameter model instead of using the OpenAI API was straightforward on a spreadsheet. At 100K daily active users, each generating an average of 10 API calls per day with 2,000 input tokens and 500 output tokens, the OpenAI API cost at $30/million output tokens would be $1.5 million per month. Running Llama-3-70B on 8x A100 80GB GPUs, amortized, costs roughly $15,000 per month - a 100x cost reduction.

Then reality arrived. The inference team's first attempt - wrapping Llama in a Python FastAPI server and serving with Hugging Face Transformers - could handle 3 requests per second. They needed 500. Memory consumption was 140GB of VRAM at batch size 1 - their cluster had 8x A100s (640GB total), which should be enough, but memory fragmentation caused OOM errors at batch size 4. Output latency was 45 seconds per response - users were abandoning requests.

Three months later, after implementing tensor parallelism, continuous batching, PagedAttention, and speculative decoding, the system handles 2,000 tokens per second of output throughput across 500 concurrent requests, with p99 time-to-first-token under 2 seconds and p99 total latency under 8 seconds for a 500-token response. This case study covers every step of that journey.

Requirements Analysis

Functional requirements:

Serve Llama-3-70B for chat, completion, and RAG-augmented question answering
Support streaming responses (tokens appear incrementally)
Support LoRA fine-tuned variants (custom assistant persona, domain-specific models)
Integrate with knowledge base via RAG pipeline
Provide usage metering per user/tenant

Non-functional requirements:

Throughput: 2,000 tokens/second sustained output throughput
Latency: time-to-first-token under 2 seconds, p99 total latency under 8 seconds for 500 tokens
Availability: 99.9% uptime
Cost: under $20,000/month for 100K DAU
Scale: handle 10x traffic spikes (product launches, marketing events)

Hardware baseline:

Llama-3-70B in BF16: 70B parameters × 2 bytes = 140GB
Requires at least 2x A100 80GB GPUs (160GB) just to hold the model weights, which leaves almost nothing for KV cache
KV cache requires additional VRAM - at batch size 100, 2048 context: approximately 50-100GB additional
Recommended: 8x A100 80GB (640GB total)

LLM Inference Fundamentals

Why LLM Inference is Hard

LLM inference has two phases with fundamentally different compute characteristics:

Prefill phase: Process all input tokens in parallel. Highly compute-bound. One forward pass through the full network for all input tokens simultaneously. Takes 0.5-2 seconds for long prompts.

Decode phase: Generate output tokens one at a time, autoregressively. Each forward pass generates exactly one token. Memory-bandwidth bound - moving the model weights from HBM to compute cores is the bottleneck, not compute itself.
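
The memory-bandwidth bound on decode can be estimated with simple arithmetic. The sketch below is a rough lower bound only: the 2 TB/s HBM bandwidth per GPU is an assumed round figure (not from this case study), and KV cache reads and inter-GPU communication are ignored.

```python
# Lower bound on decode step time from memory bandwidth alone (illustrative).
WEIGHT_BYTES = 140e9        # Llama-3-70B in BF16, from the hardware baseline
NUM_GPUS = 8                # tensor-parallel shards, each streaming 1/8 of the weights
HBM_BYTES_PER_SEC = 2.0e12  # ASSUMPTION: ~2 TB/s of HBM bandwidth per A100 80GB


def decode_step_floor_seconds(weight_bytes: float, num_gpus: int, bandwidth: float) -> float:
    """Time for every GPU to read its weight shard once from HBM."""
    return (weight_bytes / num_gpus) / bandwidth


step = decode_step_floor_seconds(WEIGHT_BYTES, NUM_GPUS, HBM_BYTES_PER_SEC)
print(f"decode step floor: {step * 1e3:.2f} ms")
print(f"single-sequence ceiling: {1 / step:.0f} tokens/second")
```

Under these assumptions a single sequence cannot decode faster than roughly a hundred tokens per second, no matter how much compute is available.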

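This asymmetry shapes how decode has to be batched. Each decode step re-reads the full model weights regardless of batch size, so a step for 100 requests takes nearly as long as a step for 1 request, and batching is where the throughput comes from. The difficulty is that you cannot batch decode requests the way you batch training: requests arrive at different times, sit at different decode positions, and each requires its own KV cache reads, so a static batch leaves slots idle until its longest sequence finishes.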

The key innovations in LLM serving (vLLM, TGI) are all about managing this constraint.

The KV Cache Problem

During decode, each attention layer must attend to all previous tokens. The key-value activations for previous tokens are cached in the KV cache to avoid recomputation:

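$\text{KV cache size} = 2 \times n_{\text{layers}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \times \text{seq\_len} \times \text{bytes\_per\_element}$

The leading 2 covers keys and values, and $n_{\text{kv\_heads}}$ counts key/value heads, which under grouped-query attention (GQA) is smaller than the number of query heads. For Llama-3-70B (80 layers, 8 KV heads, head dimension 128, in FP16):

$\text{KV cache per token} = 2 \times 80 \times 8 \times 128 \times 2 = 327{,}680 \text{ bytes} \approx 320\text{KB per token}$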

For a sequence of 2,048 tokens: 320KB × 2048 = 640MB per sequence. At batch size 100: 64GB just for KV cache.

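This is the KV cache memory explosion problem. With 640GB total VRAM (8x A100 80GB), model weights (140GB) and KV cache (64GB at batch 100) already claim about 204GB, and the remainder shrinks fast: the same batch at 8,192-token context needs roughly 256GB of KV cache, so longer sequences or larger batches quickly exhaust the buffer.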
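
The sizing arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch of the formula, using the Llama-3-70B values quoted in the text; the function name is illustrative.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, d_head: int,
                             bytes_per_element: int) -> int:
    """Bytes of KV cache one token occupies; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_element


per_token = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, d_head=128,
                                     bytes_per_element=2)
per_sequence = per_token * 2048   # one 2,048-token sequence
per_batch = per_sequence * 100    # 100 such sequences in flight

print(f"per token:    {per_token / 1024:.0f} KiB")
print(f"per sequence: {per_sequence / 1024**2:.0f} MiB")
print(f"per batch:    {per_batch / 1024**3:.1f} GiB")
```

Changing `bytes_per_element` to 1 shows the effect of an 8-bit KV cache, and raising the sequence length shows how quickly long contexts consume the VRAM left after weights.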

System Architecture

Component 1: vLLM with PagedAttention

vLLM's core innovation is PagedAttention - managing the KV cache like an operating system manages virtual memory.

Instead of pre-allocating a contiguous block of VRAM for each sequence's KV cache (which wastes memory for sequences that don't reach their maximum length), PagedAttention allocates KV cache in small fixed-size blocks (16 tokens per block). A logical sequence's KV cache is stored in non-contiguous physical blocks, with a page table mapping logical positions to physical blocks.

Benefits:

No wasted allocation: sequences only use the VRAM they actually need
KV cache sharing: sequences sharing a prefix (common system prompts) share the same physical KV cache blocks - significant savings for multi-tenant deployments
Continuous batching: the scheduler can pack many short sequences into the same GPU pass, dramatically improving throughput
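
The page-table idea can be illustrated with a toy allocator. This is a simplified sketch for intuition, not vLLM's actual block manager: the class and method names are hypothetical, and it tracks only block bookkeeping, not tensors.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 16  # tokens per KV block, as in the text


class BlockAllocator:
    """Toy page table: maps each sequence to a list of physical block IDs."""

    def __init__(self, num_physical_blocks: int):
        self.free_blocks: List[int] = list(range(num_physical_blocks))
        self.block_tables: Dict[str, List[int]] = {}
        self.seq_lens: Dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Account for one more token, allocating a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * BLOCK_SIZE:  # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; scheduler must queue or preempt")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def physical_slot(self, seq_id: str, position: int) -> Tuple[int, int]:
        """Translate a logical token position to (physical block, offset)."""
        block = self.block_tables[seq_id][position // BLOCK_SIZE]
        return block, position % BLOCK_SIZE

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = BlockAllocator(num_physical_blocks=8)
for _ in range(17):          # 17 tokens need two 16-token blocks
    alloc.append_token("a")
for _ in range(3):           # 3 tokens need one block
    alloc.append_token("b")
print(alloc.block_tables, len(alloc.free_blocks))
```

A 17-token sequence holds two blocks rather than a reservation for the maximum length, and freed blocks are immediately reusable by any other sequence because every block has the same size.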

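import uuid
from typing import AsyncGenerator, Dict, Optional

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams
from vllm.lora.request import LoRARequest


class ProductionLLMServer:
    """
    Production LLM serving wrapper around vLLM's async engine.
    Handles: multi-GPU, LoRA adapters, streaming, request queuing.
    """

    def __init__(
        self,
        model_path: str = "meta-llama/Meta-Llama-3-70B-Instruct",
        tensor_parallel_size: int = 8,  # number of GPUs
        gpu_memory_utilization: float = 0.90,  # fraction of each GPU's VRAM vLLM may use (weights + KV cache)
        max_model_len: int = 8192,
        enable_lora: bool = True,
        max_lora_rank: int = 64,
    ):
        # The offline LLM class has no streaming generate method; the async
        # engine yields partial outputs while a request is still decoding.
        engine_args = AsyncEngineArgs(
            model=model_path,
            tensor_parallel_size=tensor_parallel_size,
            gpu_memory_utilization=gpu_memory_utilization,
            max_model_len=max_model_len,
            enable_lora=enable_lora,
            max_lora_rank=max_lora_rank,
            # Speculative decoding configuration. These two argument names are
            # version-dependent (newer vLLM releases take a speculative_config
            # dict instead), so check them against the release you deploy.
            speculative_model="meta-llama/Meta-Llama-3-8B-Instruct",
            num_speculative_tokens=5,
            # Prefix caching for shared system prompts
            enable_prefix_caching=True,
        )
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)
        # Stable adapter name -> positive integer ID. hash() is salted per
        # process and can collide or return 0, which is not a valid LoRA ID.
        self._lora_ids: Dict[str, int] = {}
        self._in_flight = 0
        self._tokens_streamed = 0

    async def generate_stream(
        self,
        prompt: str,
        max_tokens: int = 512,
        temperature: float = 0.7,
        lora_adapter_id: Optional[str] = None,
    ) -> AsyncGenerator[str, None]:
        """Stream newly generated text as it is produced."""
        sampling_params = SamplingParams(
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=0.9,
            stop=["</s>", "<|eot_id|>"],
        )

        lora_request = None
        if lora_adapter_id:
            lora_int_id = self._lora_ids.setdefault(lora_adapter_id, len(self._lora_ids) + 1)
            # Positional arguments: adapter name, integer ID, adapter directory.
            lora_request = LoRARequest(lora_adapter_id, lora_int_id, f"/adapters/{lora_adapter_id}")

        emitted_chars = 0
        emitted_tokens = 0
        self._in_flight += 1
        try:
            async for output in self.engine.generate(
                prompt,
                sampling_params,
                request_id=str(uuid.uuid4()),
                lora_request=lora_request,
            ):
                if not output.outputs:
                    continue
                completion = output.outputs[0]
                # Each update carries the cumulative text; yield only the new part.
                delta = completion.text[emitted_chars:]
                emitted_chars = len(completion.text)
                self._tokens_streamed += len(completion.token_ids) - emitted_tokens
                emitted_tokens = len(completion.token_ids)
                if delta:
                    yield delta
        finally:
            self._in_flight -= 1

    def get_metrics(self) -> dict:
        """Return wrapper-level serving metrics.

        Scheduler and KV cache gauges (running/waiting requests, cache usage)
        are exported through vLLM's Prometheus metrics rather than through
        stable Python attributes, so they are not read here.
        """
        return {
            "num_in_flight": self._in_flight,
            "tokens_streamed_total": self._tokens_streamed,
            "num_lora_adapters_seen": len(self._lora_ids),
        }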

Component 2: Tensor Parallelism

Llama-3-70B's 140GB weight requirement exceeds the 80GB VRAM of a single A100. Tensor parallelism splits the model across multiple GPUs - each GPU holds a slice of each attention layer and FFN layer.

Each GPU holds 1/N of each transformer layer's weights. After each layer's computation, the partial results are summed across GPUs via AllReduce (NVLink for intra-node, InfiniBand for inter-node). For 8-GPU tensor parallelism with NVLink, the AllReduce overhead is 3-5ms per layer - for 80 layers, 240-400ms added to prefill latency. This is why scaling tensor parallelism beyond the 8 GPUs of a single node often does not improve throughput: communication overhead grows faster than the compute saved by splitting further.
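
The split-then-sum pattern can be checked on a tiny feed-forward block in plain Python. This is a minimal sketch assuming the common column-parallel then row-parallel layout (an assumption, not something this case study specifies); `all_reduce_sum` is a stand-in for the real collective, and all names are illustrative.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m: Matrix) -> Matrix:
    return [[max(0.0, v) for v in row] for row in m]


def split_columns(m: Matrix, parts: int) -> List[Matrix]:
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m: Matrix, parts: int) -> List[Matrix]:
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials: List[Matrix]) -> Matrix:
    """Stand-in for AllReduce: element-wise sum of every shard's partial output."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def ffn_reference(x: Matrix, w_up: Matrix, w_down: Matrix) -> Matrix:
    return matmul(relu(matmul(x, w_up)), w_down)


def ffn_tensor_parallel(x: Matrix, w_up: Matrix, w_down: Matrix, num_gpus: int) -> Matrix:
    up_shards = split_columns(w_up, num_gpus)    # each "GPU" gets a column slice
    down_shards = split_rows(w_down, num_gpus)   # and the matching row slice
    partials = [matmul(relu(matmul(x, up)), down)
                for up, down in zip(up_shards, down_shards)]
    return all_reduce_sum(partials)              # one AllReduce per block


x = [[1.0, -2.0]]
w_up = [[1.0, 0.0, -1.0, 2.0], [0.0, 1.0, 1.0, -1.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [1.0, 1.0]]
print(ffn_tensor_parallel(x, w_up, w_down, num_gpus=2))
```

Each shard works on its own slice with no communication until the final sum, which is why the AllReduce cost is paid once per block rather than once per matrix multiply.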

Component 3: Speculative Decoding

Speculative decoding dramatically improves decode throughput by using a small "draft" model to propose multiple tokens, then verifying them in parallel with the large "target" model.

How it works:

The draft model (e.g., Llama-3-8B) generates $k$ speculative tokens in $k$ sequential forward passes
The target model (Llama-3-70B) verifies all $k$ tokens in a single forward pass
Accepted tokens are kept; the first rejected token is replaced with the target model's correct prediction
If all $k$ tokens are accepted, the system gains $k$ tokens for the cost of approximately $1$ large model forward pass
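
The steps above can be sketched with toy next-token functions. This illustrates only the greedy variant (sampled decoding uses a probabilistic accept/reject rule instead of exact matching), and it calls the target once per position for clarity where a real engine scores all positions in one batched pass. The toy models are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Tokens produced by one draft-then-verify round (greedy variant)."""
    # 1. Draft k tokens sequentially with the cheap model.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. Verify in order; stop at the first disagreement.
    accepted: List[int] = []
    for token in proposed:
        expected = target(context + accepted)
        if token != expected:
            accepted.append(expected)  # replace the mismatch, discard the rest
            return accepted
        accepted.append(token)

    # 3. Everything matched: the verification pass also yields one more token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             num_tokens: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < num_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[len(prompt):len(prompt) + num_tokens]


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10            # counts upward modulo 10


def toy_draft(ctx: List[int]) -> int:
    return 9 if ctx[-1] == 2 else (ctx[-1] + 1) % 10   # wrong after a 2


print(generate(toy_target, toy_draft, prompt=[0], num_tokens=12))
```

The output is identical to what the target model would produce alone; the draft model only changes how many target passes are needed to get there.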

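The speedup depends on the acceptance rate. Because verification stops at the first rejected token, the yield is not simply $k$ times the acceptance rate: if each draft token is accepted independently with probability $\alpha$, the expected number of tokens per large-model forward pass is $\frac{1 - \alpha^{k+1}}{1 - \alpha}$, counting the one token the target model always contributes itself. With $\alpha = 0.7$ and $k = 5$, that is roughly 2.9 tokens per large-model forward pass instead of 1.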
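
Because verification stops at the first rejected token, the expected yield per target pass is a truncated geometric sum. The sketch below assumes every draft token is accepted independently with the same probability, which is a simplification; real acceptance rates vary by position and content.

```python
def expected_tokens_per_target_pass(acceptance_rate: float, k: int) -> float:
    """Expected tokens emitted per large-model pass with k drafted tokens.

    Sums acceptance_rate**i for i = 0..k: the target always contributes one
    token, and draft token i survives only if all earlier drafts were accepted.
    """
    if acceptance_rate >= 1.0:
        return float(k + 1)
    return (1 - acceptance_rate ** (k + 1)) / (1 - acceptance_rate)


for rate in (0.5, 0.7, 0.9):
    print(f"acceptance={rate:.1f}, k=5 -> "
          f"{expected_tokens_per_target_pass(rate, 5):.2f} tokens per pass")
```

The curve flattens as $k$ grows, so drafting more tokens helps less and less unless the acceptance rate is high. The draft model's own cost, ignored here, reduces the net gain further.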

Speculative decoding works best when:

The draft model is 10-20x smaller than the target model (similar architecture family)
The generation temperature is low to moderate (up to about 0.8) - at temperature 0, greedy decoding from draft and target match frequently; at very high temperature, predictions diverge and acceptance rates drop
Output is predictable (common phrases, structured formats)

Component 4: LoRA Adapter Serving

For multi-tenant applications (different customers getting different model personalities or domain-specific capabilities), serving separate fine-tuned models is prohibitively expensive. LoRA (Low-Rank Adaptation) lets you store only the delta weights for each variant.

LoRA adds trainable rank-decomposition matrices to each transformer layer:

$W = W_0 + \Delta W = W_0 + BA$

where $W_0$ is the frozen base weight, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the LoRA matrices, and $r \ll \min(d, k)$ is the rank.

For Llama-3-70B with rank $r=64$:

Base model size: 140GB
LoRA adapter size: approximately 400MB per variant
You can serve 100+ LoRA variants on the same GPU cluster
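
Adapter size follows directly from the shapes of $A$ and $B$. The sketch below is an estimate under stated assumptions that are not from this case study: a hidden size of 8192, adapters applied only to the four attention projections, and BF16 storage.

```python
N_LAYERS = 80
HIDDEN = 8192        # ASSUMPTION: Llama-3-70B hidden size
KV_WIDTH = 8 * 128   # 8 KV heads x head dimension 128
RANK = 64
BYTES_PER_PARAM = 2  # BF16

# ASSUMPTION: adapters on the attention projections only, as (d_in, d_out).
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_WIDTH),
    "v_proj": (HIDDEN, KV_WIDTH),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one adapter pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


params_per_layer = sum(lora_params(d_in, d_out, RANK)
                       for d_in, d_out in PROJECTIONS.values())
total_params = params_per_layer * N_LAYERS
adapter_bytes = total_params * BYTES_PER_PARAM

print(f"{total_params / 1e6:.0f}M parameters, {adapter_bytes / 1e6:.0f} MB per adapter")
```

Under these assumptions the estimate lands in the same few-hundred-megabyte range as the figure above. The exact size depends on which matrices are adapted, and adding the FFN projections would increase it.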

vLLM supports hot-loading LoRA adapters - a request for tenant A loads adapter A; a request for tenant B loads adapter B - without reloading the base model.

Component 5: RAG Integration

For knowledge-intensive use cases, integrate the LLM with a vector knowledge base:

from typing import AsyncGenerator, Dict, List
import asyncio


class RAGAugmentedLLM:
    """
    Retrieval-Augmented LLM: retrieves relevant documents before generation.
    """

    def __init__(
        self,
        llm_server: ProductionLLMServer,
        vector_store,
        reranker,
        system_prompt: str,
        max_context_tokens: int = 3000,
        retrieval_k: int = 20,
        rerank_k: int = 4,
    ):
        self.llm = llm_server
        self.vector_store = vector_store
        self.reranker = reranker
        self.system_prompt = system_prompt
        self.max_context_tokens = max_context_tokens
        self.retrieval_k = retrieval_k
        self.rerank_k = rerank_k

    async def generate(self, user_query: str, lora_adapter: str = None) -> AsyncGenerator:
        # Step 1: Retrieve relevant documents
        docs = await self.vector_store.search_async(user_query, k=self.retrieval_k)

        # Step 2: Rerank
        reranked_docs = self.reranker.rerank(user_query, docs, top_k=self.rerank_k)

        # Step 3: Assemble context
        context = self._build_context(reranked_docs)

        # Step 4: Build prompt
        prompt = self._build_prompt(user_query, context)

        # Step 5: Stream response
        async for token in self.llm.generate_stream(prompt, lora_adapter_id=lora_adapter):
            yield token

    def _build_context(self, docs: List[dict]) -> str:
        context_parts = []
        total_tokens = 0
        for doc in docs:
            doc_text = f"[Source: {doc['source']}]\n{doc['text']}"
            doc_tokens = len(doc_text.split()) * 1.3  # rough token estimate
            if total_tokens + doc_tokens > self.max_context_tokens:
                break
            context_parts.append(doc_text)
            total_tokens += doc_tokens
        return "\n\n---\n\n".join(context_parts)

    def _build_prompt(self, query: str, context: str) -> str:
        return (
            f"{self.system_prompt}\n\n"
            f"Context:\n{context}\n\n"
            f"User: {query}\nAssistant:"
        )

Cost Management

At 100K DAU with 10 requests per day at average 2,500 tokens (input + output), that is:

Daily token volume: 100K × 10 × 2,500 = 2.5 billion tokens per day
Output tokens (generation-heavy): ~500M per day
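Throughput requirement: 2.5B / 86,400s = ~29,000 tokens/second sustained across prompt and output tokens; output alone is 500M / 86,400s = ~5,800 tokens/second

One 8x A100 80GB node (NVLink) running vLLM with continuous batching and speculative decoding achieves approximately 2,000-4,000 output tokens/second for Llama-3-70B. Sizing against the full 29,000 tokens/second, which conservatively prices prompt tokens like generated ones and leaves room for daily peaks, gives 8-15 inference nodes; sizing against output tokens alone would give 2-3.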
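
The node count can be reproduced from the workload figures. This is a back-of-the-envelope sketch using the averages quoted in this case study; it ignores peak-to-average ratios and the different costs of prefill and decode.

```python
import math

DAU = 100_000
REQUESTS_PER_USER_PER_DAY = 10
INPUT_TOKENS = 2_000
OUTPUT_TOKENS = 500
SECONDS_PER_DAY = 86_400

requests_per_day = DAU * REQUESTS_PER_USER_PER_DAY
total_tps = requests_per_day * (INPUT_TOKENS + OUTPUT_TOKENS) / SECONDS_PER_DAY
output_tps = requests_per_day * OUTPUT_TOKENS / SECONDS_PER_DAY


def nodes_needed(demand_tps: float, node_tps: float) -> int:
    """Whole nodes required to sustain a given token rate."""
    return math.ceil(demand_tps / node_tps)


for node_tps in (2_000, 4_000):
    print(f"node at {node_tps} tok/s: "
          f"{nodes_needed(total_tps, node_tps)} nodes for all tokens, "
          f"{nodes_needed(output_tps, node_tps)} for output tokens only")
```

The 8-15 node range corresponds to dividing the all-token rate by per-node output throughput. Dividing only the output-token rate gives a much smaller fleet, so the choice of denominator dominates the cost estimate.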

Cost reduction strategies:

Prompt caching: System prompts and RAG context often share common prefixes. vLLM's prefix caching stores KV cache blocks for shared prefixes, reducing prefill compute by 40-60% for multi-turn conversations.

KV cache quantization: Quantize KV cache to INT8 (half the memory of FP16). Acceptable quality loss on most tasks. Doubles the effective batch size for the same VRAM.

Dynamic model selection: Route simple queries (short input, factual, no reasoning) to Llama-3-8B instead of Llama-3-70B. A query classifier determines routing. 8B costs approximately 8x less per token. Routing 60% of traffic to 8B reduces total cost by 50%.
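
The routing savings are a weighted average of the two per-token costs. A minimal sketch, assuming cost scales linearly with tokens and that routed queries have the same average length as the rest:

```python
def blended_cost(fraction_to_small: float, small_cost_ratio: float) -> float:
    """Cost relative to sending everything to the large model (1.0 = no savings)."""
    return (1 - fraction_to_small) + fraction_to_small * small_cost_ratio


saving = 1 - blended_cost(fraction_to_small=0.60, small_cost_ratio=1 / 8)
print(f"routing 60% of traffic to a model 8x cheaper saves {saving:.1%}")
```

The large model's share sets a floor on cost: with 40% of traffic still on the 70B model, savings cannot exceed 60% however cheap the small model becomes.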

from typing import AsyncGenerator


class ModelRouter:
    """Route queries to smaller models when the task does not require the large model."""

    def __init__(self, small_model_server, large_model_server, router_model):
        self.small = small_model_server
        self.large = large_model_server
        self.router = router_model  # fast classifier

    def route(self, query: str, context: str) -> str:
        """Return 'small' or 'large' based on query complexity."""
        combined_length = len(query.split()) + len(context.split())
        if combined_length > 800:
            return "large"  # Long context → large model

        complexity_score = self.router.predict_complexity(query)
        if complexity_score > 0.6:
            return "large"  # Complex reasoning → large model

        return "small"  # Simple factual, summarization → small model

    async def generate(self, query: str, context: str) -> AsyncGenerator[str, None]:
        model_choice = self.route(query, context)
        server = self.large if model_choice == "large" else self.small
        # Pass the context along: routing on it but generating from the bare
        # query would silently drop the retrieved documents.
        prompt = f"Context:\n{context}\n\nUser: {query}\nAssistant:" if context else query
        async for token in server.generate_stream(prompt):
            yield token

Production Engineering Notes

Observability Stack

LLM systems require specialized observability beyond standard metrics:

Token-level metrics: Input tokens per request, output tokens per request, tokens per second, KV cache usage %, queue depth.

Quality metrics: Hallucination rate (measured via fact-checking on a sample), relevance score (LLM-as-judge), user satisfaction (thumbs up/down).

Cost tracking: Tokens per dollar, cost per user, cost per feature, LoRA adapter usage.

Alerting thresholds: p99 TTFT above 3 seconds → page on-call. KV cache utilization above 90% → scale up. Queue depth above 1000 → scale up. Output throughput drops 20% → investigate.
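
The alerting thresholds translate into a small rule check. This sketch hard-codes the values from the text; the snapshot fields and action names are hypothetical, and in practice these rules would usually live in the monitoring system rather than application code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServingSnapshot:
    p99_ttft_seconds: float
    kv_cache_utilization: float   # 0.0-1.0
    queue_depth: int
    output_tps: float
    baseline_output_tps: float    # recent healthy throughput for comparison


def evaluate_alerts(snapshot: ServingSnapshot) -> List[str]:
    """Map a metrics snapshot to the actions named in the thresholds."""
    actions: List[str] = []
    if snapshot.p99_ttft_seconds > 3.0:
        actions.append("page_on_call")
    if snapshot.kv_cache_utilization > 0.90 or snapshot.queue_depth > 1000:
        actions.append("scale_up")
    if snapshot.output_tps < 0.80 * snapshot.baseline_output_tps:
        actions.append("investigate_throughput")
    return actions


degraded = ServingSnapshot(3.5, 0.95, 10, 1500.0, 2000.0)
healthy = ServingSnapshot(1.2, 0.60, 40, 2100.0, 2000.0)
print(evaluate_alerts(degraded))
print(evaluate_alerts(healthy))
```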

Handling Traffic Spikes

Marketing emails and product launches create 10x traffic spikes. Strategies:

Request queuing with timeout: Queue requests in Redis. Drop requests that wait more than 10 seconds. Return a 503 with a retry-after header. This prevents the system from being overwhelmed.

Preemptive scaling: Use predictable traffic patterns (email send times, business hours) to pre-scale before the spike, and use a Kubernetes HPA on custom GPU metrics to react to load that was not planned for.

Request shedding: Under extreme load, apply quality-of-service policies: prioritize paid users over free tier, short requests over long ones, cached prompt responses over fresh generation.
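
Queuing with a timeout and tier-based prioritization can be sketched with an in-process priority queue. This stands in for the Redis-backed queue described above; the two-tier priority scheme, the status codes returned by `submit`, and the `Retry-After` value are illustrative assumptions. Time is passed in explicitly so the logic is easy to test.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

MAX_WAIT_SECONDS = 10.0  # from the text: drop requests that waited longer


@dataclass(order=True)
class QueuedRequest:
    priority: int                              # 0 = paid, 1 = free (hypothetical tiers)
    sequence: int                              # FIFO tie-breaker within a tier
    enqueued_at: float = field(compare=False)
    request_id: str = field(compare=False)


class AdmissionQueue:
    def __init__(self, max_depth: int):
        self.max_depth = max_depth
        self.expired: List[str] = []           # IDs the caller should answer with 503
        self._heap: List[QueuedRequest] = []
        self._sequence = itertools.count()

    def submit(self, request_id: str, paid: bool, now: float) -> Tuple[int, Dict[str, str]]:
        """Return (HTTP status, headers) for an incoming request."""
        if len(self._heap) >= self.max_depth:
            return 503, {"Retry-After": "5"}   # shed load instead of queuing forever
        priority = 0 if paid else 1
        heapq.heappush(self._heap,
                       QueuedRequest(priority, next(self._sequence), now, request_id))
        return 202, {}

    def next_request(self, now: float) -> Optional[str]:
        """Pop the highest-priority request that has not timed out."""
        while self._heap:
            request = heapq.heappop(self._heap)
            if now - request.enqueued_at <= MAX_WAIT_SECONDS:
                return request.request_id
            self.expired.append(request.request_id)
        return None


queue = AdmissionQueue(max_depth=2)
print(queue.submit("free-1", paid=False, now=0.0))
print(queue.submit("paid-1", paid=True, now=1.0))
print(queue.submit("free-2", paid=False, now=1.0))
print(queue.next_request(now=5.0))
```

The paid request is served first even though it arrived later, and the third request is rejected immediately rather than joining a queue it would likely time out in.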

Common Mistakes

Mistake: Allocating maximum sequence length VRAM for every request.

Naively reserving VRAM equal to max_seq_len for every in-flight request means 8192 tokens × 320KB/token = 2.5GB reserved per request. At 100 concurrent requests, that is 250GB - half of the 500GB the cluster has left after model weights, held mostly for tokens that are never generated. PagedAttention avoids this by allocating KV cache pages on demand. Always use vLLM or TGI (which implement PagedAttention) for production LLM serving. Never use raw Hugging Face Transformers for high-throughput serving.

Mistake: Not implementing request timeouts and circuit breakers.

LLM generation is unbounded in time - a pathological prompt can trigger a maximum-length response that takes 60+ seconds. Without timeouts, one slow request can stall the serving queue. Implement per-request token budgets, maximum generation time limits (e.g., 30 seconds), and circuit breakers that return an error and release GPU resources rather than blocking other requests.
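
A per-request token budget and time limit can be layered over any async token stream. This is a minimal asyncio sketch: `fake_tokens` is a hypothetical stand-in for a model stream, and whether closing the upstream generator actually frees GPU resources depends on the serving engine, which is an assumption here.

```python
import asyncio
import time
from typing import AsyncIterator


class GenerationTimeout(Exception):
    """Raised when a request exceeds its wall-clock budget."""


async def bounded_stream(stream: AsyncIterator[str], max_tokens: int, max_seconds: float):
    """Re-yield items from `stream`, enforcing a token budget and a deadline."""
    deadline = time.monotonic() + max_seconds
    iterator = stream.__aiter__()
    emitted = 0
    try:
        while emitted < max_tokens:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise GenerationTimeout(f"exceeded {max_seconds}s")
            try:
                token = await asyncio.wait_for(iterator.__anext__(), timeout=remaining)
            except StopAsyncIteration:
                return
            except asyncio.TimeoutError:
                raise GenerationTimeout(f"exceeded {max_seconds}s") from None
            emitted += 1
            yield token
    finally:
        # Close the upstream generator so the producer can stop working.
        aclose = getattr(iterator, "aclose", None)
        if aclose is not None:
            await aclose()


async def fake_tokens(count: int, delay: float):
    """Stand-in for a model stream (hypothetical)."""
    for i in range(count):
        await asyncio.sleep(delay)
        yield f"t{i}"


async def collect(stream) -> list:
    return [token async for token in stream]


if __name__ == "__main__":
    print(asyncio.run(collect(bounded_stream(fake_tokens(5, 0.0), 3, 1.0))))
```

A circuit breaker would sit one level above this: after repeated `GenerationTimeout` errors it stops admitting new requests for a cooling-off period instead of letting them queue.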

Mistake: Using tensor parallelism across nodes without fast interconnect.

Tensor parallelism requires AllReduce communication at every transformer layer. With NVLink (within a node), AllReduce bandwidth is 600+ GB/s, which keeps per-layer overhead to a few milliseconds. Across nodes via InfiniBand, bandwidth drops to 100-200 GB/s, adding 20-50ms per layer for large models. For Llama-3-70B (80 layers), inter-node tensor parallelism adds 1.6-4 seconds of latency per request - unacceptable. Keep tensor parallelism within a single node. Use pipeline parallelism (split layers across nodes) for multi-node deployments, which has lower communication frequency.

Tip: Cache system prompt KV computations for massive savings in multi-turn chat.

Every conversation starts with the same system prompt (e.g., "You are a helpful assistant for XYZ product..."). Without prefix caching, every request recomputes the KV cache for the system prompt - pure wasted compute. vLLM's enable_prefix_caching=True automatically identifies and reuses KV cache blocks for shared prefixes. For a 500-token system prompt with 100K DAU × 10 requests/day, prefix caching saves 500M tokens/day of prefill compute - about 25% of the 2 billion prompt tokens processed per day at 2,000 input tokens per request.

Interview Q&A

Q: Walk me through the architecture for serving Llama-3-70B for a 100K DAU application.

A: Start with hardware requirements. Llama-3-70B in BF16 is 140GB, requiring at least 2x A100 80GB; we use 8x A100 80GB for adequate KV cache headroom. The serving layer uses vLLM for continuous batching and PagedAttention KV cache management. Continuous batching processes tokens from multiple requests in the same GPU forward pass, dramatically improving GPU utilization. PagedAttention allocates KV cache in blocks rather than pre-reserving max-length sequences, eliminating memory waste. We add speculative decoding with Llama-3-8B as the draft model - 5-token speculation gives approximately 2-3x throughput improvement on typical chat outputs. For multi-tenant LoRA serving, vLLM's LoRA support lets us serve 100+ fine-tuned variants on the same base model weights. For the RAG pipeline, retrieval and reranking run asynchronously before generation, and tokens stream to the client as soon as decoding starts. The traffic handling layer queues requests in Redis, applies timeouts, and routes simple queries to an 8B model to reduce cost.

Q: What is PagedAttention and why is it important for LLM serving?

A: The KV cache - the stored key-value activations for previous tokens in a sequence - is the primary memory bottleneck in LLM serving. Naive serving pre-allocates a contiguous VRAM block equal to max_sequence_length for every in-flight request. Most requests use far less than the maximum length, so this wastes enormous amounts of VRAM. Additionally, since allocations are contiguous, memory fragmentation prevents using the freed space from completed requests. PagedAttention manages the KV cache like an OS manages virtual memory. The logical sequence KV cache is split into fixed-size blocks (16 tokens each). Physical VRAM blocks are allocated on demand as generation proceeds. Non-contiguous physical blocks store the logical sequence's KV cache, with a page table mapping logical to physical positions. Benefits: near-zero memory waste (at most the partly filled last block of each sequence sits unused), no external fragmentation (blocks are fixed-size, so any free block fits any sequence), and shared prefix caching (requests sharing a system prompt share the same physical KV blocks). This enables 2-4x more concurrent requests for the same VRAM budget.

Q: How would you reduce the cost of running LLMs in production by 50%?

A: Four approaches in combination. First, model routing: classify queries by complexity and route simple queries (factual lookup, short summarization) to a smaller model like Llama-3-8B, which is roughly 8x cheaper per token. Even routing 40% of traffic to 8B reduces total cost by about 35%. Second, prefix caching: enable KV cache sharing for system prompts and common RAG context prefixes. For applications with a long fixed system prompt, this saves 30-40% of prefill compute. Third, output length control: many applications let the model generate far more tokens than necessary. Add aggressive max_tokens limits and use few-shot examples that demonstrate concise responses. Reducing average output from 800 to 400 tokens cuts generation cost in half. Fourth, KV cache quantization: quantizing KV cache from FP16 to INT8 reduces memory by 2x, enabling larger batch sizes and better GPU utilization for the same VRAM.

Q: What is speculative decoding and when does it not help?

A: Speculative decoding uses a small draft model to propose $k$ tokens speculatively, then verifies all $k$ tokens in a single large-model forward pass. If the draft tokens match what the large model would have generated, they are accepted at the cost of one large-model forward pass instead of $k$ - a speedup proportional to the acceptance rate. It helps when: the draft model is architecturally similar to the target (same family, same tokenizer), generation temperature is moderate, and output follows predictable patterns (code, structured text, common phrases). It does not help when: temperature is very high (draft tokens rarely match), prompts require highly specific domain knowledge the small model lacks, or the overhead of running the draft model on every token exceeds the savings from accepted tokens. Speculative decoding adds latency for individual requests when acceptance rates are below ~50%; it improves batch throughput by reducing the number of sequential large-model forward passes required.
