LLM Inference Optimization
Inference optimization is where engineering meets economics. Every millisecond of latency and every dollar of GPU cost matters at scale. Interviewers at companies like Anthropic, OpenAI, and Google DeepMind expect you to reason fluently about the full inference stack - from attention kernels to fleet-level cost optimization. This page covers every angle they will probe.
Why Interviewers Care
"Inference cost dominates our operating expense. A candidate who understands KV cache mechanics, batching strategies, and quantization trade-offs can save us millions per year. I want to see that you can reason about the full stack - from GPU memory hierarchy to fleet routing."
1. The Inference Bottleneck: Memory-Bound Decoding
Autoregressive Generation Fundamentals
LLM inference has two distinct phases:
- Prefill (prompt processing): All input tokens are processed in parallel. This phase is compute-bound - matrix multiplications dominate.
- Decode (token generation): Tokens are generated one at a time. Each step requires reading the full model weights but performs very little computation per byte read. This phase is memory-bandwidth-bound.
Key metrics:
- Time to First Token (TTFT): Dominated by prefill time. Critical for interactive applications.
- Time per Output Token (TPOT): Dominated by decode step latency. Determines streaming speed.
- Throughput: Tokens per second across all concurrent requests. Determines cost efficiency.
"LLM inference is memory-bandwidth-bound during decoding because each token generation reads the entire model weights but only produces one token. The arithmetic intensity is extremely low - on the order of 1-2 FLOPs per byte read. This is why optimizations focus on reducing memory reads (quantization, KV cache), increasing batch size (continuous batching), and reducing the number of decode steps (speculative decoding)."
The Roofline Model for LLM Inference
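The arithmetic intensity of autoregressive decoding for a model with $P$ parameters serving batch size $B$ (FP16 weights, 2 bytes per parameter) is approximately:
$$\text{AI} \approx \frac{2PB \ \text{FLOPs}}{2P \ \text{bytes}} = B \ \text{FLOPs/byte}$$
For an A100 GPU with 2 TB/s memory bandwidth and 312 TFLOPS compute, the crossover point is:
$$B^{*} \approx \frac{312 \times 10^{12} \ \text{FLOPs/s}}{2 \times 10^{12} \ \text{bytes/s}} = 156$$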
This means you need a batch size of roughly 156 before compute becomes the bottleneck. Most real-world serving scenarios operate well below this.
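The crossover arithmetic above can be checked with a few lines of Python. This is a minimal sketch under simplifying assumptions: it counts only weight reads (no KV cache traffic or attention FLOPs), and the function names are illustrative rather than taken from any library.

```python
def decode_arithmetic_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read in one decode step.

    Each parameter contributes ~2 FLOPs (multiply + add) per sequence in the
    batch, but is read from memory only once per step regardless of batch size.
    """
    flops_per_param = 2.0 * batch_size
    return flops_per_param / bytes_per_param


def crossover_batch_size(peak_flops: float, mem_bandwidth: float,
                         bytes_per_param: float = 2.0) -> float:
    """Batch size at which decoding stops being memory-bandwidth-bound."""
    ridge_point = peak_flops / mem_bandwidth  # FLOPs/byte the hardware can sustain
    return ridge_point * bytes_per_param / 2.0


A100_PEAK_FLOPS = 312e12  # FP16 peak, as quoted in the text
A100_BANDWIDTH = 2e12     # bytes per second, as quoted in the text

print(decode_arithmetic_intensity(1))                         # 1 FLOP/byte at batch size 1
print(crossover_batch_size(A100_PEAK_FLOPS, A100_BANDWIDTH))  # about 156
```

Passing `bytes_per_param=0.5` (4-bit weights) shows why quantization lowers the crossover batch size: fewer bytes are read per FLOP.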
2. KV Cache
How It Works
During autoregressive generation, the key and value projections from all previous tokens are cached so they do not need to be recomputed at each step.
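For a model with:
- $L$ layers
- $n_{kv}$ KV attention heads (equal to the number of query heads under standard multi-head attention)
- $d_h$ head dimension
- Sequence length $S$
- Batch size $B$
The KV cache size is:
$$\text{KV cache bytes} = 2 \times L \times n_{kv} \times d_h \times S \times B \times \text{bytes per element}$$
The leading factor of 2 accounts for storing both keys and values.
For Llama 2 70B (80 layers, 64 query heads reduced to 8 KV heads by GQA, $d_h = 128$) at FP16 with sequence length 4096:
$$2 \times 80 \times 8 \times 128 \times 4096 \times 2 \ \text{bytes} \approx 1.34 \ \text{GB per sequence}$$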
Candidates often forget that KV cache grows linearly with sequence length and batch size. With 100 concurrent sequences at 4K context, that is 134 GB just for KV cache - nearly as much as the 140 GB of FP16 model weights, and far more than a single 80 GB A100 can hold.
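The sizing arithmetic above is easy to get wrong under interview pressure, so here is a small sketch of it. The function name is illustrative; the Llama 2 70B values (80 layers, 8 KV heads, head dimension 128) are the ones used in this section.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Total KV cache size in bytes (bytes_per_elem=2 for FP16)."""
    # Leading factor of 2: one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


llama2_70b = dict(num_layers=80, num_kv_heads=8, head_dim=128)

per_seq = kv_cache_bytes(**llama2_70b, seq_len=4096)
fleet = kv_cache_bytes(**llama2_70b, seq_len=4096, batch_size=100)

print(f"{per_seq / 1e9:.2f} GB per sequence")    # 1.34 GB per sequence
print(f"{fleet / 1e9:.0f} GB for 100 sequences")  # 134 GB for 100 sequences
```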
Grouped-Query Attention (GQA) for KV Cache Reduction
GQA reduces the number of KV heads relative to query heads. If there are $n_q$ query heads and $n_{kv}$ KV heads, the KV cache shrinks by a factor of:
$$\frac{n_q}{n_{kv}}$$
Llama 2 70B uses GQA with 64 query heads and 8 KV heads - an 8x reduction in KV cache size.
| Model | Query Heads | KV Heads | Ratio | KV Cache per sequence at 4K tokens (FP16) |
|---|---|---|---|---|
| GPT-3 175B (MHA) | 96 | 96 | 1:1 | ~19.3 GB |
| Llama 2 70B (GQA) | 64 | 8 | 8:1 | ~1.34 GB |
| Llama 3 8B (GQA) | 32 | 8 | 4:1 | ~0.54 GB |
| Falcon 180B (GQA) | 232 | 8 | 29:1 | ~0.67 GB |
KV Cache Compression Techniques
- Sliding window attention: Only cache the last $W$ tokens (Mistral 7B uses $W = 4096$).
- Token eviction: Remove low-attention tokens from cache (H2O, ScissorHands).
- Quantized KV cache: Store keys and values in INT8 or INT4 instead of FP16.
- Cross-layer sharing: Share KV across adjacent layers.
3. Continuous Batching
Static vs. Continuous Batching
Static batching pads all sequences to the longest and waits for every request to finish. GPU utilization drops as shorter sequences complete.
Continuous batching (also called iteration-level batching or in-flight batching):
- New requests join the batch at every decode step.
- Completed requests leave immediately.
- The batch size dynamically adjusts.
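The join-and-leave behaviour described above can be illustrated with a toy step counter. This is a sketch under strong assumptions: every decode step costs the same regardless of batch composition, prefill is ignored, and each request produces at least one token. Both function names are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lens, max_batch):
    """Each batch runs until its longest request has finished."""
    steps = 0
    for i in range(0, len(output_lens), max_batch):
        steps += max(output_lens[i:i + max_batch])
    return steps


def continuous_batching_steps(output_lens, max_batch):
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(output_lens)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        steps += 1
        # Requests that just produced their last token leave immediately.
        running = [r - 1 for r in running if r > 1]
    return steps


lens = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lens, max_batch=4))      # 200
print(continuous_batching_steps(lens, max_batch=4))  # 110
```

In this example the second long request starts at step 11 instead of step 101, which is where almost all of the saving comes from.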
"Continuous batching replaces the request-level scheduling of static batching with iteration-level scheduling. At every decode step, finished sequences are evicted and new sequences can join. This dramatically improves GPU utilization because the GPU is never idle waiting for the longest sequence in a batch to finish. vLLM, TGI, and TensorRT-LLM all implement this."
PagedAttention (vLLM)
PagedAttention borrows virtual memory concepts from operating systems:
- KV cache is divided into fixed-size blocks (e.g., 16 tokens each).
- A block table maps logical KV positions to physical GPU memory blocks.
- Blocks are allocated on demand and freed when sequences complete.
- No external fragmentation, and internal fragmentation is limited to the unused slots in each sequence's last block.
- Enables copy-on-write for beam search and parallel sampling.
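The block-table idea in the list above can be sketched as a tiny allocator. This is an illustrative toy, not vLLM's implementation: the class and method names are hypothetical, blocks are plain integers rather than GPU memory, and copy-on-write is not modelled.

```python
class BlockAllocator:
    """Maps logical token positions to fixed-size physical KV blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        # Allocate on demand: only when the last block is full (or none exists).
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real scheduler would preempt")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def physical_slot(self, seq_id, pos):
        """Translate a logical position into (physical block, offset)."""
        block = self.block_tables[seq_id][pos // self.block_size]
        return block, pos % self.block_size

    def free(self, seq_id):
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.seq_lens[seq_id]


alloc = BlockAllocator(num_blocks=8)
for _ in range(20):
    alloc.append_token("request-a")

used_blocks = len(alloc.block_tables["request-a"])
print(used_blocks)                              # 2
print(used_blocks * alloc.block_size - 20)      # 12 unused slots, all in the last block
```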
Memory waste comparison:
| Strategy | Fragmentation | Memory Utilization |
|---|---|---|
| Static pre-allocation | Up to 60-70% wasted | ~30-40% |
| Continuous batching (naive) | Variable | ~50-70% |
| PagedAttention | Near zero (last block only) | ~95%+ |
4. Speculative Decoding
Core Idea
Use a small, fast draft model to generate $K$ candidate tokens, then verify all $K$ tokens in a single forward pass of the large target model. If the target model agrees with the draft tokens, you get $K$ tokens for approximately the cost of one target model forward pass.
Mathematical Guarantee
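Speculative decoding is mathematically lossless - the output distribution is identical to the target model. With target distribution $p$ and draft distribution $q$, the acceptance probability for draft token $x$ is:
$$P(\text{accept } x) = \min\left(1, \frac{p(x)}{q(x)}\right)$$
If a draft token is rejected, a correction token is sampled from:
$$p'(x) = \frac{\max(0,\ p(x) - q(x))}{\sum_{x'} \max(0,\ p(x') - q(x'))}$$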
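The accept/reject rule above can be sketched for a single token position. This is a minimal sketch using only the standard library: the distributions are plain probability lists over a toy vocabulary, and both function names are illustrative.

```python
import random


def verify_draft_token(token, p_target, q_draft, rng=random):
    """Return (emitted_token, accepted) for one drafted token.

    p_target and q_draft are probability lists over the same vocabulary;
    `token` must have been sampled from q_draft, so q_draft[token] > 0.
    """
    accept_prob = min(1.0, p_target[token] / q_draft[token])
    if rng.random() < accept_prob:
        return token, True
    # Rejected: resample from the normalized positive part of (p - q).
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    total = sum(residual)  # > 0 here, because rejection implies p != q
    weights = [r / total for r in residual]
    corrected = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return corrected, False


def speculative_sample(p_target, q_draft, rng=random):
    """Draft one token from q, then verify it against p."""
    draft = rng.choices(range(len(q_draft)), weights=q_draft, k=1)[0]
    return verify_draft_token(draft, p_target, q_draft, rng)[0]
```

Sampling many tokens through `speculative_sample` and comparing the empirical frequencies with `p_target` is a quick way to convince yourself of the losslessness claim: the accepted mass $\min(p, q)$ plus the resampled mass $\max(0, p - q)$ adds up to exactly $p$.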
Expected Speedup
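If the acceptance rate per token is $\alpha$ and draft length is $K$, the expected number of tokens produced per verification step is:
$$\mathbb{E}[\text{tokens}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$
For example, $\alpha = 0.8$ and $K = 5$ (illustrative values) give about 3.7 tokens per step; after draft-model overhead, acceptance rates in this range yield roughly a 2.5-3x wall-clock speedup.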
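The expected-token formula and the resulting speedup can be computed directly. This is a rough sketch: it assumes a constant, independent per-token acceptance rate and a cycle made of $K$ sequential draft steps plus one target pass. The function names and example timings are illustrative.

```python
def expected_tokens_per_cycle(alpha, k):
    """Expected tokens emitted per verification step."""
    if alpha == 1.0:
        return k + 1  # limit of the geometric sum when every draft is accepted
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speculative_speedup(alpha, k, target_ms, draft_ms):
    """Wall-clock speedup over plain decoding with the target model."""
    cycle_ms = k * draft_ms + target_ms  # K draft steps + one verification pass
    tokens = expected_tokens_per_cycle(alpha, k)
    return tokens * target_ms / cycle_ms


# Illustrative values, not measurements.
print(round(expected_tokens_per_cycle(0.8, 5), 2))                        # 3.69
print(round(speculative_speedup(0.8, 5, target_ms=40, draft_ms=2), 2))    # 2.95
```

The second call shows how sensitive the result is to draft cost: the same acceptance rate with a slower draft model gives a noticeably smaller speedup.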
Variants
| Variant | Draft Source | Notes |
|---|---|---|
| Standard speculative | Smaller model of same family | Llama 8B drafts for Llama 70B |
| Self-speculative | Early-exit from same model | No separate model needed |
| Medusa | Additional prediction heads | Trained heads predict future tokens |
| EAGLE | Feature-level draft | Drafts in feature space, not token space |
| Lookahead | Jacobi iteration | Parallel decoding via fixed-point iteration |
Speculative decoding only helps latency, not throughput. If the GPU is already saturated with a large batch, the draft model adds overhead without benefit because the bottleneck shifts to compute rather than memory bandwidth.
5. Quantization
Why Quantize
Quantization reduces the number of bits per parameter, which:
- Reduces memory - a 70B model goes from 140 GB (FP16) to 35 GB (INT4).
- Increases throughput - fewer bytes to read per decode step.
- Enables deployment on consumer hardware - fit larger models on fewer GPUs.
Quantization Formats
Weight-Only Quantization
| Method | Bits | Calibration | Key Idea |
|---|---|---|---|
| GPTQ | 3-4 bit | Yes (small dataset) | Layer-wise quantization via approximate second-order optimization (OBS) |
| AWQ | 4 bit | Yes | Protect salient weight channels; scale before quantizing |
| GGUF | 2-8 bit | No / Yes | CPU-friendly format; used by llama.cpp; per-block quantization |
| bitsandbytes | 4-8 bit | No (on-the-fly) | NF4 data type; double quantization; integrated with HuggingFace |
| SqueezeLLM | 3-4 bit | Yes | Non-uniform quantization; sensitivity-based bit allocation |
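As a baseline for the methods in the table, here is a sketch of plain round-to-nearest quantization with one scale per block. It is deliberately naive: it is neither GPTQ nor AWQ, and it does not reproduce any specific GGUF layout. The function names are illustrative.

```python
def quantize_block(weights, bits=4):
    """Symmetric round-to-nearest quantization with one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_block(q, scale):
    return [v * scale for v in q]


def max_abs_error(weights, block_size, bits=4):
    """Worst-case reconstruction error when each block gets its own scale."""
    err = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        q, scale = quantize_block(block, bits)
        restored = dequantize_block(q, scale)
        err = max(err, max(abs(w - r) for w, r in zip(block, restored)))
    return err


# 32 small weights followed by a block containing one large outlier.
weights = [0.01 * i for i in range(-16, 16)] + [8.0] + [0.0] * 31

per_tensor = max_abs_error(weights, block_size=64)  # one scale for everything
per_block = max_abs_error(weights, block_size=32)   # outlier confined to its block
print(per_block < per_tensor)                       # True
```

With a single scale, the outlier forces every small weight to round to zero. Per-block scales confine the damage to one block, which is the intuition behind per-block formats; GPTQ and AWQ go further by choosing what to round and how to compensate.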
GPTQ Deep Dive
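GPTQ quantizes weights layer by layer by solving:
$$\hat{W} = \arg\min_{\hat{W}} \lVert WX - \hat{W}X \rVert_2^2$$
where $W$ is the layer's original weight matrix, $\hat{W}$ its quantized counterpart, and $X$ is the layer input computed from a calibration dataset. It builds on the Optimal Brain Surgeon (OBS) framework to find the quantized weights that minimize reconstruction error. Quantization order matters less than one might expect: rather than greedily picking the next weight by quantization error, as earlier OBS-style methods do, GPTQ quantizes columns in a fixed order shared across all rows and updates the remaining unquantized weights after each column to compensate for the error introduced.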
AWQ Deep Dive
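AWQ observes that a small fraction of weights (about 1%) are salient - they correspond to large activation magnitudes. Rather than keeping these in higher precision, AWQ applies per-channel scaling:
$$y = Q\big(W \cdot \operatorname{diag}(s)\big) \cdot \big(\operatorname{diag}(s)^{-1} x\big)$$
where $Q$ is the quantization function and $s$ is a per-input-channel scale chosen to minimize quantization error for the salient channels.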
"GPTQ minimizes layer-wise reconstruction error using calibration data and the OBS framework. AWQ protects salient channels by scaling weights before quantization - it observes that 1% of channels carry disproportionate information. GGUF is a CPU-friendly format used by llama.cpp with per-block quantization. bitsandbytes provides on-the-fly NF4 quantization integrated with HuggingFace, requiring no calibration data."
Weight + Activation Quantization
| Method | Weights | Activations | Notes |
|---|---|---|---|
| SmoothQuant | INT8 | INT8 | Migrates quantization difficulty from activations to weights |
| FPTQ | INT4 | FP8 | Mixed precision |
| FP8 (H100 native) | FP8 | FP8 | Hardware-supported on Hopper GPUs |
Quantization Quality Comparison
Typical perplexity impact on Llama 2 70B (WikiText-2):
| Precision | Perplexity | Model Size | Speed vs FP16 |
|---|---|---|---|
| FP16 (baseline) | 3.32 | 140 GB | 1.0x |
| INT8 (bitsandbytes) | 3.33 | 70 GB | 1.2-1.5x |
| INT4 (GPTQ) | 3.39 | 35 GB | 1.5-2.0x |
| INT4 (AWQ) | 3.36 | 35 GB | 1.5-2.0x |
| INT3 (GPTQ) | 3.61 | 26 GB | 1.8-2.2x |
"Quantization just rounds weights to fewer bits." This misses the sophisticated optimization that methods like GPTQ and AWQ perform. GPTQ uses second-order information to minimize reconstruction error. AWQ identifies and protects salient channels. Understanding these details separates senior from junior candidates.
6. Flash Attention for Inference
Review of Flash Attention
Flash Attention avoids materializing the full attention matrix by computing attention in tiles that fit in GPU SRAM:
- Load a block of Q, K, V into SRAM.
- Compute local attention scores and output.
- Use online softmax to combine blocks without storing the full matrix.
Memory complexity: $O(N)$ instead of $O(N^2)$, where $N$ is the sequence length.
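The online-softmax step is the part candidates most often hand-wave, so here is a sketch of it for a single query vector in pure Python. It illustrates the numerics only: there is no SRAM tiling, batching, or kernel fusion, and the function names are illustrative.

```python
import math


def attention_naive(q, keys, values):
    """Reference: materialize every score, then apply softmax."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    return [sum(e * v[d] for e, v in zip(exps, values)) / denom
            for d in range(len(values[0]))]


def attention_tiled(q, keys, values, tile=2):
    """One pass over K/V tiles, keeping only a running max, sum, and output."""
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in k_tile]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # re-base old partial results
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(exps)
        acc = [a * rescale + sum(e * v[d] for e, v in zip(exps, v_tile))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same output up to floating-point error, but the tiled version never holds more than one tile of scores at a time.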
Flash Attention in Inference Context
During prefill, Flash Attention provides significant speedups because the full prompt attention matrix would otherwise be materialized.
During decode, each step only computes attention between one new query token and all cached keys/values. Flash Attention still helps by:
- Avoiding materializing the attention row in HBM.
- Fusing the softmax, matmul, and scaling into a single kernel.
FlashDecoding and FlashDecoding++
Standard decode attention is parallelized across batch and heads. FlashDecoding adds parallelism across the KV sequence length dimension:
- Split the KV cache into chunks.
- Compute partial attention for each chunk in parallel.
- Reduce (combine) the partial results using the log-sum-exp trick.
This is critical for long-context inference where the KV sequence can be 100K+ tokens.
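The split-and-reduce steps above can be sketched for one query vector. This is a hedged illustration of the log-sum-exp merge only: the chunks are processed in a Python loop rather than in parallel GPU thread blocks, and the function names are hypothetical.

```python
import math


def partial_attention(q, keys, values):
    """Attention over one KV chunk: (chunk_max, exp_sum, unnormalized_output)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    out = [sum(e * v[d] for e, v in zip(exps, values)) for d in range(len(values[0]))]
    return m, sum(exps), out


def merge_partials(partials):
    """Combine per-chunk results by rescaling each one to the global max."""
    global_max = max(m for m, _, _ in partials)
    total = 0.0
    merged = [0.0] * len(partials[0][2])
    for m, s, out in partials:
        w = math.exp(m - global_max)
        total += w * s
        merged = [acc + w * o for acc, o in zip(merged, out)]
    return [x / total for x in merged]


def split_kv_attention(q, keys, values, num_chunks):
    size = math.ceil(len(keys) / num_chunks)
    # Each chunk is independent of the others, which is what allows
    # parallel execution across the KV sequence dimension on a GPU.
    partials = [partial_attention(q, keys[i:i + size], values[i:i + size])
                for i in range(0, len(keys), size)]
    return merge_partials(partials)
```

Calling `split_kv_attention` with `num_chunks=1` gives ordinary attention, so comparing it against a larger chunk count is a simple correctness check for the merge.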
7. Model Serving Frameworks
Framework Comparison
| Feature | vLLM | TensorRT-LLM | TGI | Ollama |
|---|---|---|---|---|
| PagedAttention | Yes (inventor) | Yes | Yes | Via llama.cpp |
| Continuous Batching | Yes | Yes | Yes | Limited |
| Speculative Decoding | Yes | Yes | Yes | No |
| Quantization | GPTQ, AWQ, FP8 | INT4, INT8, FP8 | GPTQ, AWQ, bitsandbytes | GGUF (all levels) |
| Tensor Parallelism | Yes | Yes | Yes | No |
| Pipeline Parallelism | Limited | Yes | No | No |
| Hardware | NVIDIA, AMD, TPU | NVIDIA only | NVIDIA, AMD | CPU, NVIDIA, Apple Silicon |
| Best For | General GPU serving | Max throughput NVIDIA | HuggingFace ecosystem | Local / edge deployment |
| API | OpenAI-compatible | Triton + OpenAI | REST + gRPC | REST (OpenAI-compatible) |
vLLM Architecture
Key vLLM features:
- Preemption policies: When GPU memory is full, vLLM can either swap KV cache to CPU or recompute it later.
- Prefix caching: Shared system prompts are cached and reused across requests.
- Chunked prefill: Long prompts are split into chunks to avoid blocking decode steps.
TensorRT-LLM
TensorRT-LLM compiles models into optimized TensorRT engines:
- Kernel fusion: Combines multiple operations into single GPU kernels.
- In-flight batching: NVIDIA's implementation of continuous batching.
- FP8 on Hopper: Native FP8 support for H100 GPUs with near-zero accuracy loss.
- KV cache quantization: INT8 KV cache with minimal quality impact.
NVIDIA/GPU-heavy companies expect deep TensorRT-LLM knowledge. Startups typically use vLLM or TGI for faster iteration. Apple/edge companies care about GGUF and on-device inference. Google will ask about TPU-specific optimizations (Pallas, XLA).
Ollama and Local Inference
Ollama wraps llama.cpp for local deployment:
- Uses GGUF format with various quantization levels (Q2_K through Q8_0).
- Supports Apple Metal, CUDA, and CPU inference.
- Modelfile system for customizing system prompts and parameters.
- Growing ecosystem for local AI applications.
8. Latency vs. Throughput Optimization
The Fundamental Trade-off
| Optimization | Latency Impact | Throughput Impact |
|---|---|---|
| Increase batch size | Worse (more memory pressure) | Better (amortize weight reads) |
| Speculative decoding | Better (fewer steps) | Neutral to worse |
| Quantization | Better (less memory to read) | Better (fit larger batches) |
| Tensor parallelism | Better (split across GPUs) | Neutral (communication overhead) |
| Continuous batching | Neutral | Much better |
| Flash Attention | Better (prefill) | Better (memory savings) |
| Prefix caching | Better TTFT | Better (avoid recomputation) |
Latency Breakdown
For a typical LLM serving request:
Total Latency = TTFT + (num_output_tokens × TPOT)
TTFT = Network latency + Queue wait + Prefill time
TPOT = Decode step time (model forward + sampling + scheduling overhead)
Typical numbers for Llama 2 70B on 2x A100:
- TTFT: 200-500ms (depending on prompt length)
- TPOT: 30-50ms per token
- For 200 output tokens: 6-10 seconds total
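The decomposition above is simple enough to encode as a helper, which is handy when checking a proposed SLO against measured TTFT and TPOT. A minimal sketch; the function name is illustrative and the inputs are the typical ranges quoted in this section.

```python
def total_latency_ms(ttft_ms, tpot_ms, num_output_tokens):
    """Total Latency = TTFT + (num_output_tokens x TPOT)."""
    return ttft_ms + num_output_tokens * tpot_ms


best_case = total_latency_ms(ttft_ms=200, tpot_ms=30, num_output_tokens=200)
worst_case = total_latency_ms(ttft_ms=500, tpot_ms=50, num_output_tokens=200)

print(best_case / 1000)   # 6.2 seconds
print(worst_case / 1000)  # 10.5 seconds
```

Decode time (6-10 seconds here) dominates for long outputs, so TPOT improvements matter far more than TTFT once responses exceed a few dozen tokens.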
"I want candidates to decompose latency into TTFT and TPOT, explain which optimizations target which metric, and understand the latency-throughput trade-off. Bonus points for discussing how to set SLOs (e.g., p50 TTFT under 500ms, p99 TPOT under 100ms) and how to achieve them."
9. Cost Optimization Strategies
Semantic Caching
Cache LLM responses for semantically similar queries:
- Embed the incoming query.
- Search a vector store for similar past queries (cosine similarity above threshold, e.g., 0.95).
- If a match is found, return the cached response.
- Otherwise, call the LLM and cache the result.
Savings: Can reduce LLM calls by 20-40% for applications with repetitive queries (customer support, FAQ bots).
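The four-step lookup flow above can be sketched as follows. Everything here is a stand-in: `toy_embed` is a hypothetical keyword-count embedding used only to keep the example self-contained, a linear scan replaces the vector store, and `call_llm` is any callable you supply.

```python
import math

VOCAB = ["reset", "password", "order", "status", "refund"]  # toy vocabulary


def toy_embed(text):
    """Hypothetical embedding: keyword counts. Use a real model in practice."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # (embedding, response) pairs

    def lookup(self, query):
        vec = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def get_or_call(self, query, call_llm):
        """Return (response, cache_hit)."""
        cached = self.lookup(query)
        if cached is not None:
            return cached, True
        response = call_llm(query)
        self.entries.append((self.embed(query), response))
        return response, False
```

The threshold is the main tuning knob. Too low and users receive answers to a different question; too high and the hit rate collapses. A production cache would also need expiry and per-user scoping, which this sketch omits.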
Prompt Caching (Provider-Level)
Anthropic and OpenAI offer prompt caching where repeated prefixes are cached server-side:
- Only the new portion of the prompt incurs full cost.
- Cache hits are billed at reduced rates (e.g., 90% discount on cached tokens with Anthropic).
- Critical for RAG systems with large, repeated system prompts.
Tiered Model Routing
Router implementation approaches:
- Keyword/rule-based: Simple, fast, but brittle.
- Classifier-based: Train a small model to predict which LLM tier is needed.
- Embedding similarity: Route based on query similarity to examples requiring different tiers.
- Adaptive: Start with a small model; escalate if confidence is low.
Cost savings example:
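- Without routing: 100% of queries to GPT-4o at $5.00 per M tokens.
- With routing: 60% to mini ($0.09) + 30% to 4o ($1.50) + 10% to 4.5 ($1.50) = $3.09 per M tokens, where each figure in parentheses is that tier's contribution to the blended cost. A 38% reduction.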
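The blended-cost arithmetic can be reproduced with a short helper. The per-tier prices and the 30% middle share below are assumptions chosen to be consistent with the $3.09 blended figure and the $5.00 baseline; they are not quoted provider prices, and the function name is illustrative.

```python
def blended_cost(tiers):
    """tiers: list of (traffic_share, price_per_million_tokens)."""
    total_share = sum(share for share, _ in tiers)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * price for share, price in tiers)


baseline = 5.00  # everything goes to the mid-tier model

routed = blended_cost([
    (0.60, 0.15),   # small model (assumed price)
    (0.30, 5.00),   # mid-tier model
    (0.10, 15.00),  # most capable model (assumed price)
])
savings = 1 - routed / baseline

print(f"${routed:.2f} per M tokens")  # $3.09 per M tokens
print(f"{savings:.0%} reduction")     # 38% reduction
```

Note that the 10% of traffic sent to the most capable tier costs as much as the 30% sent to the mid tier, so router precision on the top tier matters most.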
Other Cost Strategies
| Strategy | Savings | Complexity |
|---|---|---|
| Prompt compression | 30-50% on input tokens | Low |
| Response length limits | Variable | Low |
| Semantic caching | 20-40% of calls | Medium |
| Tiered routing | 30-50% | Medium |
| Self-hosted vs. API | 50-80% at scale | High |
| Spot instances | 60-70% on GPU cost | High |
Startups obsess over API costs and favor routing + caching. Large companies focus on self-hosted inference optimization (vLLM tuning, quantization). AI labs care about training-inference co-optimization and fleet utilization.
10. Advanced Topics
Tensor Parallelism vs. Pipeline Parallelism
Tensor Parallelism (TP): Split individual layers across GPUs. Each GPU holds a slice of every layer.
- Best for latency reduction.
- Requires high-bandwidth interconnect (NVLink).
- Typical: 2-8 GPUs within a single node.
Pipeline Parallelism (PP): Split layers sequentially across GPUs. Each GPU holds a contiguous group of layers.
- Best for throughput (pipeline different micro-batches).
- Works with lower-bandwidth interconnect.
- Typical: across nodes.
For a 70B model:
- 2 GPU, TP=2: Each GPU holds half of every layer. Latency is roughly halved.
- 2 GPU, PP=2: GPU 1 holds layers 0-39, GPU 2 holds layers 40-79. Latency is similar to 1 GPU (pipeline bubble), but throughput doubles.
Disaggregated Prefill and Decode
Separate prefill and decode into different GPU pools:
- Prefill pool: Optimized for compute (high batch size, large matrices).
- Decode pool: Optimized for memory bandwidth (continuous batching, long KV cache).
This allows different GPU types and configurations for each phase. Emerging approach used at scale by companies like Databricks (MoAI) and various startups.
Structured Output Optimization
When generating JSON or structured output:
- Constrained decoding: Mask logits to only allow valid tokens according to a grammar/schema. Eliminates retries.
- Outlines / jsonformer: Libraries that enforce output structure during generation.
- Batched structured generation: vLLM supports grammar-guided generation at batch level.
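Constrained decoding, the first item above, comes down to masking logits before picking a token. The sketch below uses a hypothetical six-token vocabulary and a hard-coded transition table standing in for a real grammar; it is not how Outlines or vLLM implement it.

```python
import math

VOCAB = ["{", '"k"', ":", "1", "}", "hello"]

# Hypothetical grammar that only admits {"k":1}: previous token -> allowed next tokens.
NEXT_ALLOWED = {None: ["{"], "{": ['"k"'], '"k"': [":"], ":": ["1"], "1": ["}"]}


def mask_logits(logits, allowed_ids):
    """Set every disallowed token's logit to -inf so it can never be chosen."""
    allowed = set(allowed_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def constrained_greedy_decode(logits_fn, max_steps=10):
    out, prev = [], None
    for _ in range(max_steps):
        if prev not in NEXT_ALLOWED:  # grammar has no continuation: finished
            break
        allowed_ids = [VOCAB.index(t) for t in NEXT_ALLOWED[prev]]
        masked = mask_logits(logits_fn(out), allowed_ids)
        token_id = max(range(len(masked)), key=masked.__getitem__)
        prev = VOCAB[token_id]
        out.append(prev)
    return "".join(out)


def chatty_model(_prefix):
    """Stand-in model that always prefers the token 'hello'."""
    return [0.1, 0.2, 0.3, 0.4, 0.5, 9.0]


print(constrained_greedy_decode(chatty_model))  # {"k":1}
```

Even though the stand-in model always prefers `hello`, the mask forces valid output on the first attempt, which is why constrained decoding eliminates retries.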
Practice Problems
Problem 1: KV Cache Memory Planning
You are deploying Llama 3 70B (80 layers, 8 KV heads, head dim 128) on 4x A100 80GB GPUs with tensor parallelism. The target is serving 64 concurrent requests at up to 8K context length with FP16 KV cache. Will the KV cache fit?
Hint 1 - Direction
Calculate the total KV cache memory for 64 concurrent sequences at 8K length. Then calculate available memory after loading the model weights.
Hint 2 - Insight
KV cache per sequence = $2 \times 80 \times 8 \times 128 \times 8192 \times 2$ bytes. Model weights at FP16 = 140 GB. With TP=4, each GPU holds 35 GB of weights, leaving 45 GB for KV cache.
Full Solution + Rubric
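KV cache per sequence:
$$2 \times 80 \times 8 \times 128 \times 8192 \times 2 \ \text{bytes} \approx 2.68 \ \text{GB}$$
Total KV cache for 64 sequences:
$$64 \times 2.68 \ \text{GB} \approx 171.5 \ \text{GB}$$
With TP=4, KV cache is also split across GPUs:
$$171.5 \ \text{GB} / 4 \approx 42.9 \ \text{GB per GPU}$$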
Available per GPU: 80 GB - 35 GB (weights) - 2 GB (overhead) = 43 GB.
Answer: It barely fits with about 0.1 GB of headroom, which is too tight. Solutions:
- Use INT8 KV cache to halve KV memory (21.4 GB per GPU).
- Reduce max concurrent requests to 48.
- Use INT4 model weights (GPTQ) to free more memory.
Scoring:
- Strong Hire: Correctly calculates KV cache, accounts for TP splitting, identifies the tight fit, and proposes solutions with trade-off analysis.
- Lean Hire: Gets the calculation right but misses TP splitting or does not propose solutions.
- No Hire: Cannot set up the calculation or significantly miscalculates.
Problem 2: Speculative Decoding ROI
Your 70B model runs at 40ms per token. You have a 7B draft model that runs at 5ms per token. The acceptance rate is 0.75 with draft length K=4. What is the effective tokens-per-second rate with speculative decoding? Is it worth deploying?
Hint 1 - Direction
Calculate the expected accepted tokens per verification cycle, then the total time per cycle.
Hint 2 - Insight
Expected accepted tokens = $\frac{1 - \alpha^{K+1}}{1 - \alpha}$. Time per cycle = $K \times t_{\text{draft}} + t_{\text{target}}$.
Full Solution + Rubric
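Expected accepted tokens per cycle:
$$\frac{1 - 0.75^{5}}{1 - 0.75} \approx 3.05$$
Time per cycle:
$$4 \times 5 \ \text{ms} + 40 \ \text{ms} = 60 \ \text{ms}$$
Effective tokens per second:
$$\frac{3.05}{0.060 \ \text{s}} \approx 50.8 \ \text{tokens/s}$$
Without speculative decoding:
$$\frac{1}{0.040 \ \text{s}} = 25 \ \text{tokens/s}$$
Speedup:
$$\frac{50.8}{25} \approx 2.03\times$$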
Is it worth it? Yes for latency-sensitive single-request scenarios. But consider:
- The draft model consumes GPU memory (about 14 GB for 7B at FP16).
- In high-throughput scenarios, the extra memory for the draft model could be used for larger batches instead.
- Acceptance rate depends on query distribution - code completion typically has higher acceptance than creative writing.
Scoring:
- Strong Hire: Correct calculation, discusses when speculative decoding is beneficial vs. not, mentions memory trade-off.
- Lean Hire: Correct calculation but limited discussion of trade-offs.
- No Hire: Cannot apply the expected value formula or makes fundamental errors.
Problem 3: Cost Optimization Architecture
You are spending $50K/month on OpenAI API calls for a customer support chatbot handling 1M conversations/month. Design an optimization strategy to reduce costs by 50% without degrading quality.
Hint 1 - Direction
Think about which conversations are simple (FAQ-like) vs. complex. Consider caching, routing, and prompt optimization.
Hint 2 - Insight
Layer your approach: semantic caching for repeated questions, tiered routing for complexity, prompt compression for all queries. Measure quality with automated evals before and after.
Full Solution + Rubric
Architecture:
- Semantic caching (saves ~25%): Embed queries, cache responses for similarity above 0.95. Customer support has high query repetition.
- Tiered routing (saves ~35% of remaining):
  - Train a classifier on historical conversations.
  - Route ~50% of simple queries (password reset, order status) to GPT-4o-mini.
  - Route ~40% of medium queries to GPT-4o.
  - Route only ~10% of complex queries (complaints, escalations) to the most capable model.
- Prompt optimization (saves ~20% of remaining):
  - Compress system prompt from 2000 to 800 tokens (remove examples the model already handles).
  - Use Anthropic/OpenAI prompt caching for shared prefixes.
  - Limit response length with stop sequences.
- Quality assurance:
  - A/B test with human evaluation on 1% of traffic.
  - Automated quality scoring (LLM-as-judge) on all responses.
  - Escalation rate monitoring per tier.
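Projected savings: 25% + (75% x 35%) + (48.75% x 20%) = approximately 61% reduction (equivalently, 1 - 0.75 x 0.65 x 0.80 = 0.61), which clears the 50% target with some margin for estimates that turn out optimistic.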
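Because each layer's saving applies only to what the previous layers left over, the stages multiply rather than add. A minimal sketch of that compounding, using the stage estimates from the solution above (the function name is illustrative):

```python
def compounded_reduction(stage_savings):
    """Total reduction when each stage saves a fraction of the remaining cost."""
    remaining = 1.0
    for saving in stage_savings:
        remaining *= (1.0 - saving)
    return 1.0 - remaining


stages = [0.25, 0.35, 0.20]  # caching, routing, prompt optimization
reduction = compounded_reduction(stages)
monthly_before = 50_000

print(f"{reduction:.0%}")                             # 61%
print(f"${monthly_before * (1 - reduction):,.0f}")    # $19,500
```

Naively adding the three percentages would give 80%, which is the mistake to watch for when presenting this on a whiteboard.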
Scoring:
- Strong Hire: Proposes a layered strategy with specific savings estimates, quality monitoring, and rollout plan.
- Lean Hire: Mentions routing and caching but lacks specificity or quality safeguards.
- No Hire: Only suggests "use a cheaper model" without architectural thinking.
Interview Cheat Sheet
| Topic | Key Fact | Typical Question |
|---|---|---|
| KV Cache | Size = 2 x layers x KV heads x head dim x seq len x batch x bytes per element | "How much memory does the KV cache use?" |
| GQA | Reduces KV heads; Llama 2 70B uses 8:1 ratio | "How does GQA reduce memory?" |
| Continuous Batching | Iteration-level scheduling; evict/add per step | "Why is continuous batching better than static?" |
| PagedAttention | OS-style virtual memory for KV cache; ~95% utilization | "How does vLLM manage KV cache memory?" |
| Speculative Decoding | Mathematically lossless; uses $\min(1, p(x)/q(x))$ acceptance (target $p$, draft $q$) | "Does speculative decoding change the output distribution?" |
| GPTQ | Layer-wise OBS; calibration data needed | "How does GPTQ differ from naive rounding?" |
| AWQ | Protects salient channels via scaling | "Why does AWQ outperform GPTQ at the same bit width?" |
| Flash Attention | Tiled computation; $O(N)$ memory | "Why is Flash Attention faster despite more FLOPs?" |
| Latency vs Throughput | Batch size is the key lever | "How do you optimize for latency vs throughput?" |
| Cost Optimization | Caching + routing + compression | "How would you cut inference costs by 50%?" |
Spaced Repetition Checkpoints
Day 0 (Today)
- Explain why autoregressive decoding is memory-bandwidth-bound
- Calculate KV cache size for a given model configuration
- Describe continuous batching vs. static batching
- List 4 quantization methods and their key differences
Day 3
- Derive the speculative decoding acceptance probability
- Explain PagedAttention and its memory utilization benefit
- Compare vLLM, TensorRT-LLM, TGI, and Ollama
- Calculate expected speedup from speculative decoding
Day 7
- Design a cost optimization strategy for a $50K/month LLM deployment
- Explain the latency-throughput trade-off with specific optimizations
- Describe FlashDecoding and why it matters for long-context inference
- Compare tensor parallelism vs. pipeline parallelism for inference
Day 14
- Whiteboard a complete inference serving architecture (model, batching, caching, routing)
- Explain GPTQ's second-order optimization in detail
- Design a tiered model routing system with quality monitoring
- Calculate memory requirements for a multi-model deployment
Day 21
- Present a 30-minute deep dive on any inference optimization topic
- Critique a given inference architecture and propose improvements
- Explain disaggregated prefill/decode and when it is beneficial
- Derive the roofline model crossover point for batch size
Cross-References
- Transformer Internals - Attention mechanisms, GQA, and Flash Attention fundamentals
- Fine-Tuning - Quantized fine-tuning (QLoRA) and its relationship to inference quantization
- RAG Systems - Serving optimization for RAG pipelines
- Agent Architectures - Inference optimization for multi-step agent loops
- LLM Interview Questions Bank - Additional inference optimization questions
