LLM Inference Optimization

Inference optimization is where engineering meets economics. Every millisecond of latency and every dollar of GPU cost matters at scale. Interviewers at companies like Anthropic, OpenAI, and Google DeepMind expect you to reason fluently about the full inference stack - from attention kernels to fleet-level cost optimization. This page covers the angles they are most likely to probe.

Why Interviewers Care

Interviewer's Perspective

"Inference cost dominates our operating expense. A candidate who understands KV cache mechanics, batching strategies, and quantization trade-offs can save us millions per year. I want to see that you can reason about the full stack - from GPU memory hierarchy to fleet routing."

1. The Inference Bottleneck: Memory-Bound Decoding

Autoregressive Generation Fundamentals

LLM inference has two distinct phases:

Prefill (prompt processing): All input tokens are processed in parallel. This phase is compute-bound - matrix multiplications dominate.
Decode (token generation): Tokens are generated one at a time. Each step requires reading the full model weights but performs very little computation per byte read. This phase is memory-bandwidth-bound.

[Figure: Prefill and decode phases]

Key metrics:

Time to First Token (TTFT): Dominated by prefill time. Critical for interactive applications.
Time per Output Token (TPOT): Dominated by decode step latency. Determines streaming speed.
Throughput: Tokens per second across all concurrent requests. Determines cost efficiency.

60-Second Answer

"LLM inference is memory-bandwidth-bound during decoding because each token generation reads the entire model weights but only produces one token. The arithmetic intensity is extremely low - on the order of 1-2 FLOPs per byte read at small batch sizes. This is why optimizations focus on reducing the bytes read per step (weight quantization, KV cache compression), increasing batch size (continuous batching), and reducing the number of decode steps (speculative decoding)."

The Roofline Model for LLM Inference

The arithmetic intensity of autoregressive decoding for a model with $P$ parameters serving batch size $B$ is approximately:

$\text{Arithmetic Intensity} = \frac{2 \cdot B \cdot P}{2 \cdot P} = B \text{ FLOPs/byte}$

For an A100 GPU with 2 TB/s memory bandwidth and 312 TFLOPS compute, the crossover point is:

$B_{\text{crossover}} = \frac{312 \times 10^{12}}{2 \times 10^{12}} = 156$

This means you need a batch size of roughly 156 before compute becomes the bottleneck. Most real-world serving scenarios operate well below this.
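
The roofline arithmetic above can be reproduced in a few lines. This is a minimal sketch that counts weight reads only (it ignores KV cache and activation traffic) and assumes FP16 weights; the function names are illustrative, not from any library.

```python
def arithmetic_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read in one decode step (weights only)."""
    flops_per_param = 2.0 * batch_size  # one multiply-add per parameter per sequence
    return flops_per_param / bytes_per_param


def crossover_batch_size(peak_flops: float, mem_bandwidth: float,
                         bytes_per_param: float = 2.0) -> float:
    """Batch size at which decode stops being memory-bandwidth-bound."""
    machine_balance = peak_flops / mem_bandwidth  # FLOPs the GPU can do per byte moved
    return machine_balance * bytes_per_param / 2.0


A100_PEAK_FLOPS = 312e12  # FP16 peak quoted in the text
A100_BANDWIDTH = 2e12     # bytes per second quoted in the text

print(arithmetic_intensity(1))
print(crossover_batch_size(A100_PEAK_FLOPS, A100_BANDWIDTH))
```

Lowering `bytes_per_param` (for example 0.5 for 4-bit weights) lowers the crossover batch size proportionally, which is one way to see why quantization helps the memory-bound regime most.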

2. KV Cache

How It Works

During autoregressive generation, the key and value projections from all previous tokens are cached so they do not need to be recomputed at each step.

For a model with:

$L$ layers
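$H_{kv}$ KV heads (equal to the number of attention heads under standard multi-head attention)
$d_h$ head dimension
Sequence length $S$
Batch size $B$

The KV cache size is (the leading factor of 2 covers keys and values):

$\text{KV Cache} = 2 \times L \times H_{kv} \times d_h \times S \times B \times \text{bytes per element}$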

For Llama 2 70B (80 layers, 64 query heads reduced to 8 KV heads by GQA, $d_h = 128$) at FP16 with sequence length 4096:

$2 \times 80 \times 8 \times 128 \times 4096 \times 2 \text{ bytes} \approx 1.34 \text{ GB per sequence}$

Common Trap

Candidates often forget that KV cache grows linearly with sequence length and batch size. With 100 concurrent sequences at 4K context, that is 134 GB just for KV cache - nearly as much as the 140 GB of FP16 model weights, and far more than the 80 GB available on a single A100.
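
The sizing formula is easy to turn into a helper for quick capacity checks. A minimal sketch; the function name and argument order are illustrative, and sizes are reported in decimal GB to match the text.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """KV cache size in bytes; the factor of 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Llama 2 70B figures from the text: 80 layers, 8 KV heads, head dim 128, FP16
per_sequence = kv_cache_bytes(80, 8, 128, 4096)
hundred_sequences = kv_cache_bytes(80, 8, 128, 4096, batch=100)

print(f"{per_sequence / 1e9:.2f} GB per sequence")
print(f"{hundred_sequences / 1e9:.1f} GB for 100 sequences")
```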

Grouped-Query Attention (GQA) for KV Cache Reduction

GQA reduces the number of KV heads relative to query heads. If there are $H_q$ query heads and $H_{kv}$ KV heads, the KV cache shrinks relative to standard multi-head attention by a factor of:

$\text{KV Cache Reduction} = \frac{H_q}{H_{kv}}$

Llama 2 70B uses GQA with 64 query heads and 8 KV heads - an 8x reduction in KV cache size.

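Model	Query Heads	KV Heads	Ratio	KV Cache / 4K seq (FP16)
GPT-3 175B (MHA)	96	96	1:1	~19.3 GB
Llama 2 70B (GQA)	64	8	8:1	~1.34 GB
Llama 3 8B (GQA)	32	8	4:1	~0.54 GB
Falcon 180B (GQA)	232	8	29:1	~0.67 GB

Sizes follow from the KV cache formula using each model's layer count (96, 80, 32, and 80 respectively) and head dimension (128, except 64 for Falcon 180B).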

KV Cache Compression Techniques

Sliding window attention: Only cache the last $W$ tokens (Mistral uses $W = 4096$ ).
Token eviction: Remove low-attention tokens from cache (H2O, ScissorHands).
Quantized KV cache: Store keys and values in INT8 or INT4 instead of FP16.
Cross-layer sharing: Share KV across adjacent layers.
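
Of the techniques listed, sliding window caching is the simplest to sketch. The toy class below only illustrates the eviction behavior with a bounded deque; the class name is hypothetical, and real implementations use a ring buffer of tensors on the GPU rather than Python objects.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keeps keys and values for only the most recent `window` tokens."""

    def __init__(self, window: int):
        # deque(maxlen=...) silently drops the oldest entry when full
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value) -> None:
        self.keys.append(key)
        self.values.append(value)

    def __len__(self) -> int:
        return len(self.keys)


cache = SlidingWindowKVCache(window=4)
for pos in range(10):
    cache.append(f"k{pos}", f"v{pos}")

print(list(cache.keys))  # only the last 4 positions remain
```

Memory stays constant at `window` entries per layer regardless of how long generation runs, at the cost of losing direct attention to older tokens.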

3. Continuous Batching

Static vs. Continuous Batching

[Figure: Static vs. continuous batching]

Static batching pads all sequences to the longest and waits for every request to finish. GPU utilization drops as shorter sequences complete.

Continuous batching (also called iteration-level batching or in-flight batching):

New requests join the batch at every decode step.
Completed requests leave immediately.
The batch size dynamically adjusts.
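
A small simulation makes the utilization gap concrete. This is a sketch under simplifying assumptions: every request needs only decode steps (prefill is ignored), all requests are queued at time zero, and the only limit is a fixed number of batch slots. Function names are illustrative.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Finished sequences leave and waiting ones join at every decode step."""
    waiting = deque(output_lens)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        running = [r - 1 for r in running]       # one decode step for the batch
        running = [r for r in running if r > 0]  # completed requests leave
        steps += 1
    return steps


lens = [2, 8, 3, 8, 2, 3, 8, 2]  # output lengths in tokens
print(static_batching_steps(lens, batch_size=4))
print(continuous_batching_steps(lens, batch_size=4))
```

With these lengths the same 36 tokens take 16 decode steps under static batching and 12 under continuous batching, because freed slots are refilled immediately instead of idling.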

60-Second Answer

"Continuous batching replaces the request-level scheduling of static batching with iteration-level scheduling. At every decode step, finished sequences are evicted and new sequences can join. This dramatically improves GPU utilization because the GPU is never idle waiting for the longest sequence in a batch to finish. vLLM, TGI, and TensorRT-LLM all implement this."

PagedAttention (vLLM)

PagedAttention borrows virtual memory concepts from operating systems:

KV cache is divided into fixed-size blocks (e.g., 16 tokens each).
A block table maps logical KV positions to physical GPU memory blocks.
Blocks are allocated on demand and freed when sequences complete.
No external fragmentation, and internal fragmentation is limited to the unused slots in each sequence's last block.
Enables copy-on-write for beam search and parallel sampling.
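
The block-table idea can be sketched as a toy allocator. This is not vLLM's implementation: the class and method names are hypothetical, it tracks block ids only (no tensors), and it omits copy-on-write and preemption.

```python
class PagedKVAllocator:
    """Maps each sequence's logical token positions to physical KV blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id) -> None:
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real server would preempt")
            table.append(self.free_blocks.pop())  # allocate on demand
        self.seq_lens[seq_id] = length + 1

    def physical_slot(self, seq_id, position: int):
        """Translate a logical position into (physical block id, offset)."""
        block = self.block_tables[seq_id][position // self.block_size]
        return block, position % self.block_size

    def free(self, seq_id) -> None:
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.seq_lens[seq_id]


allocator = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    allocator.append_token("req-0")
print(allocator.block_tables["req-0"])       # two blocks cover 20 tokens
print(allocator.physical_slot("req-0", 17))  # second block, offset 1
```

A 20-token sequence occupies two 16-token blocks, so at most 12 slots are wasted, and the blocks need not be contiguous in physical memory.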

Memory waste comparison:

Strategy	Fragmentation	Memory Utilization
Static pre-allocation	Up to 60-70% wasted	~30-40%
Continuous batching (naive)	Variable	~50-70%
PagedAttention	Near zero (last block only)	~95%+

4. Speculative Decoding

Core Idea

Use a small, fast draft model to generate $K$ candidate tokens, then verify all $K$ tokens in a single forward pass of the large target model. If the target model accepts them all, you get $K$ tokens, plus one bonus token sampled from the target's own output, for approximately the cost of one target model forward pass plus $K$ cheap draft steps.

[Figure: Speculative decoding]

Mathematical Guarantee

Speculative decoding is mathematically lossless - the output distribution is identical to the target model. The acceptance probability for token $i$ is:

$P(\text{accept}_i) = \min\left(1, \frac{p_{\text{target}}(x_i)}{p_{\text{draft}}(x_i)}\right)$

If a draft token is rejected, a correction token is sampled from:

$p_{\text{corrected}}(x) = \text{norm}\left(\max\left(0, p_{\text{target}}(x) - p_{\text{draft}}(x)\right)\right)$
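
The lossless claim can be checked exactly for a single position by summing over both outcomes (accept the draft token, or reject it and sample a correction). The sketch below does that for a toy three-token vocabulary; the distributions are made-up illustrative values.

```python
def speculative_output_distribution(p_target, p_draft):
    """Exact distribution of the token emitted by one accept/reject step."""
    vocab = range(len(p_target))
    # Probability that the draft proposes x AND x is accepted
    accepted = [
        p_draft[x] * min(1.0, p_target[x] / p_draft[x]) if p_draft[x] > 0 else 0.0
        for x in vocab
    ]
    reject_prob = 1.0 - sum(accepted)
    # Residual distribution used to sample the correction token
    residual = [max(0.0, p_target[x] - p_draft[x]) for x in vocab]
    norm = sum(residual)
    corrected = [r / norm if norm > 0 else 0.0 for r in residual]
    return [accepted[x] + reject_prob * corrected[x] for x in vocab]


p_target = [0.5, 0.3, 0.2]  # illustrative target-model probabilities
p_draft = [0.2, 0.6, 0.2]   # illustrative draft-model probabilities

print(speculative_output_distribution(p_target, p_draft))
```

The result matches `p_target` up to floating-point rounding, even though the draft model here disagrees substantially with the target.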

Expected Speedup

If the acceptance rate per token is $\alpha$ and draft length is $K$, the expected number of tokens produced per verification step (accepted draft tokens plus the one token the target model always contributes) is:

$\mathbb{E}[\text{tokens}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}$

For $\alpha = 0.8$ and $K = 5$: $\mathbb{E} \approx 3.7$ tokens per step, yielding roughly a 2.5-3x wall-clock speedup when each draft step costs 5-10% of a target step.
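
The formula and the resulting speedup can be evaluated directly. A minimal sketch, assuming a constant per-token acceptance rate and a fixed cost ratio between one draft step and one target step; `draft_cost_ratio` is an illustrative parameter name, and verification of all $K$ tokens is assumed to cost one target step.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """(1 - alpha^(K+1)) / (1 - alpha): tokens emitted per verification step."""
    if alpha == 1.0:
        return k + 1.0  # every draft token accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speculative_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Speedup over plain decoding; draft_cost_ratio = t_draft / t_target."""
    cycle_cost = k * draft_cost_ratio + 1.0  # K draft steps + 1 target pass
    return expected_tokens_per_step(alpha, k) / cycle_cost


print(expected_tokens_per_step(0.8, 5))
print(speculative_speedup(0.8, 5, draft_cost_ratio=0.05))
print(speculative_speedup(0.8, 5, draft_cost_ratio=0.10))
```

Sweeping `k` with this function shows diminishing returns: past a certain draft length the extra draft steps cost more than the additional accepted tokens are worth.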

Variants

Variant	Draft Source	Notes
Standard speculative	Smaller model of same family	Llama 8B drafts for Llama 70B
Self-speculative	Early-exit from same model	No separate model needed
Medusa	Additional prediction heads	Trained heads predict future tokens
EAGLE	Feature-level draft	Drafts in feature space, not token space
Lookahead	Jacobi iteration	Parallel decoding via fixed-point iteration

Common Trap

Speculative decoding only helps latency, not throughput. If the GPU is already saturated with a large batch, the draft model adds overhead without benefit because the bottleneck shifts to compute rather than memory bandwidth.

5. Quantization

Why Quantize

Quantization reduces the number of bits per parameter, which:

Reduces memory - a 70B model goes from 140 GB (FP16) to 35 GB (INT4).
Increases throughput - fewer bytes to read per decode step.
Enables deployment on consumer hardware - fit larger models on fewer GPUs.
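
The memory figures follow from parameter count times bits per parameter. A quick sketch; it ignores the per-group scales and zero-points that real quantized formats also store, so actual files are somewhat larger.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Weight storage in decimal GB, excluding quantization metadata."""
    return num_params * bits_per_param / 8 / 1e9


for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(70e9, bits):.0f} GB")
```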

Quantization Formats

Weight-Only Quantization

Method	Bits	Calibration	Key Idea
GPTQ	3-4 bit	Yes (small dataset)	Layer-wise quantization via approximate second-order optimization (OBS)
AWQ	4 bit	Yes	Protect salient weight channels; scale before quantizing
GGUF	2-8 bit	No / Yes	CPU-friendly format; used by llama.cpp; per-block quantization
bitsandbytes	4-8 bit	No (on-the-fly)	NF4 data type; double quantization; integrated with HuggingFace
SqueezeLLM	3-4 bit	Yes	Non-uniform quantization; sensitivity-based bit allocation

GPTQ Deep Dive

GPTQ quantizes weights layer by layer by solving:

$\min_{\hat{W}} \| WX - \hat{W}X \|_2^2$

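where $X$ is the layer's input activations collected from a small calibration dataset. It builds on the Optimal Brain Surgeon (OBS) framework: after quantizing each weight, it updates the remaining unquantized weights using second-order (Hessian) information to compensate for the error just introduced. Unlike its predecessor OBQ, which greedily picks the next weight to quantize separately for each row, GPTQ quantizes columns in the same fixed order for every row, which is what lets it scale to models with billions of parameters. An optional "act-order" heuristic sorts columns by activation magnitude first.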

AWQ Deep Dive

AWQ observes that a small fraction of weights (about 1%) are salient - they correspond to large activation magnitudes. Rather than keeping these in higher precision, AWQ applies per-channel scaling:

$\hat{W} = \text{Quantize}(W \cdot s) \cdot \frac{1}{s}$

where $s$ is chosen to minimize quantization error for the salient channels.
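
A toy example shows why scaling before quantizing helps a salient weight. This is only an illustration of the formula above, not the AWQ algorithm: the weights and the scale of 2 are made-up values, the "salient" channel is chosen by hand, and real AWQ searches for $s$ from activation statistics and folds the $1/s$ factor into the preceding operation.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest with a single absmax scale for the group."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in weights) / qmax
    return [round(w / step) * step for w in weights]


def awq_style_quantize(weights, scales, bits=4):
    """Quantize(W * s) * (1 / s), with one scale per input channel."""
    scaled = [w * s for w, s in zip(weights, scales)]
    return [q / s for q, s in zip(quantize_group(scaled, bits), scales)]


weights = [1.0, 0.23, -0.5, 0.07]
scales = [1.0, 2.0, 1.0, 1.0]  # hypothetical: channel 1 is treated as salient

plain = quantize_group(weights)
scaled = awq_style_quantize(weights, scales)

err_plain = abs(plain[1] - weights[1])
err_scaled = abs(scaled[1] - weights[1])
print(err_plain, err_scaled)
```

Because the scaled weight still sits below the group maximum, the quantization step is unchanged, so dividing by $s$ afterwards shrinks the rounding error on that channel.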

60-Second Answer

"GPTQ minimizes layer-wise reconstruction error using calibration data and the OBS framework. AWQ protects salient channels by scaling weights before quantization - it observes that 1% of channels carry disproportionate information. GGUF is a CPU-friendly format used by llama.cpp with per-block quantization. bitsandbytes provides on-the-fly NF4 quantization integrated with HuggingFace, requiring no calibration data."

Weight + Activation Quantization

Method	Weights	Activations	Notes
SmoothQuant	INT8	INT8	Migrates quantization difficulty from activations to weights
FPTQ	INT4	INT8	W4A8 mixed precision
FP8 (H100 native)	FP8	FP8	Hardware-supported on Hopper GPUs

Quantization Quality Comparison

Typical perplexity impact on Llama 2 70B (WikiText-2):

Precision	Perplexity	Model Size	Speed vs FP16
FP16 (baseline)	3.32	140 GB	1.0x
INT8 (bitsandbytes)	3.33	70 GB	1.2-1.5x
INT4 (GPTQ)	3.39	35 GB	1.5-2.0x
INT4 (AWQ)	3.36	35 GB	1.5-2.0x
INT3 (GPTQ)	3.61	26 GB	1.8-2.2x

Instant Rejection

"Quantization just rounds weights to fewer bits." This misses the sophisticated optimization that methods like GPTQ and AWQ perform. GPTQ uses second-order information to minimize reconstruction error. AWQ identifies and protects salient channels. Understanding these details separates senior from junior candidates.

6. Flash Attention for Inference

Review of Flash Attention

Flash Attention avoids materializing the full $S \times S$ attention matrix by computing attention in tiles that fit in GPU SRAM:

Load a block of Q, K, V into SRAM.
Compute local attention scores and output.
Use online softmax to combine blocks without storing the full matrix.

Memory complexity: $O(S)$ instead of $O(S^2)$ .
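
The online softmax step is the part candidates most often hand-wave. The sketch below computes one attention output row block by block, keeping only a running max, a running denominator, and a running weighted sum; values are scalars for brevity (a real kernel carries a vector per query), and the function names are illustrative.

```python
import math


def naive_attention_row(scores, values):
    """Reference: materialize all softmax weights, then reduce."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def online_attention_row(scores, values, block_size=2):
    """Same result, processing keys/values one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0  # softmax denominator so far
    running_out = 0.0  # unnormalized weighted sum of values so far
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first block
        running_sum *= rescale
        running_out *= rescale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(naive_attention_row(scores, values), online_attention_row(scores, values))
```

The two functions agree to floating-point precision, while the online version never holds more than one block of scores at a time.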

Flash Attention in Inference Context

During prefill, Flash Attention provides significant speedups because the full prompt attention matrix would otherwise be materialized.

During decode, each step only computes attention between one new query token and all cached keys/values. Flash Attention still helps by:

Avoiding materializing the $1 \times S$ attention row in HBM.
Fusing the softmax, matmul, and scaling into a single kernel.

FlashDecoding and FlashDecoding++

Standard decode attention is parallelized across batch and heads. FlashDecoding adds parallelism across the KV sequence length dimension:

Split the KV cache into chunks.
Compute partial attention for each chunk in parallel.
Reduce (combine) the partial results using the log-sum-exp trick.

This is critical for long-context inference where the KV sequence can be 100K+ tokens.
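
The three steps above can be sketched for a single query as follows. Each chunk returns its own max, exponential sum, and unnormalized output, and the final reduction rescales them to a common max. Values are scalars for brevity and the function names are illustrative; on a GPU the per-chunk work runs in parallel rather than in a Python loop.

```python
import math


def partial_attention(scores, values):
    """One KV chunk: returns (chunk max, sum of exps, unnormalized output)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return m, sum(exps), sum(e * v for e, v in zip(exps, values))


def combine_partials(partials):
    """Log-sum-exp style reduction: rescale every chunk to the global max."""
    global_max = max(m for m, _, _ in partials)
    denom = sum(l * math.exp(m - global_max) for m, l, _ in partials)
    numer = sum(o * math.exp(m - global_max) for m, _, o in partials)
    return numer / denom


def split_kv_attention(scores, values, num_chunks):
    chunk = math.ceil(len(scores) / num_chunks)
    partials = [
        partial_attention(scores[i:i + chunk], values[i:i + chunk])
        for i in range(0, len(scores), chunk)  # chunks are independent of each other
    ]
    return combine_partials(partials)


scores = [0.1, 1.5, -0.7, 2.2, 0.3, -1.1, 0.9]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(split_kv_attention(scores, values, num_chunks=3))
```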

7. Model Serving Frameworks

Framework Comparison

Feature	vLLM	TensorRT-LLM	TGI	Ollama
PagedAttention	Yes (inventor)	Yes	Yes	Via llama.cpp
Continuous Batching	Yes	Yes	Yes	Limited
Speculative Decoding	Yes	Yes	Yes	No
Quantization	GPTQ, AWQ, FP8	INT4, INT8, FP8	GPTQ, AWQ, bitsandbytes	GGUF (all levels)
Tensor Parallelism	Yes	Yes	Yes	No
Pipeline Parallelism	Limited	Yes	No	No
Hardware	NVIDIA, AMD, TPU	NVIDIA only	NVIDIA, AMD	CPU, NVIDIA, Apple Silicon
Best For	General GPU serving	Max throughput NVIDIA	HuggingFace ecosystem	Local / edge deployment
API	OpenAI-compatible	Triton + OpenAI	REST + gRPC	REST (OpenAI-compatible)

vLLM Architecture

Key vLLM features:

Preemption policies: When GPU memory is full, vLLM can either swap KV cache to CPU or recompute it later.
Prefix caching: Shared system prompts are cached and reused across requests.
Chunked prefill: Long prompts are split into chunks to avoid blocking decode steps.

TensorRT-LLM

TensorRT-LLM compiles models into optimized TensorRT engines:

Kernel fusion: Combines multiple operations into single GPU kernels.
In-flight batching: NVIDIA's implementation of continuous batching.
FP8 on Hopper: Native FP8 support for H100 GPUs with near-zero accuracy loss.
KV cache quantization: INT8 KV cache with minimal quality impact.

Company Variation

NVIDIA/GPU-heavy companies expect deep TensorRT-LLM knowledge. Startups typically use vLLM or TGI for faster iteration. Apple/edge companies care about GGUF and on-device inference. Google will ask about TPU-specific optimizations (Pallas, XLA).

Ollama and Local Inference

Ollama wraps llama.cpp for local deployment:

Uses GGUF format with various quantization levels (Q2_K through Q8_0).
Supports Apple Metal, CUDA, and CPU inference.
Modelfile system for customizing system prompts and parameters.
Growing ecosystem for local AI applications.

8. Latency vs. Throughput Optimization

The Fundamental Trade-off

[Figure: Latency vs. throughput optimization]

Optimization	Latency Impact	Throughput Impact
Increase batch size	Worse (more memory pressure)	Better (amortize weight reads)
Speculative decoding	Better (fewer steps)	Neutral to worse
Quantization	Better (less memory to read)	Better (fit larger batches)
Tensor parallelism	Better (split across GPUs)	Neutral (communication overhead)
Continuous batching	Neutral	Much better
Flash Attention	Better (prefill)	Better (memory savings)
Prefix caching	Better TTFT	Better (avoid recomputation)

Latency Breakdown

For a typical LLM serving request:

Total Latency = TTFT + (num_output_tokens × TPOT)

TTFT = Network latency + Queue wait + Prefill time
TPOT = Decode step time (model forward + sampling + scheduling overhead)

Typical numbers for Llama 2 70B on 2x A100:

TTFT: 200-500ms (depending on prompt length)
TPOT: 30-50ms per token
For 200 output tokens: 6-10 seconds total
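
The breakdown is simple enough to keep as a helper when reasoning about SLOs. A minimal sketch using the ranges quoted above; the function name is illustrative.

```python
def total_latency_ms(ttft_ms: float, tpot_ms: float, num_output_tokens: int) -> float:
    """Total Latency = TTFT + (num_output_tokens x TPOT)."""
    return ttft_ms + num_output_tokens * tpot_ms


best_case = total_latency_ms(ttft_ms=200, tpot_ms=30, num_output_tokens=200)
worst_case = total_latency_ms(ttft_ms=500, tpot_ms=50, num_output_tokens=200)
print(best_case / 1000, worst_case / 1000)  # seconds
```

For long outputs the TPOT term dominates, which is why decode-side optimizations matter more than prefill speedups for generation-heavy workloads.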

Interviewer's Perspective

"I want candidates to decompose latency into TTFT and TPOT, explain which optimizations target which metric, and understand the latency-throughput trade-off. Bonus points for discussing how to set SLOs (e.g., p50 TTFT under 500ms, p99 TPOT under 100ms) and how to achieve them."

9. Cost Optimization Strategies

Semantic Caching

Cache LLM responses for semantically similar queries:

Embed the incoming query.
Search a vector store for similar past queries (cosine similarity above threshold, e.g., 0.95).
If a match is found, return the cached response.
Otherwise, call the LLM and cache the result.
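
The four steps can be sketched as a small class. This is a toy under stated assumptions: `embed` and `call_llm` are hypothetical callables supplied by the caller, and a linear scan over a Python list stands in for a real vector store.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # hypothetical callable: str -> list[float]
        self.threshold = threshold
        self.entries = []           # (embedding, response) pairs

    def get_or_call(self, query, call_llm):
        q = self.embed(query)
        # Find the most similar past query (linear scan for illustration)
        best = max(self.entries, key=lambda e: cosine_similarity(q, e[0]), default=None)
        if best is not None and cosine_similarity(q, best[0]) >= self.threshold:
            return best[1]          # cache hit: no LLM call
        response = call_llm(query)  # cache miss: call the model and store the result
        self.entries.append((q, response))
        return response
```

The threshold is the main tuning knob: set it too low and users get answers to a different question, too high and the hit rate collapses. Production systems usually also add a TTL so stale answers expire.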

Savings: Can reduce LLM calls by 20-40% for applications with repetitive queries (customer support, FAQ bots).

Prompt Caching (Provider-Level)

Anthropic and OpenAI offer prompt caching where repeated prefixes are cached server-side:

Only the new portion of the prompt incurs full cost.
Cache hits are billed at reduced rates (e.g., 90% discount on cached tokens with Anthropic).
Critical for RAG systems with large, repeated system prompts.

Tiered Model Routing

Router implementation approaches:

Keyword/rule-based: Simple, fast, but brittle.
Classifier-based: Train a small model to predict which LLM tier is needed.
Embedding similarity: Route based on query similarity to examples requiring different tiers.
Adaptive: Start with a small model; escalate if confidence is low.

Cost savings example:

Without routing: 100% of queries to GPT-4o at $5/M tokens = $5.00 per M tokens.
With routing: 60% to mini ($0.09) + 30% to 4o ($1.50) + 10% to 4.5 ($1.50) = $3.09 per M tokens. A 38% reduction.
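
The blended cost is a weighted average of per-tier prices. In the sketch below, the per-tier prices of $0.15, $5.00, and $15.00 per million tokens are inferred from the per-tier contributions in the example rather than quoted from any price list, and every tier is assumed to see the same number of tokens per query.

```python
def blended_cost_per_m(traffic_mix):
    """traffic_mix: list of (share_of_queries, price_per_million_tokens)."""
    total_share = sum(share for share, _ in traffic_mix)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * price for share, price in traffic_mix)


baseline = blended_cost_per_m([(1.0, 5.00)])
routed = blended_cost_per_m([(0.60, 0.15), (0.30, 5.00), (0.10, 15.00)])
reduction = 1 - routed / baseline

print(f"${routed:.2f} per M tokens, {reduction:.0%} reduction")
```

Note how sensitive the result is to the top tier: sending 10% of traffic to a model three times the baseline price costs as much as the 30% that stays on the baseline model.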

Other Cost Strategies

Strategy	Savings	Complexity
Prompt compression	30-50% on input tokens	Low
Response length limits	Variable	Low
Semantic caching	20-40% of calls	Medium
Tiered routing	30-50%	Medium
Self-hosted vs. API	50-80% at scale	High
Spot instances	60-70% on GPU cost	High

Company Variation

Startups obsess over API costs and favor routing + caching. Large companies focus on self-hosted inference optimization (vLLM tuning, quantization). AI labs care about training-inference co-optimization and fleet utilization.

10. Advanced Topics

Tensor Parallelism vs. Pipeline Parallelism

Tensor Parallelism (TP): Split individual layers across GPUs. Each GPU holds a slice of every layer.

Best for latency reduction.
Requires high-bandwidth interconnect (NVLink).
Typical: 2-8 GPUs within a single node.

Pipeline Parallelism (PP): Split layers sequentially across GPUs. Each GPU holds a contiguous group of layers.

Best for throughput (pipeline different micro-batches).
Works with lower-bandwidth interconnect.
Typical: across nodes.

For a 70B model:

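2 GPU, TP=2: Each GPU holds half of every layer. Per-token latency is roughly halved, minus the all-reduce communication overhead.
2 GPU, PP=2: GPU 1 holds layers 0-39, GPU 2 holds layers 40-79. A single request still passes through every layer in sequence, so its latency is similar to 1 GPU (plus pipeline bubbles), but throughput can approach 2x when both stages are kept busy with different micro-batches.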

Disaggregated Prefill and Decode

Separate prefill and decode into different GPU pools:

Prefill pool: Optimized for compute (high batch size, large matrices).
Decode pool: Optimized for memory bandwidth (continuous batching, long KV cache).

This allows different GPU types and configurations for each phase. It is an emerging approach explored in systems such as DistServe, Splitwise, and Mooncake.

Structured Output Optimization

When generating JSON or structured output:

Constrained decoding: Mask logits to only allow valid tokens according to a grammar/schema. Eliminates retries.
Outlines / jsonformer: Libraries that enforce output structure during generation.
Batched structured generation: vLLM supports grammar-guided generation at batch level.
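
Constrained decoding reduces to masking logits before token selection. The sketch below uses a made-up five-token vocabulary and a hand-written allowed set; in a real system the allowed set comes from a grammar or schema state machine, which libraries like those above maintain for you.

```python
import math


def constrained_argmax(logits, allowed_token_ids):
    """Greedy pick after setting every disallowed token's logit to -inf."""
    masked = [
        logit if i in allowed_token_ids else -math.inf
        for i, logit in enumerate(logits)
    ]
    return max(range(len(masked)), key=masked.__getitem__)


vocab = ["{", "}", '"name"', ":", "hello"]  # toy vocabulary
logits = [0.1, 0.3, 1.2, 0.2, 2.5]          # the model prefers "hello"
allowed = {0}                               # hypothetical grammar state: JSON must open with "{"

unconstrained = max(range(len(logits)), key=logits.__getitem__)
constrained = constrained_argmax(logits, allowed)
print(vocab[unconstrained], vocab[constrained])
```

Because invalid tokens can never be sampled, the output always parses, which is what removes the retry loop.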

Practice Problems

Problem 1: KV Cache Memory Planning

You are deploying Llama 3 70B (80 layers, 8 KV heads, head dim 128) on 4x A100 80GB GPUs with tensor parallelism. The target is serving 64 concurrent requests at up to 8K context length with FP16 KV cache. Will the KV cache fit?

Hint 1 - Direction

Calculate the total KV cache memory for 64 concurrent sequences at 8K length. Then calculate available memory after loading the model weights.

Hint 2 - Insight

KV cache per sequence = $2 \times 80 \times 8 \times 128 \times 8192 \times 2$ bytes. Model weights at FP16 = 140 GB. With TP=4, each GPU holds 35 GB of weights, leaving 45 GB for KV cache.

Full Solution + Rubric

KV cache per sequence: $2 \times 80 \times 8 \times 128 \times 8192 \times 2 \text{ bytes} \approx 2.684 \text{ GB}$

Total KV cache for 64 sequences: $64 \times 2.684 \approx 171.8 \text{ GB}$

With TP=4, the 8 KV heads are sharded two per GPU, so the KV cache is also split across GPUs: $\frac{171.8}{4} \approx 42.9 \text{ GB per GPU}$

Available per GPU: 80 GB - 35 GB (weights) - 2 GB (overhead) = 43 GB.

Answer: It barely fits, with less than 0.1 GB of headroom, which is too tight. Solutions:

Use INT8 KV cache to halve KV memory (about 21.5 GB per GPU).
Reduce max concurrent requests to 48.
Use INT4 model weights (GPTQ) to free more memory.
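
The solution can be checked, and the proposed fixes compared, with a short script. A sketch using the problem's numbers as defaults; the 2 GB overhead is the same rough allowance used above, and the function name is illustrative.

```python
def kv_fit_check(layers, kv_heads, head_dim, seq_len, num_seqs, tp,
                 gpu_gb=80.0, weights_gb=140.0, overhead_gb=2.0, kv_bytes=2):
    """Returns (KV cache GB per GPU, KV budget GB per GPU, fits?)."""
    kv_total_gb = 2 * layers * kv_heads * head_dim * seq_len * num_seqs * kv_bytes / 1e9
    kv_per_gpu = kv_total_gb / tp                    # KV heads are sharded across TP ranks
    budget = gpu_gb - weights_gb / tp - overhead_gb  # what is left after weights
    return kv_per_gpu, budget, kv_per_gpu <= budget


print(kv_fit_check(80, 8, 128, 8192, num_seqs=64, tp=4))              # FP16 KV cache
print(kv_fit_check(80, 8, 128, 8192, num_seqs=64, tp=4, kv_bytes=1))  # INT8 KV cache
print(kv_fit_check(80, 8, 128, 8192, num_seqs=48, tp=4))              # fewer concurrent requests
```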

Scoring:

Strong Hire: Correctly calculates KV cache, accounts for TP splitting, identifies the tight fit, and proposes solutions with trade-off analysis.
Lean Hire: Gets the calculation right but misses TP splitting or does not propose solutions.
No Hire: Cannot set up the calculation or significantly miscalculates.

Problem 2: Speculative Decoding ROI

Your 70B model runs at 40ms per token. You have a 7B draft model that runs at 5ms per token. The acceptance rate is 0.75 with draft length K=4. What is the effective tokens-per-second rate with speculative decoding? Is it worth deploying?

Hint 1 - Direction

Calculate the expected accepted tokens per verification cycle, then the total time per cycle.

Hint 2 - Insight

Expected accepted tokens = $\frac{1 - 0.75^5}{1 - 0.75}$ . Time per cycle = $K \times t_{\text{draft}} + t_{\text{target}}$ .

Full Solution + Rubric

Expected accepted tokens per cycle: $\mathbb{E} = \frac{1 - 0.75^5}{1 - 0.75} = \frac{1 - 0.237}{0.25} = \frac{0.763}{0.25} = 3.05$

Time per cycle: $t_{\text{cycle}} = 4 \times 5\text{ms} + 40\text{ms} = 60\text{ms}$

Effective tokens per second: $\frac{3.05}{0.060} \approx 50.8 \text{ tokens/sec}$

Without speculative decoding: $\frac{1}{0.040} = 25 \text{ tokens/sec}$

Speedup: $\frac{50.8}{25} \approx 2.03\times$

Is it worth it? Yes for latency-sensitive single-request scenarios. But consider:

The draft model consumes GPU memory (about 14 GB for 7B at FP16).
In high-throughput scenarios, the extra memory for the draft model could be used for larger batches instead.
Acceptance rate depends on query distribution - code completion typically has higher acceptance than creative writing.

Scoring:

Strong Hire: Correct calculation, discusses when speculative decoding is beneficial vs. not, mentions memory trade-off.
Lean Hire: Correct calculation but limited discussion of trade-offs.
No Hire: Cannot apply the expected value formula or makes fundamental errors.

Problem 3: Cost Optimization Architecture

You are spending $50K/month on OpenAI API calls for a customer support chatbot handling 1M conversations/month. Design an optimization strategy to reduce costs by 50% without degrading quality.

Hint 1 - Direction

Think about which conversations are simple (FAQ-like) vs. complex. Consider caching, routing, and prompt optimization.

Hint 2 - Insight

Layer your approach: semantic caching for repeated questions, tiered routing for complexity, prompt compression for all queries. Measure quality with automated evals before and after.

Full Solution + Rubric

Architecture:

Semantic caching (saves ~25%): Embed queries, cache responses for similarity above 0.95. Customer support has high query repetition.
Tiered routing (saves ~35% of remaining):
- Train a classifier on historical conversations.
- Route the ~50% of queries that are simple (password reset, order status) to GPT-4o-mini.
- Route the ~40% that are medium complexity to GPT-4o.
- Route only the ~10% that are complex (complaints, escalations) to the most capable model.
Prompt optimization (saves ~20% of remaining):
- Compress system prompt from 2000 to 800 tokens (remove examples the model already handles).
- Use Anthropic/OpenAI prompt caching for shared prefixes.
- Limit response length with stop sequences.
Quality assurance:
- A/B test with human evaluation on 1% of traffic.
- Automated quality scoring (LLM-as-judge) on all responses.
- Escalation rate monitoring per tier.

Projected savings: 25% + (75% x 35%) + (48.75% x 20%) = 25% + 26.25% + 9.75% = approximately 61% reduction, assuming the three estimates hold and compound independently.
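
Layered savings compound on the remaining spend rather than adding up. A minimal sketch of that arithmetic, assuming each stage's percentage applies to whatever cost is left after the previous stages; the function name is illustrative.

```python
def compounded_savings(stage_savings):
    """Total fractional saving when each stage applies to the remaining cost."""
    remaining = 1.0
    for saving in stage_savings:
        remaining *= 1.0 - saving
    return 1.0 - remaining


# Caching, then routing, then prompt optimization, as in the solution above
total = compounded_savings([0.25, 0.35, 0.20])
print(f"{total:.0%} total reduction")
print(f"${50_000 * (1 - total):,.0f} projected monthly spend")
```

Simply summing the three percentages would overstate the result at 80%; compounding gives a more defensible figure to quote in an interview.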

Scoring:

Strong Hire: Proposes a layered strategy with specific savings estimates, quality monitoring, and rollout plan.
Lean Hire: Mentions routing and caching but lacks specificity or quality safeguards.
No Hire: Only suggests "use a cheaper model" without architectural thinking.

Interview Cheat Sheet

Topic	Key Fact	Typical Question
KV Cache	$2 \times L \times H_{kv} \times d_h \times S \times B \times \text{bytes}$	"How much memory does the KV cache use?"
GQA	Reduces KV heads; Llama 2 70B uses 8:1 ratio	"How does GQA reduce memory?"
Continuous Batching	Iteration-level scheduling; evict/add per step	"Why is continuous batching better than static?"
PagedAttention	OS-style virtual memory for KV cache; ~95% utilization	"How does vLLM manage KV cache memory?"
Speculative Decoding	Mathematically lossless; uses $\min(1, p_t/p_d)$ acceptance	"Does speculative decoding change the output distribution?"
GPTQ	Layer-wise OBS; calibration data needed	"How does GPTQ differ from naive rounding?"
AWQ	Protects salient channels via scaling	"Why does AWQ outperform GPTQ at the same bit width?"
Flash Attention	Tiled computation; $O(S)$ memory	"Why is Flash Attention faster despite more FLOPs?"
Latency vs Throughput	Batch size is the key lever	"How do you optimize for latency vs throughput?"
Cost Optimization	Caching + routing + compression	"How would you cut inference costs by 50%?"

Spaced Repetition Checkpoints

Day 0 (Today)

Explain why autoregressive decoding is memory-bandwidth-bound
Calculate KV cache size for a given model configuration
Describe continuous batching vs. static batching
List 4 quantization methods and their key differences

Day 3

Derive the speculative decoding acceptance probability
Explain PagedAttention and its memory utilization benefit
Compare vLLM, TensorRT-LLM, TGI, and Ollama
Calculate expected speedup from speculative decoding

Day 7

Design a cost optimization strategy for a $50K/month LLM deployment
Explain the latency-throughput trade-off with specific optimizations
Describe FlashDecoding and why it matters for long-context inference
Compare tensor parallelism vs. pipeline parallelism for inference

Day 14

Whiteboard a complete inference serving architecture (model, batching, caching, routing)
Explain GPTQ's second-order optimization in detail
Design a tiered model routing system with quality monitoring
Calculate memory requirements for a multi-model deployment

Day 21

Present a 30-minute deep dive on any inference optimization topic
Critique a given inference architecture and propose improvements
Explain disaggregated prefill/decode and when it is beneficial
Derive the roofline model crossover point for batch size

Cross-References

Transformer Internals - Attention mechanisms, GQA, and Flash Attention fundamentals
Fine-Tuning - Quantized fine-tuning (QLoRA) and its relationship to inference quantization
RAG Systems - Serving optimization for RAG pipelines
Agent Architectures - Inference optimization for multi-step agent loops
LLM Interview Questions Bank - Additional inference optimization questions
