LLM Inference Optimization

Inference optimization is where engineering meets economics. Every millisecond of latency and every dollar of GPU cost matters at scale. Interviewers at companies like Anthropic, OpenAI, and Google DeepMind expect you to reason fluently about the full inference stack - from attention kernels to fleet-level cost optimization. This page covers every angle they will probe.

Why Interviewers Care

Interviewer's Perspective

"Inference cost dominates our operating expense. A candidate who understands KV cache mechanics, batching strategies, and quantization trade-offs can save us millions per year. I want to see that you can reason about the full stack - from GPU memory hierarchy to fleet routing."

1. The Inference Bottleneck: Memory-Bound Decoding

Autoregressive Generation Fundamentals

LLM inference has two distinct phases:

  1. Prefill (prompt processing): All input tokens are processed in parallel. This phase is compute-bound - matrix multiplications dominate.
  2. Decode (token generation): Tokens are generated one at a time. Each step requires reading the full model weights but performs very little computation per byte read. This phase is memory-bandwidth-bound.

[Figure: Prefill and decode phases]

Key metrics:

  • Time to First Token (TTFT): Dominated by prefill time. Critical for interactive applications.
  • Time per Output Token (TPOT): Dominated by decode step latency. Determines streaming speed.
  • Throughput: Tokens per second across all concurrent requests. Determines cost efficiency.

60-Second Answer

"LLM inference is memory-bandwidth-bound during decoding because each token generation reads the entire model weights but only produces one token. The arithmetic intensity is extremely low - on the order of 1-2 FLOPs per byte read. This is why optimizations focus on reducing memory reads (quantization, KV cache), increasing batch size (continuous batching), and reducing the number of decode steps (speculative decoding)."

The Roofline Model for LLM Inference

The arithmetic intensity of autoregressive decoding for a model with $P$ parameters serving batch size $B$ is approximately (2 FLOPs per parameter per sequence, 2 bytes read per FP16 parameter):

$$\text{Arithmetic Intensity} = \frac{2 \cdot B \cdot P}{2 \cdot P} = B \text{ FLOPs/byte}$$

For an A100 GPU with 2 TB/s memory bandwidth and 312 TFLOPS compute, the crossover point is:

$$B_{\text{crossover}} = \frac{312 \times 10^{12}}{2 \times 10^{12}} = 156$$

This means you need a batch size of roughly 156 before compute becomes the bottleneck. Most real-world serving scenarios operate well below this.
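The crossover arithmetic can be checked with a few lines of Python. This is a minimal sketch under the same simplifications as the formulas above (FP16 weights, KV cache traffic ignored); the function names are illustrative, not taken from any library.

```python
def arithmetic_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read in one decode step (KV cache reads ignored)."""
    flops_per_param = 2.0 * batch_size  # one multiply-add per parameter per sequence
    return flops_per_param / bytes_per_param


def crossover_batch_size(peak_flops: float, mem_bandwidth: float,
                         bytes_per_param: float = 2.0) -> float:
    """Batch size at which decoding stops being memory-bandwidth-bound."""
    machine_balance = peak_flops / mem_bandwidth  # FLOPs the GPU can do per byte read
    return machine_balance * bytes_per_param / 2.0


# A100 figures quoted above: 312 TFLOPS, 2 TB/s
print(crossover_batch_size(peak_flops=312e12, mem_bandwidth=2e12))  # 156.0
```

Lower-precision weights shrink `bytes_per_param`, which lowers the crossover batch size in this simplified model.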

2. KV Cache

How It Works

During autoregressive generation, the key and value projections from all previous tokens are cached so they do not need to be recomputed at each step.

For a model with:

  • $L$ layers
  • $H$ attention heads (the number of KV heads when GQA or MQA is used)
  • $d_h$ head dimension
  • Sequence length $S$
  • Batch size $B$

The KV cache size is (the leading 2 accounts for keys and values):

$$\text{KV Cache} = 2 \times L \times H \times d_h \times S \times B \times \text{bytes per element}$$

For Llama 2 70B (80 layers, 64 query heads reduced to 8 KV heads by GQA, $d_h = 128$) at FP16 with sequence length 4096:

$$2 \times 80 \times 8 \times 128 \times 4096 \times 2 \text{ bytes} \approx 1.34 \text{ GB per sequence}$$
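The same calculation as a small helper, a sketch whose function and parameter names are illustrative only:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """KV cache size in bytes; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Llama 2 70B configuration from the text, FP16, 4096 tokens
per_sequence = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096)
print(f"{per_sequence / 1e9:.2f} GB")  # 1.34 GB
```

Passing `batch=100` scales the result linearly to about 134 GB.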

Common Trap

Candidates often forget that KV cache grows linearly with sequence length and batch size. With 100 concurrent sequences at 4K context, that is 134 GB just for KV cache - nearly as much as the 140 GB of FP16 model weights, and far more than the 80 GB available on a single A100.

Grouped-Query Attention (GQA) for KV Cache Reduction

GQA reduces the number of KV heads relative to query heads. If there are $H_q$ query heads and $H_{kv}$ KV heads, the cache shrinks to the following fraction of its multi-head attention (MHA) size:

$$\text{KV cache size relative to MHA} = \frac{H_{kv}}{H_q}$$

Llama 2 70B uses GQA with 64 query heads and 8 KV heads - an 8x reduction in KV cache size.

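| Model | Query Heads | KV Heads | Ratio | KV Cache / 4K seq (FP16) |
|---|---|---|---|---|
| GPT-3 175B (MHA) | 96 | 96 | 1:1 | ~19.3 GB |
| Llama 2 70B (GQA) | 64 | 8 | 8:1 | ~1.34 GB |
| Llama 3 8B (GQA) | 32 | 8 | 4:1 | ~0.54 GB |
| Falcon 7B (MQA) | 71 | 1 | 71:1 | ~0.03 GB |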

KV Cache Compression Techniques

  1. Sliding window attention: Only cache the last $W$ tokens (Mistral uses $W = 4096$).
  2. Token eviction: Remove low-attention tokens from cache (H2O, ScissorHands).
  3. Quantized KV cache: Store keys and values in INT8 or INT4 instead of FP16.
  4. Cross-layer sharing: Share KV across adjacent layers.

3. Continuous Batching

Static vs. Continuous Batching

[Figure: Static vs. continuous batching]

Static batching pads all sequences to the longest and waits for every request to finish. GPU utilization drops as shorter sequences complete.

Continuous batching (also called iteration-level batching or in-flight batching):

  • New requests join the batch at every decode step.
  • Completed requests leave immediately.
  • The batch size dynamically adjusts.
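The scheduling difference can be illustrated with a toy simulation. This is a sketch under simplifying assumptions (every decode step costs the same regardless of batch size, prefill is ignored, and each request is described only by its output length); it is not how any particular serving engine is implemented.

```python
from collections import deque


def static_batching_steps(output_lengths, slots):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lengths), slots):
        steps += max(output_lengths[i:i + slots])
    return steps


def continuous_batching_steps(output_lengths, slots):
    """Free slots are refilled from the queue before every decode step."""
    waiting = deque(output_lengths)
    running = []  # tokens still to generate for each active sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        steps += 1  # one decode step produces one token per running sequence
        running = [r - 1 for r in running if r > 1]  # finished sequences leave
    return steps


lengths = [10, 2, 3, 1]
print(static_batching_steps(lengths, slots=2))      # 13
print(continuous_batching_steps(lengths, slots=2))  # 10
```

In this example the short requests slip into the slot freed next to the long one, so the total step count is set by the longest request rather than by the sum of per-batch maxima.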

60-Second Answer

"Continuous batching replaces the request-level scheduling of static batching with iteration-level scheduling. At every decode step, finished sequences are evicted and new sequences can join. This dramatically improves GPU utilization because the GPU is never idle waiting for the longest sequence in a batch to finish. vLLM, TGI, and TensorRT-LLM all implement this."

PagedAttention (vLLM)

PagedAttention borrows virtual memory concepts from operating systems:

  • KV cache is divided into fixed-size blocks (e.g., 16 tokens each).
  • A block table maps logical KV positions to physical GPU memory blocks.
  • Blocks are allocated on demand and freed when sequences complete.
  • No external fragmentation, and internal fragmentation is limited to the last block of each sequence, which may have unused slots.
  • Enables copy-on-write for beam search and parallel sampling.
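The block-table bookkeeping described above can be sketched in a few lines. The class below is a hypothetical illustration of the idea (on-demand allocation, logical-to-physical mapping, freeing on completion); it is not vLLM's actual data structure and it omits copy-on-write.

```python
class PagedKVAllocator:
    """Toy block table: maps each sequence's token positions to physical blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a sequence must be preempted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def physical_slot(self, seq_id, position):
        """Translate a logical token position into (physical block, offset)."""
        block = self.block_tables[seq_id][position // self.block_size]
        return block, position % self.block_size

    def free(self, seq_id):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("request-a")
print(len(alloc.block_tables["request-a"]))  # 2 blocks; the second holds one token
```

Only the final block of a sequence is partially filled, which is where the small remaining waste comes from.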

Memory waste comparison:

| Strategy | Fragmentation | Memory Utilization |
|---|---|---|
| Static pre-allocation | Up to 60-70% wasted | ~30-40% |
| Continuous batching (naive) | Variable | ~50-70% |
| PagedAttention | Near zero (last block only) | ~95%+ |

4. Speculative Decoding

Core Idea

Use a small, fast draft model to generate $K$ candidate tokens, then verify all $K$ tokens in a single forward pass of the large target model. If the target model agrees with the draft tokens, you get $K$ tokens for approximately the cost of one target model forward pass.

[Figure: Speculative decoding]

Mathematical Guarantee

Speculative decoding is mathematically lossless - the output distribution is identical to the target model. The acceptance probability for token $i$ is:

$$P(\text{accept}_i) = \min\left(1, \frac{p_{\text{target}}(x_i)}{p_{\text{draft}}(x_i)}\right)$$

If a draft token is rejected, a correction token is sampled from:

$$p_{\text{corrected}}(x) = \text{norm}\left(\max\left(0, p_{\text{target}}(x) - p_{\text{draft}}(x)\right)\right)$$
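The acceptance rule and the correction distribution translate directly into code. The sketch below works on plain probability lists over a toy three-token vocabulary; the function names and example distributions are assumptions for illustration.

```python
def accept_probability(p_target, p_draft, token):
    """Probability of keeping a draft token: min(1, p_target / p_draft)."""
    return min(1.0, p_target[token] / p_draft[token])


def corrected_distribution(p_target, p_draft):
    """Distribution to sample from after a rejection: norm(max(0, p_target - p_draft))."""
    residual = [max(0.0, t - d) for t, d in zip(p_target, p_draft)]
    total = sum(residual)
    if total == 0.0:  # identical distributions: rejection cannot occur
        return list(p_target)
    return [r / total for r in residual]


p_target = [0.5, 0.3, 0.2]
p_draft = [0.2, 0.6, 0.2]
print(accept_probability(p_target, p_draft, token=0))  # 1.0, target likes it more
print(accept_probability(p_target, p_draft, token=1))  # 0.5, draft was overconfident
print(corrected_distribution(p_target, p_draft))       # [1.0, 0.0, 0.0]
```

Tokens the draft under-weights are always accepted; tokens it over-weights are accepted only part of the time, and the correction step puts the missing probability mass back where the target wanted it.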

Expected Speedup

If each draft token is accepted independently with probability $\alpha$ and the draft length is $K$, the expected number of tokens produced per verification step (accepted draft tokens plus the one token the target model always contributes) is:

$$\mathbb{E}[\text{tokens}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$

For $\alpha = 0.8$ and $K = 5$: $\mathbb{E} \approx 3.7$ tokens per step, yielding roughly a 2.5-3x wall-clock speedup when the draft model is 10-20x cheaper per token than the target.
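A short helper makes it easy to explore other settings. This sketch assumes independent per-token acceptance and a draft model whose per-token cost is a fixed fraction of the target's; the values used ($\alpha = 0.75$, $K = 4$, cost ratio 0.125) are illustrative assumptions.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens produced per verification step: (1 - alpha^(K+1)) / (1 - alpha)."""
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def expected_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Speedup over plain decoding; one cycle costs K draft steps plus one target step."""
    cycle_cost = k * draft_cost_ratio + 1.0  # measured in target forward passes
    return expected_tokens_per_step(alpha, k) / cycle_cost


print(round(expected_tokens_per_step(0.75, 4), 2))                  # 3.05
print(round(expected_speedup(0.75, 4, draft_cost_ratio=0.125), 2))  # 2.03
```

Raising $K$ helps only while the extra draft cost is smaller than the extra expected tokens, which is why the best draft length depends on the acceptance rate.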

Variants

| Variant | Draft Source | Notes |
|---|---|---|
| Standard speculative | Smaller model of same family | Llama 8B drafts for Llama 70B |
| Self-speculative | Early-exit from same model | No separate model needed |
| Medusa | Additional prediction heads | Trained heads predict future tokens |
| EAGLE | Feature-level draft | Drafts in feature space, not token space |
| Lookahead | Jacobi iteration | Parallel decoding via fixed-point iteration |

Common Trap

Speculative decoding only helps latency, not throughput. If the GPU is already saturated with a large batch, the draft model adds overhead without benefit because the bottleneck shifts to compute rather than memory bandwidth.

5. Quantization

Why Quantize

Quantization reduces the number of bits per parameter, which:

  1. Reduces memory - a 70B model goes from 140 GB (FP16) to 35 GB (INT4).
  2. Increases throughput - fewer bytes to read per decode step.
  3. Enables deployment on consumer hardware - fit larger models on fewer GPUs.
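The memory and bandwidth effects are easy to quantify. The sketch below ignores quantization metadata (scales, zero points) and assumes decode time is bounded below by reading every weight once per step; the 2 TB/s bandwidth is the A100-class figure used earlier and is an assumption here.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB, ignoring scales and zero points."""
    return num_params * bits_per_weight / 8 / 1e9


def min_decode_step_ms(weight_gb: float, bandwidth_tb_per_s: float) -> float:
    """Lower bound on one decode step: time to stream all weights from memory once."""
    return weight_gb / (bandwidth_tb_per_s * 1000) * 1000


for bits in (16, 8, 4):
    size = weight_memory_gb(70e9, bits)
    print(bits, size, min_decode_step_ms(size, bandwidth_tb_per_s=2.0))
# 16-bit: 140 GB, about 70 ms per step; 4-bit: 35 GB, about 17.5 ms per step
```

Real speedups are smaller than this bound suggests because dequantization and KV cache reads add work per step.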

Quantization Formats

Weight-Only Quantization

| Method | Bits | Calibration | Key Idea |
|---|---|---|---|
| GPTQ | 3-4 bit | Yes (small dataset) | Layer-wise quantization via approximate second-order optimization (OBS) |
| AWQ | 4 bit | Yes | Protect salient weight channels; scale before quantizing |
| GGUF | 2-8 bit | No / Yes | CPU-friendly format; used by llama.cpp; per-block quantization |
| bitsandbytes | 4-8 bit | No (on-the-fly) | NF4 data type; double quantization; integrated with HuggingFace |
| SqueezeLLM | 3-4 bit | Yes | Non-uniform quantization; sensitivity-based bit allocation |

GPTQ Deep Dive

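GPTQ quantizes weights layer by layer by solving:

$$\min_{\hat{W}} \| WX - \hat{W}X \|_2^2$$

where $X$ is the layer's input activations collected from a small calibration dataset. It builds on the Optimal Brain Surgeon (OBS) framework: after each weight is quantized, the remaining unquantized weights are updated using second-order (Hessian) information to compensate for the error. Quantization order matters less than one might expect - unlike the greedy, error-ordered OBQ method it derives from, GPTQ quantizes columns in a fixed order shared by all rows, optionally sorted by activation magnitude ("act-order").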

AWQ Deep Dive

AWQ observes that a small fraction of weights (about 1%) are salient - they correspond to large activation magnitudes. Rather than keeping these in higher precision, AWQ applies per-channel scaling:

$$\hat{W} = \text{Quantize}(W \cdot s) \cdot \frac{1}{s}$$

where $s$ is chosen to minimize quantization error for the salient channels.
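A tiny numeric example shows why scaling before quantizing helps. This is a simplified sketch, not the AWQ implementation: it uses one weight row with a shared absmax scale, hand-picks channel 0 as the salient one, and folds $1/s$ back into the weight for comparison (in practice the inverse scale is applied on the activation side). All values are illustrative assumptions.

```python
def quantize_rtn(weights, bits=4):
    """Symmetric round-to-nearest with a single shared absmax step size."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in weights) / qmax
    return [round(w / step) * step for w in weights]


def scaled_quantize(weights, scales, bits=4):
    """Quantize(W * s) * (1 / s), applied per channel."""
    scaled = [w * s for w, s in zip(weights, scales)]
    dequantized = quantize_rtn(scaled, bits)
    return [q / s for q, s in zip(dequantized, scales)]


weights = [0.35, 1.00, -0.52, 0.77]  # channel 0 is assumed to be salient
plain = quantize_rtn(weights)
protected = scaled_quantize(weights, scales=[2.0, 1.0, 1.0, 1.0])

print(abs(plain[0] - weights[0]))      # about 0.064
print(abs(protected[0] - weights[0]))  # about 0.007
```

Because the scaled weight stays below the row maximum, the step size is unchanged and the rounding error on the salient channel is divided by $s$. A single value can round favorably or unfavorably; the reduction holds on average, and it breaks down if $s$ is large enough to raise the row maximum.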

60-Second Answer

"GPTQ minimizes layer-wise reconstruction error using calibration data and the OBS framework. AWQ protects salient channels by scaling weights before quantization - it observes that 1% of channels carry disproportionate information. GGUF is a CPU-friendly format used by llama.cpp with per-block quantization. bitsandbytes provides on-the-fly NF4 quantization integrated with HuggingFace, requiring no calibration data."

Weight + Activation Quantization

| Method | Weights | Activations | Notes |
|---|---|---|---|
| SmoothQuant | INT8 | INT8 | Migrates quantization difficulty from activations to weights |
| FPTQ | INT4 | INT8 | Mixed precision (W4A8) |
| FP8 (H100 native) | FP8 | FP8 | Hardware-supported on Hopper GPUs |

Quantization Quality Comparison

Typical perplexity impact on Llama 2 70B (WikiText-2):

| Precision | Perplexity | Model Size | Speed vs FP16 |
|---|---|---|---|
| FP16 (baseline) | 3.32 | 140 GB | 1.0x |
| INT8 (bitsandbytes) | 3.33 | 70 GB | 1.2-1.5x |
| INT4 (GPTQ) | 3.39 | 35 GB | 1.5-2.0x |
| INT4 (AWQ) | 3.36 | 35 GB | 1.5-2.0x |
| INT3 (GPTQ) | 3.61 | 26 GB | 1.8-2.2x |

Instant Rejection

"Quantization just rounds weights to fewer bits." This misses the sophisticated optimization that methods like GPTQ and AWQ perform. GPTQ uses second-order information to minimize reconstruction error. AWQ identifies and protects salient channels. Understanding these details separates senior from junior candidates.

6. Flash Attention for Inference

Review of Flash Attention

Flash Attention avoids materializing the full $S \times S$ attention matrix by computing attention in tiles that fit in GPU SRAM:

  1. Load a block of Q, K, V into SRAM.
  2. Compute local attention scores and output.
  3. Use online softmax to combine blocks without storing the full matrix.

Memory complexity: $O(S)$ instead of $O(S^2)$.
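The online softmax step is the part that most often needs explaining. The sketch below processes one query against tiles of keys and values in pure Python and checks the idea against a straightforward reference; it illustrates the numerics only, not the GPU kernel, and the tile size and function names are assumptions.

```python
import math


def scaled_scores(q, keys):
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]


def attention_reference(q, keys, values):
    """Materializes every score before applying softmax."""
    scores = scaled_scores(q, keys)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def attention_tiled(q, keys, values, tile=2):
    """Online softmax: keeps only a running max, running sum, and output accumulator."""
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        v_tile = values[start:start + tile]
        scores = scaled_scores(q, keys[start:start + tile])
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # rescale earlier partial results
        probs = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(probs)
        acc = [a * correction + sum(p * v[d] for p, v in zip(probs, v_tile))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same output up to floating-point error, but the tiled version never holds more than one tile of scores at a time.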

Flash Attention in Inference Context

During prefill, Flash Attention provides significant speedups because the full prompt attention matrix would otherwise be materialized.

During decode, each step only computes attention between one new query token and all cached keys/values. Flash Attention still helps by:

  • Avoiding materializing the $1 \times S$ attention row in HBM.
  • Fusing the softmax, matmul, and scaling into a single kernel.

FlashDecoding and FlashDecoding++

Standard decode attention is parallelized across batch and heads. FlashDecoding adds parallelism across the KV sequence length dimension:

  1. Split the KV cache into chunks.
  2. Compute partial attention for each chunk in parallel.
  3. Reduce (combine) the partial results using the log-sum-exp trick.

This is critical for long-context inference where the KV sequence can be 100K+ tokens.
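The three steps above can be sketched for a single query as follows. Each chunk returns its local maximum, its sum of exponentials, and an unnormalized output; the reduction rescales them with the log-sum-exp trick. This is an illustrative pure-Python sketch (chunks are processed one after another here, whereas a GPU kernel runs them in parallel), and the function names are assumptions.

```python
import math


def partial_attention(q, keys, values):
    """Attention over one KV chunk: returns (max score, sum of exps, unnormalized output)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    probs = [math.exp(s - top) for s in scores]
    out = [sum(p * v[d] for p, v in zip(probs, values)) for d in range(len(values[0]))]
    return top, sum(probs), out


def combine(partials):
    """Reduce per-chunk results by rescaling each to the global maximum."""
    global_max = max(top for top, _, _ in partials)
    total = 0.0
    acc = [0.0] * len(partials[0][2])
    for top, exp_sum, out in partials:
        weight = math.exp(top - global_max)
        total += exp_sum * weight
        acc = [a + o * weight for a, o in zip(acc, out)]
    return [a / total for a in acc]


def split_kv_attention(q, keys, values, num_chunks):
    chunk = math.ceil(len(keys) / num_chunks)
    partials = [partial_attention(q, keys[i:i + chunk], values[i:i + chunk])
                for i in range(0, len(keys), chunk)]
    return combine(partials)
```

With `num_chunks=1` the function reduces to ordinary attention, which gives a convenient correctness check for any other chunk count.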

7. Model Serving Frameworks

Framework Comparison

| Feature | vLLM | TensorRT-LLM | TGI | Ollama |
|---|---|---|---|---|
| PagedAttention | Yes (inventor) | Yes | Yes | Via llama.cpp |
| Continuous Batching | Yes | Yes | Yes | Limited |
| Speculative Decoding | Yes | Yes | Yes | No |
| Quantization | GPTQ, AWQ, FP8 | INT4, INT8, FP8 | GPTQ, AWQ, bitsandbytes | GGUF (all levels) |
| Tensor Parallelism | Yes | Yes | Yes | No |
| Pipeline Parallelism | Limited | Yes | No | No |
| Hardware | NVIDIA, AMD, TPU | NVIDIA only | NVIDIA, AMD | CPU, NVIDIA, Apple Silicon |
| Best For | General GPU serving | Max throughput NVIDIA | HuggingFace ecosystem | Local / edge deployment |
| API | OpenAI-compatible | Triton + OpenAI | REST + gRPC | REST (OpenAI-compatible) |

vLLM Architecture

[Figure: vLLM architecture]

Key vLLM features:

  • Preemption policies: When GPU memory is full, vLLM can either swap KV cache to CPU or recompute it later.
  • Prefix caching: Shared system prompts are cached and reused across requests.
  • Chunked prefill: Long prompts are split into chunks to avoid blocking decode steps.

TensorRT-LLM

TensorRT-LLM compiles models into optimized TensorRT engines:

  • Kernel fusion: Combines multiple operations into single GPU kernels.
  • In-flight batching: NVIDIA's implementation of continuous batching.
  • FP8 on Hopper: Native FP8 support for H100 GPUs with near-zero accuracy loss.
  • KV cache quantization: INT8 KV cache with minimal quality impact.

Company Variation

NVIDIA/GPU-heavy companies expect deep TensorRT-LLM knowledge. Startups typically use vLLM or TGI for faster iteration. Apple/edge companies care about GGUF and on-device inference. Google will ask about TPU-specific optimizations (Pallas, XLA).

Ollama and Local Inference

Ollama wraps llama.cpp for local deployment:

  • Uses GGUF format with various quantization levels (Q2_K through Q8_0).
  • Supports Apple Metal, CUDA, and CPU inference.
  • Modelfile system for customizing system prompts and parameters.
  • Growing ecosystem for local AI applications.

8. Latency vs. Throughput Optimization

The Fundamental Trade-off

[Figure: Latency vs. throughput optimization]

| Optimization | Latency Impact | Throughput Impact |
|---|---|---|
| Increase batch size | Worse (more memory pressure) | Better (amortize weight reads) |
| Speculative decoding | Better (fewer steps) | Neutral to worse |
| Quantization | Better (less memory to read) | Better (fit larger batches) |
| Tensor parallelism | Better (split across GPUs) | Neutral (communication overhead) |
| Continuous batching | Neutral | Much better |
| Flash Attention | Better (prefill) | Better (memory savings) |
| Prefix caching | Better TTFT | Better (avoid recomputation) |

Latency Breakdown

For a typical LLM serving request:

Total Latency = TTFT + (num_output_tokens × TPOT)

TTFT = Network latency + Queue wait + Prefill time
TPOT = Decode step time (model forward + sampling + scheduling overhead)

Typical numbers for Llama 2 70B on 2x A100:

  • TTFT: 200-500ms (depending on prompt length)
  • TPOT: 30-50ms per token
  • For 200 output tokens: 6-10 seconds total
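The decomposition is simple enough to script when comparing configurations. A minimal sketch using the figures above; the function names and the example network and queue values are illustrative assumptions.

```python
def ttft_ms(network_ms: float, queue_ms: float, prefill_ms: float) -> float:
    """Time to first token: everything that happens before decoding starts."""
    return network_ms + queue_ms + prefill_ms


def total_latency_ms(ttft: float, tpot: float, num_output_tokens: int) -> float:
    """Total latency = TTFT + (num_output_tokens x TPOT)."""
    return ttft + num_output_tokens * tpot


print(total_latency_ms(ttft=200, tpot=30, num_output_tokens=200))  # 6200 ms
print(total_latency_ms(ttft=500, tpot=50, num_output_tokens=200))  # 10500 ms
print(ttft_ms(network_ms=20, queue_ms=30, prefill_ms=150))         # 200 ms
```

For long outputs the TPOT term dominates, so decode-side optimizations matter more than prefill-side ones.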

Interviewer's Perspective

"I want candidates to decompose latency into TTFT and TPOT, explain which optimizations target which metric, and understand the latency-throughput trade-off. Bonus points for discussing how to set SLOs (e.g., p50 TTFT under 500ms, p99 TPOT under 100ms) and how to achieve them."

9. Cost Optimization Strategies

Semantic Caching

Cache LLM responses for semantically similar queries:

  1. Embed the incoming query.
  2. Search a vector store for similar past queries (cosine similarity above threshold, e.g., 0.95).
  3. If a match is found, return the cached response.
  4. Otherwise, call the LLM and cache the result.

Savings: Can reduce LLM calls by 20-40% for applications with repetitive queries (customer support, FAQ bots).
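The four steps above can be sketched as follows. To stay self-contained, the example substitutes a hypothetical bag-of-words `toy_embed` for a real embedding model and a linear scan for a vector store; `call_llm` is any callable the caller supplies. None of these names come from a real library.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return Counter(cleaned.split())


def cosine_similarity(a, b):
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed=toy_embed, threshold=0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def lookup(self, query):
        vector = self.embed(query)
        best_score, best_response = 0.0, None
        for cached_vector, response in self.entries:  # linear scan, for illustration
            score = cosine_similarity(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, query, response):
        self.entries.append((self.embed(query), response))


def answer(query, cache, call_llm):
    cached = cache.lookup(query)
    if cached is not None:
        return cached              # cache hit: no LLM call
    response = call_llm(query)     # cache miss: call the model and remember the result
    cache.store(query, response)
    return response
```

The threshold is the main tuning knob: set it too low and users receive answers to questions they did not ask, set it too high and the hit rate collapses.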

Prompt Caching (Provider-Level)

Anthropic and OpenAI offer prompt caching where repeated prefixes are cached server-side:

  • Only the new portion of the prompt incurs full cost.
  • Cache hits are billed at reduced rates (e.g., 90% discount on cached tokens with Anthropic).
  • Critical for RAG systems with large, repeated system prompts.

Tiered Model Routing

[Figure: Tiered model routing]

Router implementation approaches:

  1. Keyword/rule-based: Simple, fast, but brittle.
  2. Classifier-based: Train a small model to predict which LLM tier is needed.
  3. Embedding similarity: Route based on query similarity to examples requiring different tiers.
  4. Adaptive: Start with a small model; escalate if confidence is low.

Cost savings example:

  • Without routing: 100% of queries to GPT-4o at $5.00 per M tokens.
  • With routing: 60% to mini ($0.09) + 30% to 4o ($1.50) + 10% to 4.5 ($1.50) = $3.09 per M tokens. A 38% reduction.
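The blended cost is a traffic-weighted average of tier prices. In the sketch below the per-tier prices ($0.15, $5.00, and $15.00 per M tokens) are assumptions back-calculated from the per-tier contributions in the example, not quoted list prices.

```python
def blended_cost(tiers):
    """tiers: list of (traffic_share, price_per_million_tokens)."""
    total_share = sum(share for share, _ in tiers)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * price for share, price in tiers)


baseline = blended_cost([(1.0, 5.00)])
routed = blended_cost([(0.60, 0.15), (0.30, 5.00), (0.10, 15.00)])

print(round(routed, 2))                 # 3.09
print(round(1 - routed / baseline, 3))  # 0.382
```

Note how sensitive the result is to the top tier: the 10% of traffic sent to the most expensive model costs as much as the 30% sent to the middle tier.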

Other Cost Strategies

| Strategy | Savings | Complexity |
|---|---|---|
| Prompt compression | 30-50% on input tokens | Low |
| Response length limits | Variable | Low |
| Semantic caching | 20-40% of calls | Medium |
| Tiered routing | 30-50% | Medium |
| Self-hosted vs. API | 50-80% at scale | High |
| Spot instances | 60-70% on GPU cost | High |

Company Variation

Startups obsess over API costs and favor routing + caching. Large companies focus on self-hosted inference optimization (vLLM tuning, quantization). AI labs care about training-inference co-optimization and fleet utilization.

10. Advanced Topics

Tensor Parallelism vs. Pipeline Parallelism

Tensor Parallelism (TP): Split individual layers across GPUs. Each GPU holds a slice of every layer.

  • Best for latency reduction.
  • Requires high-bandwidth interconnect (NVLink).
  • Typical: 2-8 GPUs within a single node.

Pipeline Parallelism (PP): Split layers sequentially across GPUs. Each GPU holds a contiguous group of layers.

  • Best for throughput (pipeline different micro-batches).
  • Works with lower-bandwidth interconnect.
  • Typical: across nodes.

For a 70B model:

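  • 2 GPU, TP=2: Each GPU holds half of every layer. Per-token latency is roughly halved, minus all-reduce communication overhead.
  • 2 GPU, PP=2: GPU 1 holds layers 0-39, GPU 2 holds layers 40-79. Per-request latency is similar to 1 GPU because each token still passes through all 80 layers in sequence, but throughput can approach 2x when micro-batches keep both stages busy.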

Disaggregated Prefill and Decode

Separate prefill and decode into different GPU pools:

  • Prefill pool: Optimized for compute (high batch size, large matrices).
  • Decode pool: Optimized for memory bandwidth (continuous batching, long KV cache).

This allows different GPU types and configurations for each phase. Emerging approach used at scale by companies like Databricks (MoAI) and various startups.

Structured Output Optimization

When generating JSON or structured output:

  • Constrained decoding: Mask logits to only allow valid tokens according to a grammar/schema. Eliminates retries.
  • Outlines / jsonformer: Libraries that enforce output structure during generation.
  • Batched structured generation: vLLM supports grammar-guided generation at batch level.
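Constrained decoding reduces to masking logits before picking a token. The sketch below uses a toy vocabulary and a hypothetical hand-written state machine that accepts only the string `{"k":1}`; real libraries compile a grammar or JSON schema into such a mask, and nothing here reflects their actual APIs.

```python
import math

VOCAB = ["{", "}", '"k"', ":", "1", "hello"]
# Hypothetical grammar for the single shape {"k":1}:
# state -> {allowed token id: next state}
GRAMMAR = {0: {0: 1}, 1: {2: 2}, 2: {3: 3}, 3: {4: 4}, 4: {1: 5}}
END_STATE = 5


def constrained_greedy_step(logits, allowed_token_ids):
    """Set disallowed logits to -inf, then take the argmax."""
    masked = [value if i in allowed_token_ids else -math.inf
              for i, value in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)


def generate(logits_fn):
    state, output = 0, []
    while state != END_STATE:
        allowed = GRAMMAR[state]
        token = constrained_greedy_step(logits_fn(output), allowed)
        output.append(VOCAB[token])
        state = allowed[token]
    return "".join(output)


# A stand-in "model" that always prefers the invalid token "hello"
print(generate(lambda prefix: [0.1, 0.2, 0.3, 0.0, 0.5, 9.0]))  # {"k":1}
```

Even though the stand-in model always prefers an invalid token, the mask guarantees well-formed output, which is why no retry loop is needed.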

Practice Problems

Problem 1: KV Cache Memory Planning

You are deploying Llama 3 70B (80 layers, 8 KV heads, head dim 128) on 4x A100 80GB GPUs with tensor parallelism. The target is serving 64 concurrent requests at up to 8K context length with FP16 KV cache. Will the KV cache fit?

Hint 1 - Direction

Calculate the total KV cache memory for 64 concurrent sequences at 8K length. Then calculate available memory after loading the model weights.

Hint 2 - Insight

KV cache per sequence = $2 \times 80 \times 8 \times 128 \times 8192 \times 2$ bytes. Model weights at FP16 = 140 GB. With TP=4, each GPU holds 35 GB of weights, leaving 45 GB for KV cache.

Full Solution + Rubric

KV cache per sequence: $2 \times 80 \times 8 \times 128 \times 8192 \times 2 \text{ bytes} = 2.68 \text{ GB}$

Total KV cache for 64 sequences: $64 \times 2.68 = 171.5 \text{ GB}$

With TP=4, KV cache is also split across GPUs: $\frac{171.5}{4} = 42.9 \text{ GB per GPU}$

Available per GPU: 80 GB - 35 GB (weights) - 2 GB (overhead) = 43 GB.

Answer: It barely fits with about 0.1 GB of headroom, which is too tight. Solutions:

  1. Use INT8 KV cache to halve KV memory (21.4 GB per GPU).
  2. Reduce max concurrent requests to 48.
  3. Use INT4 model weights (GPTQ) to free more memory.

Scoring:

  • Strong Hire: Correctly calculates KV cache, accounts for TP splitting, identifies the tight fit, and proposes solutions with trade-off analysis.
  • Lean Hire: Gets the calculation right but misses TP splitting or does not propose solutions.
  • No Hire: Cannot set up the calculation or significantly miscalculates.

Problem 2: Speculative Decoding ROI

Your 70B model runs at 40ms per token. You have a 7B draft model that runs at 5ms per token. The acceptance rate is 0.75 with draft length K=4. What is the effective tokens-per-second rate with speculative decoding? Is it worth deploying?

Hint 1 - Direction

Calculate the expected accepted tokens per verification cycle, then the total time per cycle.

Hint 2 - Insight

Expected accepted tokens = $\frac{1 - 0.75^5}{1 - 0.75}$. Time per cycle = $K \times t_{\text{draft}} + t_{\text{target}}$.

Full Solution + Rubric

Expected accepted tokens per cycle: $\mathbb{E} = \frac{1 - 0.75^5}{1 - 0.75} = \frac{1 - 0.237}{0.25} = \frac{0.763}{0.25} = 3.05$

Time per cycle: $t_{\text{cycle}} = 4 \times 5\text{ms} + 40\text{ms} = 60\text{ms}$

Effective tokens per second: $\frac{3.05}{0.060} \approx 50.8 \text{ tokens/sec}$

Without speculative decoding: $\frac{1}{0.040} = 25 \text{ tokens/sec}$

Speedup: $\frac{50.8}{25} \approx 2.03\times$

Is it worth it? Yes for latency-sensitive single-request scenarios. But consider:

  • The draft model consumes GPU memory (about 14 GB for 7B at FP16).
  • In high-throughput scenarios, the extra memory for the draft model could be used for larger batches instead.
  • Acceptance rate depends on query distribution - code completion typically has higher acceptance than creative writing.

Scoring:

  • Strong Hire: Correct calculation, discusses when speculative decoding is beneficial vs. not, mentions memory trade-off.
  • Lean Hire: Correct calculation but limited discussion of trade-offs.
  • No Hire: Cannot apply the expected value formula or makes fundamental errors.

Problem 3: Cost Optimization Architecture

You are spending $50K/month on OpenAI API calls for a customer support chatbot handling 1M conversations/month. Design an optimization strategy to reduce costs by 50% without degrading quality.

Hint 1 - Direction

Think about which conversations are simple (FAQ-like) vs. complex. Consider caching, routing, and prompt optimization.

Hint 2 - Insight

Layer your approach: semantic caching for repeated questions, tiered routing for complexity, prompt compression for all queries. Measure quality with automated evals before and after.

Full Solution + Rubric

Architecture:

  1. Semantic caching (saves ~25%): Embed queries, cache responses for similarity above 0.95. Customer support has high query repetition.

  2. Tiered routing (saves ~35% of remaining):

    • Train a classifier on historical conversations.
    • Route ~50% of simple queries (password reset, order status) to GPT-4o-mini.
    • Route ~40% of medium queries to GPT-4o.
    • Route only ~10% of complex queries (complaints, escalations) to the most capable model.
  3. Prompt optimization (saves ~20% of remaining):

    • Compress system prompt from 2000 to 800 tokens (remove examples the model already handles).
    • Use Anthropic/OpenAI prompt caching for shared prefixes.
    • Limit response length with stop sequences.
  4. Quality assurance:

    • A/B test with human evaluation on 1% of traffic.
    • Automated quality scoring (LLM-as-judge) on all responses.
    • Escalation rate monitoring per tier.

Projected savings: 25% + (75% x 35%) + (48.75% x 20%) = 25% + 26.25% + 9.75% = approximately 61% reduction.

Scoring:

  • Strong Hire: Proposes a layered strategy with specific savings estimates, quality monitoring, and rollout plan.
  • Lean Hire: Mentions routing and caching but lacks specificity or quality safeguards.
  • No Hire: Only suggests "use a cheaper model" without architectural thinking.

Interview Cheat Sheet

TopicKey FactTypical Question
KV Cache2×L×Hkv×dh×S×B×bytes2 \times L \times H_{kv} \times d_h \times S \times B \times \text{bytes}"How much memory does the KV cache use?"
GQAReduces KV heads; Llama 2 70B uses 8:1 ratio"How does GQA reduce memory?"
Continuous BatchingIteration-level scheduling; evict/add per step"Why is continuous batching better than static?"
PagedAttentionOS-style virtual memory for KV cache; ~95% utilization"How does vLLM manage KV cache memory?"
Speculative DecodingMathematically lossless; uses min(1,pt/pd)\min(1, p_t/p_d) acceptance"Does speculative decoding change the output distribution?"
GPTQLayer-wise OBS; calibration data needed"How does GPTQ differ from naive rounding?"
AWQProtects salient channels via scaling"Why does AWQ outperform GPTQ at the same bit width?"
Flash AttentionTiled computation; O(S)O(S) memory"Why is Flash Attention faster despite more FLOPs?"
Latency vs ThroughputBatch size is the key lever"How do you optimize for latency vs throughput?"
Cost OptimizationCaching + routing + compression"How would you cut inference costs by 50%?"

Spaced Repetition Checkpoints

Day 0 (Today)

  • Explain why autoregressive decoding is memory-bandwidth-bound
  • Calculate KV cache size for a given model configuration
  • Describe continuous batching vs. static batching
  • List 4 quantization methods and their key differences

Day 3

  • Derive the speculative decoding acceptance probability
  • Explain PagedAttention and its memory utilization benefit
  • Compare vLLM, TensorRT-LLM, TGI, and Ollama
  • Calculate expected speedup from speculative decoding

Day 7

  • Design a cost optimization strategy for a $50K/month LLM deployment
  • Explain the latency-throughput trade-off with specific optimizations
  • Describe FlashDecoding and why it matters for long-context inference
  • Compare tensor parallelism vs. pipeline parallelism for inference

Day 14

  • Whiteboard a complete inference serving architecture (model, batching, caching, routing)
  • Explain GPTQ's second-order optimization in detail
  • Design a tiered model routing system with quality monitoring
  • Calculate memory requirements for a multi-model deployment

Day 21

  • Present a 30-minute deep dive on any inference optimization topic
  • Critique a given inference architecture and propose improvements
  • Explain disaggregated prefill/decode and when it is beneficial
  • Derive the roofline model crossover point for batch size

© 2026 EngineersOfAI. All rights reserved.