Module 07: LLM Inference & Optimization
Training a large language model can cost millions of dollars and run for weeks, but that cost is paid once. Inference - serving the model to real users - is paid on every request, and for a widely deployed model it costs far more in aggregate. OpenAI serves billions of tokens per day. Meta serves LLaMA to hundreds of millions of users. At that scale, every millisecond of latency and every dollar of compute compounds.
This module is about the systems that make LLM inference fast, efficient, and economically viable. You will learn how tokens are generated, why inference is hard to optimize, and the specific techniques - from KV caching to quantization to speculative decoding - that power every major inference stack in production today.
The Inference Optimization Stack
The stack runs from hardware at the bottom to application-layer caching at the top, and each layer builds on the one below. Hardware sets absolute limits on compute and memory bandwidth. Parallelism lets you use multiple GPUs. Batching fills those GPUs efficiently. KV cache avoids redundant computation. Quantization fits more into memory. Speculative decoding relaxes the one-token-per-forward-pass bottleneck of autoregressive generation. Application-layer caching avoids the GPU entirely.
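To make "hardware sets absolute limits" concrete, here is a minimal back-of-envelope sketch. It assumes batch size 1 and that every weight is read from GPU memory once per generated token, so memory bandwidth alone gives a floor on per-token latency. The parameter count and bandwidth figures are illustrative assumptions, not measurements of any specific model or GPU, and the function name is hypothetical.

```python
def min_decode_latency_s(n_params: float,
                         bytes_per_param: float,
                         mem_bandwidth_bytes_per_s: float) -> float:
    """Lower bound on seconds per generated token at batch size 1.

    Assumes each weight is streamed from memory exactly once per token
    and ignores KV cache reads, activations, and kernel overheads.
    """
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / mem_bandwidth_bytes_per_s


# Illustrative assumptions: a 7B-parameter model on an accelerator
# with 2 TB/s of memory bandwidth.
N_PARAMS = 7e9
BANDWIDTH_BYTES_PER_S = 2e12

for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    latency_ms = 1e3 * min_decode_latency_s(
        N_PARAMS, bytes_per_param, BANDWIDTH_BYTES_PER_S
    )
    print(f"{name}: at least {latency_ms:.2f} ms per token")
```

Under these assumptions the floor is about 7 ms per token for FP16 weights and falls in proportion to the bytes per weight, which is one reason quantization sits above hardware in the stack. Real systems land above this bound.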
Lessons in This Module
| # | Lesson | Core Concept |
|---|---|---|
| 01 | Autoregressive Decoding | Token-by-token generation, prefill vs decode, roofline model |
| 02 | KV Cache | Caching K/V tensors, PagedAttention, MQA/GQA |
| 03 | Sampling Strategies | Temperature, top-K, top-P (nucleus) sampling |
| 04 | Quantization (INT8/INT4) | LLM.int8(), GPTQ, AWQ, GGUF format |
| 05 | Speculative Decoding | Draft-verify, rejection sampling, Medusa |
| 06 | Continuous Batching | Iteration-level scheduling, Orca, chunked prefill |
| 07 | Tensor & Pipeline Parallelism | Multi-GPU inference, Megatron, 3D parallelism |
| 08 | vLLM & Inference Servers | vLLM, TGI, TensorRT-LLM, deployment patterns |
| 09 | Inference Cost Optimization | Cost modeling, semantic cache, model routing |
Prerequisites
- Transformer architecture (attention, feed-forward layers, residual connections)
- Basic GPU programming concepts (memory bandwidth, FLOP counts)
- Python proficiency - PyTorch, HuggingFace transformers
- Modules 01–06 of this LLMs track (recommended)
Key Concepts Glossary
| Term | Definition |
|---|---|
| Prefill | Processing the input prompt in parallel - compute-bound |
| Decode | Generating output tokens one at a time - memory-bandwidth-bound |
| TTFT | Time To First Token - latency until first output token |
| TPOT | Time Per Output Token - average time between generated tokens |
| KV Cache | Stored key/value tensors from attention layers to avoid recomputation |
| Arithmetic Intensity | FLOPs per byte of memory accessed - determines roofline bound |
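| Quantization | Mapping floating-point weights (and sometimes activations) to lower-bit integers (INT8, INT4) |
| Speculative Decoding | A small draft model proposes several tokens and the large target model verifies them in one pass - a speedup that preserves the target model's output distribution |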
| Continuous Batching | Iteration-level scheduling - swap sequences in/out every decode step |
| PagedAttention | KV cache stored in fixed-size, non-contiguous blocks - analogous to OS virtual-memory paging, applied to GPU memory |
| Tensor Parallelism | Split weight matrices across GPUs to reduce single-request latency |
| Pipeline Parallelism | Split layers across GPUs for high-throughput serving |
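
Two of the glossary entries reduce to simple arithmetic. The sketch below estimates arithmetic intensity for one linear layer (to show why prefill tends to be compute-bound and decode memory-bandwidth-bound) and the size of a KV cache. The function names are hypothetical, and the model dimensions (hidden size 4096, 32 layers, 32 KV heads, head dimension 128, FP16) are illustrative assumptions for a 7B-class model rather than values taken from this module.

```python
def matmul_arithmetic_intensity(n_tokens: int, d_in: int, d_out: int,
                                bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for y = x @ W, with x: (n_tokens, d_in), W: (d_in, d_out).

    Counts one read of W, one read of x, and one write of y.
    """
    flops = 2 * n_tokens * d_in * d_out  # one multiply and one add per term
    bytes_moved = bytes_per_elem * (
        d_in * d_out + n_tokens * d_in + n_tokens * d_out
    )
    return flops / bytes_moved


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Bytes needed to store keys and values for every layer and token."""
    # Leading factor of 2: one tensor for keys and one for values.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Decode processes 1 new token per step; prefill processes the whole prompt.
decode_ai = matmul_arithmetic_intensity(n_tokens=1, d_in=4096, d_out=4096)
prefill_ai = matmul_arithmetic_intensity(n_tokens=2048, d_in=4096, d_out=4096)
print(f"decode:  {decode_ai:.2f} FLOPs/byte")
print(f"prefill: {prefill_ai:.2f} FLOPs/byte")

# Full multi-head attention versus grouped-query attention with 8 KV heads.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
print(f"KV cache, 32 KV heads: {mha / 2**30:.2f} GiB")
print(f"KV cache,  8 KV heads: {gqa / 2**30:.2f} GiB")
```

With these assumed dimensions, a single decode step performs roughly 1 FLOP per byte moved while a 2048-token prefill performs about 1024, and cutting KV heads from 32 to 8 shrinks a 4096-token cache from 2 GiB to 0.5 GiB per sequence.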
