Module 07: LLM Inference & Optimization
Master the systems and techniques that make large language model inference fast, efficient, and cost-effective at production scale.

02 Autoregressive Decoding
Understand how LLMs generate tokens one at a time, why decoding is memory-bandwidth bound, and how to reason about inference latency with the roofline model.

03 KV Cache
Learn how the key-value cache eliminates redundant attention computation in LLM inference, and how PagedAttention solves the memory fragmentation problem.

04 Sampling Strategies: Temperature, Top-K, Top-P
Master the sampling algorithms that control LLM output diversity, from greedy decoding to nucleus sampling, and learn when to use each in production.

05 Quantization: INT8 and INT4
Master LLM quantization techniques, from LLM.int8() to GPTQ and AWQ, to run large models on commodity hardware without unacceptable quality loss.

06 Speculative Decoding
Learn how speculative decoding uses a small draft model to generate tokens that a large target model verifies in parallel, achieving a 2-3x speedup with no quality loss.

07 Continuous Batching
Learn how continuous batching eliminates GPU idle time by replacing finished sequences immediately rather than waiting for the longest request in a batch to complete.

08 Tensor and Pipeline Parallelism
Learn how tensor parallelism splits weight matrices across GPUs and pipeline parallelism splits model layers, enabling inference and training of models too large for a single GPU.

09 vLLM and Inference Servers
Learn how production inference servers like vLLM, TGI, TensorRT-LLM, and Ollama combine PagedAttention, continuous batching, and optimized kernels to serve LLMs at scale.

10 Inference Cost Optimization
Learn how to systematically reduce LLM inference costs using model selection, quantization, caching, request routing, prompt compression, and infrastructure strategies.
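The sketches below illustrate several of the techniques listed in the outline; each is a small, self-contained Python example under stated assumptions, not the reference implementation used in the lessons. First, for lesson 02, a back-of-envelope roofline estimate: assuming batch-size-1 decoding is memory-bandwidth bound, per-token latency is roughly the model's weight bytes divided by HBM bandwidth. The 7B-parameter, FP16, 2 TB/s figures are hypothetical.

```python
# Back-of-envelope roofline estimate for lesson 02. Assumption: at batch
# size 1, decoding one token streams every weight byte from HBM, so the
# lower bound on per-token latency is model bytes / memory bandwidth.

def decode_latency_ms(n_params: float, bytes_per_param: float, hbm_bytes_per_s: float) -> float:
    """Lower-bound per-token latency (ms) when decoding is bandwidth bound."""
    return n_params * bytes_per_param / hbm_bytes_per_s * 1e3

# Hypothetical numbers: 7B parameters in FP16 on a GPU with ~2 TB/s HBM.
print(f"{decode_latency_ms(7e9, 2, 2.0e12):.1f} ms/token")   # ~7 ms/token
```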
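For lesson 03, a toy single-head attention decode loop with a growing KV cache: each step projects only the newest token and attends over all cached keys and values. Dimensions and weights are made up, and PagedAttention's block-based memory management is not shown.

```python
import numpy as np

# Toy single-head decode loop for lesson 03 (KV cache). K and V for past
# tokens are computed once and appended, never recomputed.

d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

k_cache, v_cache = [], []

def decode_step(x_new: np.ndarray) -> np.ndarray:
    """x_new: hidden state of the single newest token, shape (d,)."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)      # only the new token's K/V are computed
    v_cache.append(x_new @ Wv)
    K = np.stack(k_cache)           # (t, d) -- grows by one row per step
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)     # attend over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V              # (d,) attention output for the new token

for t in range(5):
    out = decode_step(rng.standard_normal(d))
print(out.shape, len(k_cache))      # (64,) 5
```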
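For lesson 04, a minimal sampler that applies temperature scaling, then top-k, then top-p (nucleus) filtering before drawing a token. Real samplers add tie-breaking, repetition penalties, and different renormalization choices; this only shows the core filters.

```python
import numpy as np

# Minimal sampling pipeline for lesson 04: temperature -> top-k -> top-p -> draw.

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=np.random.default_rng()):
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k > 0:                        # keep only the k most probable tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
        probs /= probs.sum()

    if top_p < 1.0:                      # smallest set with cumulative mass >= top_p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask / mask.sum()

    return int(rng.choice(len(probs), p=probs))

print(sample([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_k=3, top_p=0.9))
```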
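For lesson 05, a per-channel symmetric INT8 round trip. This is only the naive baseline; LLM.int8(), GPTQ, and AWQ layer outlier handling, error-compensating updates, and activation-aware scaling on top of this idea.

```python
import numpy as np

# Per-channel symmetric INT8 weight quantization sketch for lesson 05.

def quantize_int8(w: np.ndarray):
    """w: (out_features, in_features). One scale per output channel."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4096, 4096)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(f"mean abs error: {err:.5f}, memory: {w.nbytes // 2**20} MB -> {q.nbytes // 2**20} MB")
```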
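For lesson 06, the draft-then-verify control flow of speculative decoding, reduced to a greedy toy: the draft proposes k tokens, the target checks them, and the agreed prefix plus one target token is kept. Real speculative sampling uses a probabilistic accept/reject rule so the output matches the target model's distribution exactly; the two stand-in "models" here are plain functions over token lists.

```python
# Toy greedy draft-then-verify loop for lesson 06 (speculative decoding).

def speculative_decode(draft_next, target_next, prompt, k=4, max_new=16):
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1) The cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # 2) The target model checks each proposed position (on real hardware
        #    this is one parallel forward pass, not k sequential calls).
        accepted = []
        for i, tok in enumerate(proposal):
            if target_next(seq + proposal[:i]) == tok:
                accepted.append(tok)
            else:
                break
        # 3) Keep the agreed prefix, plus one token from the target itself,
        #    so every iteration emits at least one token.
        seq += accepted
        seq.append(target_next(seq))
    return seq[len(prompt):len(prompt) + max_new]

# Hypothetical models that happen to agree: both just count upward, so each
# target pass accepts all k draft tokens and emits k + 1 tokens.
next_int = lambda s: s[-1] + 1
print(speculative_decode(next_int, next_int, prompt=[0], k=4, max_new=10))
```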
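For lesson 07, a toy scheduler simulation: each loop iteration is one decode step over the batch, and finished requests are evicted and replaced from the waiting queue immediately instead of waiting for the slowest request in the batch. Request lengths and the batch size are arbitrary.

```python
import random
from collections import deque

# Toy continuous-batching simulation for lesson 07. Finished requests free
# their slot right away, so slots never sit idle.

random.seed(0)
queue = deque({"id": i, "remaining": random.randint(1, 20)} for i in range(12))
slots, max_batch, steps, completed = [], 4, 0, []

while queue or slots:
    # Fill any free slots from the queue (the "continuous" part).
    while queue and len(slots) < max_batch:
        slots.append(queue.popleft())
    # One decode step: every active request produces one token.
    for req in slots:
        req["remaining"] -= 1
    completed += [req["id"] for req in slots if req["remaining"] == 0]
    slots = [req for req in slots if req["remaining"] > 0]
    steps += 1

print(f"{len(completed)} requests served in {steps} decode steps")
```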
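For lesson 08, column-parallel tensor parallelism shrunk to NumPy: the weight matrix is split along its output dimension across two "devices", each computes its slice of the output, and concatenation stands in for the all-gather that real hardware performs.

```python
import numpy as np

# Column-parallel linear layer sketch for lesson 08 (tensor parallelism).

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 512))        # activations (batch, hidden)
W = rng.standard_normal((512, 2048))     # full weight (hidden, ffn)

W0, W1 = np.split(W, 2, axis=1)          # shard columns across 2 "GPUs"
y0 = x @ W0                              # computed on "GPU 0"
y1 = x @ W1                              # computed on "GPU 1"
y = np.concatenate([y0, y1], axis=1)     # all-gather

assert np.allclose(y, x @ W)             # identical to the unsharded layer
```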
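For lesson 09, offline batch generation through vLLM's Python API (the LLM and SamplingParams classes). The model name is a placeholder, running it requires a GPU and a weight download, and the exact API surface may differ across vLLM versions, so treat this as a usage sketch rather than a reference.

```python
from vllm import LLM, SamplingParams

# Placeholder model; any HF-hosted causal LM that fits on the GPU works.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```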
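For lesson 10, one of the cheapest cost levers: an exact-match response cache. call_model is a hypothetical stand-in for whatever backend actually serves the request; production systems typically layer semantic (embedding-based) caching, TTLs, and eviction on top of this.

```python
import hashlib

# Exact-match response cache for lesson 10. Identical prompts hit the cache
# and skip the model (and its cost) entirely.

_cache: dict[str, str] = {}

def cached_generate(prompt: str, call_model) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)   # only pay for the first occurrence
    return _cache[key]

# Demo with a fake backend that records how often it is actually called.
calls = []
fake_model = lambda p: (calls.append(p), f"answer to: {p}")[1]
for _ in range(3):
    cached_generate("What is a KV cache?", fake_model)
print(len(calls))   # 1 -- two of the three requests were served from cache
```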