Module 07: LLM Inference & Optimization

Training a large language model can cost millions of dollars and run for weeks. But inference - serving that model to real users - often costs more in aggregate, because training is paid once while inference is paid on every request. OpenAI serves billions of tokens per day. Meta serves LLaMA to hundreds of millions of users. Every millisecond of latency and every dollar of compute cost compounds at scale.

This module is about the systems that make LLM inference fast, efficient, and economically viable. You will learn how tokens are generated, why inference is hard to optimize, and the specific techniques - from KV caching to quantization to speculative decoding - that power every major inference stack in production today.

The Inference Optimization Stack

The stack has seven layers, described here from bottom to top; each builds on the one below. Hardware sets absolute limits. Parallelism lets you use multiple GPUs. Batching fills those GPUs efficiently. KV caching avoids redundant computation. Quantization fits more into memory. Speculative decoding eases the autoregressive bottleneck. Application-layer caching avoids the GPU entirely.
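To make "quantization fits more into memory" concrete, here is a rough back-of-the-envelope sketch of the two largest memory consumers during inference: the weights and the KV cache. The model shape (7B parameters, 32 layers, 32 KV heads, head dimension 128) is an illustrative assumption rather than a description of any specific model, and the weight estimate ignores the scales and zero points that real quantized formats store alongside the integers.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: int, bits_per_weight: int) -> int:
    """Bytes needed to store the weights at a given precision."""
    return n_params * bits_per_weight // 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes for cached keys and values across all layers.

    The leading factor of 2 accounts for storing both K and V.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, chosen only for illustration.
N_PARAMS = 7_000_000_000
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 32, 128

for bits in (16, 8, 4):
    size = weight_bytes(N_PARAMS, bits) / GIB
    print(f"{bits:>2}-bit weights: {size:.1f} GiB")

# FP16 cache for 8 concurrent sequences of 4096 tokens each.
kv = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, seq_len=4096, batch_size=8)
print(f"KV cache: {kv / GIB:.1f} GiB")
```

Under these assumptions the KV cache for a modest batch is larger than the FP16 weights themselves, which is why the KV cache and quantization layers of the stack matter as much as raw compute.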

Lessons in This Module

#	Lesson	Core Concept
01	Autoregressive Decoding	Token-by-token generation, prefill vs decode, roofline model
02	KV Cache	Caching K/V tensors, PagedAttention, MQA/GQA
03	Sampling Strategies	Temperature, top-K, top-P (nucleus) sampling
04	Quantization INT8/INT4	LLM.int8(), GPTQ, AWQ, GGUF formats
05	Speculative Decoding	Draft-verify, rejection sampling, Medusa
06	Continuous Batching	Iteration-level scheduling, Orca, chunked prefill
07	Tensor & Pipeline Parallelism	Multi-GPU inference, Megatron, 3D parallelism
08	vLLM & Inference Servers	vLLM, TGI, TensorRT-LLM, deployment patterns
09	Inference Cost Optimization	Cost modeling, semantic cache, model routing

Prerequisites

Transformer architecture (attention, feed-forward layers, residual connections)
Basic GPU programming concepts (memory bandwidth, FLOP counts)
Python proficiency - PyTorch, Hugging Face Transformers
Modules 01–06 of this LLMs track (recommended)

Key Concepts Glossary

Term	Definition
Prefill	Processing the input prompt in parallel - compute-bound
Decode	Generating output tokens one at a time - memory-bandwidth-bound
TTFT	Time To First Token - latency until first output token
TPOT	Time Per Output Token - average time between generated tokens
KV Cache	Stored key/value tensors from attention layers to avoid recomputation
Arithmetic Intensity	FLOPs per byte of memory accessed - determines roofline bound
Quantization	Mapping float weights to lower-bit integers (INT8, INT4)
Speculative Decoding	A small draft model proposes several tokens that the large target model verifies in one pass - speedup without changing the output distribution
Continuous Batching	Iteration-level scheduling - swap sequences in/out every decode step
PagedAttention	KV cache stored in fixed-size, non-contiguous blocks - analogous to OS virtual-memory paging, applied to GPU memory
Tensor Parallelism	Split weight matrices across GPUs to reduce single-request latency
Pipeline Parallelism	Split layers across GPUs for high-throughput serving
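The glossary labels prefill compute-bound and decode memory-bandwidth-bound; the arithmetic-intensity entry explains why. Here is a minimal sketch that estimates the intensity of a single FP16 weight matmul in both regimes and compares it to a hardware ridge point. The accelerator numbers (300 TFLOP/s, 2 TB/s) and the hidden size of 4096 are hypothetical round figures, and the model ignores attention, caches, and kernel overheads.

```python
def matmul_intensity(m: int, k: int, n: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) matmul.

    Counts 2*m*k*n FLOPs (multiply and add) and assumes each of the
    input, weight, and output matrices crosses memory exactly once.
    """
    flops = 2 * m * k * n
    bytes_moved = bytes_per_value * (m * k + k * n + m * n)
    return flops / bytes_moved


# Hypothetical accelerator, round numbers for illustration only.
PEAK_FLOPS = 300e12      # FLOP/s
MEM_BANDWIDTH = 2e12     # bytes/s
RIDGE = PEAK_FLOPS / MEM_BANDWIDTH  # intensity where the two limits meet


def bound(intensity: float) -> str:
    """Classify a kernel against the roofline ridge point."""
    return "compute-bound" if intensity >= RIDGE else "memory-bandwidth-bound"


D_MODEL = 4096  # assumed hidden size

# Prefill: all 2048 prompt tokens pass through the weight matrix at once.
prefill = matmul_intensity(m=2048, k=D_MODEL, n=D_MODEL)
# Decode: a single new token per step (batch size 1).
decode = matmul_intensity(m=1, k=D_MODEL, n=D_MODEL)

print(f"ridge point: {RIDGE:.0f} FLOPs/byte")
print(f"prefill: {prefill:.1f} FLOPs/byte -> {bound(prefill)}")
print(f"decode:  {decode:.2f} FLOPs/byte -> {bound(decode)}")
```

In this simplified model, decode at batch size 1 performs roughly one FLOP per byte because the full weight matrix is read to produce a single token. That is the gap batching is meant to close: raising `m` by serving more sequences per step reuses each weight read across more tokens.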
