
Module 07: LLM Inference & Optimization

Training a large language model can cost millions of dollars and run for weeks. But inference - serving that model to real users - often costs far more in aggregate over the model's deployed lifetime. OpenAI serves billions of tokens per day. Meta serves LLaMA to hundreds of millions of users. At that scale, every millisecond of latency and every dollar of compute cost compounds.

This module is about the systems that make LLM inference fast, efficient, and economically viable. You will learn how tokens are generated, why inference is hard to optimize, and the specific techniques - from KV caching to quantization to speculative decoding - that power every major inference stack in production today.

The Inference Optimization Stack

Each layer builds on the one below. Hardware sets absolute limits. Parallelism lets you use multiple GPUs. Batching fills those GPUs efficiently. KV cache avoids redundant computation. Quantization fits more into memory. Speculative decoding breaks the autoregressive bottleneck. Application-layer caching avoids the GPU entirely.
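To make the "quantization fits more into memory" layer concrete, here is a minimal back-of-the-envelope sketch. The 7-billion-parameter model size is a hypothetical chosen for illustration, and the estimate counts weights only: it ignores quantization metadata (scales, zero points), activations, and the KV cache.

```python
def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to hold the model weights, in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Hypothetical model size, for illustration only.
N_PARAMS = 7e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Halving the bits per weight halves the weight footprint, which is the memory a server can then spend on larger batches or longer KV caches.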

Lessons in This Module

| #  | Lesson | Core Concept |
|----|--------|--------------|
| 01 | Autoregressive Decoding | Token-by-token generation, prefill vs decode, roofline model |
| 02 | KV Cache | Caching K/V tensors, PagedAttention, MQA/GQA |
| 03 | Sampling Strategies | Temperature, top-K, top-P (nucleus) sampling |
| 04 | Quantization INT8/INT4 | LLM.int8(), GPTQ, AWQ, GGUF formats |
| 05 | Speculative Decoding | Draft-verify, rejection sampling, Medusa |
| 06 | Continuous Batching | Iteration-level scheduling, Orca, chunked prefill |
| 07 | Tensor & Pipeline Parallelism | Multi-GPU inference, Megatron, 3D parallelism |
| 08 | vLLM & Inference Servers | vLLM, TGI, TensorRT-LLM, deployment patterns |
| 09 | Inference Cost Optimization | Cost modeling, semantic cache, model routing |
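As a rough preview of why lesson 02 pairs the KV cache with MQA/GQA, the sketch below estimates KV cache size for one sequence. The model shape (32 layers, head dimension 128, 4096-token context, FP16 cache) is a hypothetical configuration, not taken from any particular model. The only thing that changes between the three variants is the number of key/value heads.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    # Leading factor 2: one K tensor and one V tensor per layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape, for illustration only.
CONFIG = dict(n_layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(n_kv_heads=32, **CONFIG)  # multi-head: one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **CONFIG)   # grouped-query: query heads share 8 KV heads
mqa = kv_cache_bytes(n_kv_heads=1, **CONFIG)   # multi-query: a single shared KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB per sequence")
```

Under these assumptions the cache shrinks in direct proportion to the number of KV heads, which is why sharing KV heads frees memory for more concurrent sequences.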

Prerequisites

  • Transformer architecture (attention, feed-forward layers, residual connections)
  • Basic GPU programming concepts (memory bandwidth, FLOP counts)
  • Python proficiency - PyTorch, HuggingFace transformers
  • Modules 01–06 of this LLMs track (recommended)

Key Concepts Glossary

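| Term | Definition |
|------|------------|
| Prefill | Processing the entire input prompt in one parallel pass - typically compute-bound |
| Decode | Generating output tokens one at a time - typically memory-bandwidth-bound |
| TTFT | Time To First Token - latency from request arrival until the first output token |
| TPOT | Time Per Output Token - average time between successive generated tokens |
| KV Cache | Stored key/value tensors from attention layers, kept so earlier tokens are not recomputed |
| Arithmetic Intensity | FLOPs per byte of memory accessed - determines whether a kernel is compute-bound or memory-bound on the roofline model |
| Quantization | Mapping floating-point weights (and sometimes activations) to lower-bit integers (INT8, INT4) |
| Speculative Decoding | A small draft model proposes several tokens and the large target model verifies them in one pass - a lossless speedup that leaves the output distribution unchanged |
| Continuous Batching | Iteration-level scheduling - sequences join or leave the batch at every decode step |
| PagedAttention | KV cache stored in non-contiguous fixed-size blocks, analogous to OS virtual-memory paging |
| Tensor Parallelism | Splitting weight matrices across GPUs - reduces single-request latency |
| Pipeline Parallelism | Splitting layers across GPUs - suited to high-throughput serving |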
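The glossary's claim that prefill is compute-bound while decode is memory-bandwidth-bound can be checked with a small roofline calculation. The sketch below models a single FP16 weight matrix multiply and compares its arithmetic intensity to a machine's ridge point (peak FLOP/s divided by memory bandwidth). The hardware figures and the 4096-wide layer are hypothetical round numbers, not the specification of any real GPU or model.

```python
def matmul_arithmetic_intensity(n_tokens: int, d_in: int, d_out: int,
                                bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for multiplying n_tokens activations by a d_in x d_out weight."""
    flops = 2 * n_tokens * d_in * d_out              # one multiply + one add per term
    weight_bytes = d_in * d_out * bytes_per_elem     # weights are read once
    act_bytes = n_tokens * (d_in + d_out) * bytes_per_elem  # inputs read, outputs written
    return flops / (weight_bytes + act_bytes)


def roofline_bound(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a kernel against the ridge point of the roofline model."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte where the two limits meet
    return "compute-bound" if intensity >= ridge else "memory-bandwidth-bound"


# Hypothetical accelerator: 300 TFLOP/s peak, 2 TB/s memory bandwidth.
PEAK_FLOPS = 300e12
MEM_BW = 2e12
D_MODEL = 4096  # hypothetical layer width

prefill_ai = matmul_arithmetic_intensity(n_tokens=2048, d_in=D_MODEL, d_out=D_MODEL)
decode_ai = matmul_arithmetic_intensity(n_tokens=1, d_in=D_MODEL, d_out=D_MODEL)

print(f"prefill: {prefill_ai:.1f} FLOP/B ->", roofline_bound(prefill_ai, PEAK_FLOPS, MEM_BW))
print(f"decode:  {decode_ai:.2f} FLOP/B ->", roofline_bound(decode_ai, PEAK_FLOPS, MEM_BW))
```

With one token per step, each weight byte loaded from memory supports roughly one FLOP, so the decode step sits far below the ridge point. Batching more sequences into the same step raises the intensity, which is the motivation behind continuous batching.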
© 2026 EngineersOfAI. All rights reserved.