01 Module 05: Model Compression
Master the full spectrum of model compression techniques - quantization, pruning, distillation, and LoRA - to deploy large language models efficiently in production.

02 Why Model Compression Matters
The memory wall, inference costs, edge deployment, and latency requirements that make model compression essential for production AI systems - with real cost math, a full compression taxonomy, and decision frameworks for choosing the right technique.

03 Quantization Deep Dive
INT8, INT4, NF4, FP8, and block-wise quantization explained from first principles - how floating point becomes integer, what accuracy you lose, and how to tune quantization for production LLM inference.

04 GPTQ: Post-Training Quantization
GPTQ explained from first principles - how Hessian-based error compensation quantizes 175B models to 4-bit in hours, the role of calibration data, group size, and activation reordering, and how to deploy GPTQ models in production with vLLM and AutoGPTQ.

05 AWQ: Activation-Aware Weight Quantization
AWQ protects the 1% of weights that matter most - how activation statistics reveal salient weights, how scaling preserves them without extra memory, why AWQ outperforms GPTQ at INT4 for production inference, and how to configure Marlin kernels for maximum throughput.

06 Knowledge Distillation for LLMs
Training smaller student models to match larger teacher models - soft labels, temperature scaling, intermediate representation matching, API-based distillation, and a complete production pipeline for task-specific deployment.

07 Structured Pruning
Remove entire attention heads, MLP neurons, and transformer layers to achieve real hardware latency improvements - with production-grade code for Taylor importance, angular distance layer scoring, iterative recovery, and combined compression pipelines.

08 Unstructured Pruning
Weight-level sparsity, the Lottery Ticket Hypothesis, SparseGPT, Wanda, and 2:4 structured sparsity - why unstructured pruning is theoretically elegant but practically limited for LLMs.

09 LoRA for Efficient Fine-Tuning
LoRA and QLoRA: fine-tune 70B models on a single GPU by freezing the base model and training only small low-rank adapter matrices - the technique that democratized LLM customization.

10 Benchmarking Compressed Models
How to systematically evaluate accuracy-efficiency tradeoffs in quantized, pruned, and distilled models - perplexity, task-specific capabilities, latency, throughput, and automated regression detection.