Module 05: Model Compression

Modern LLMs are enormous. GPT-3 has 175 billion parameters. Llama-3-70B has 70 billion. Even "small" models like Mistral-7B need roughly 28 GB of GPU memory for their weights alone in full FP32 precision, and about 14 GB in FP16. Deploying these models at scale, with low latency, at reasonable cost, and on constrained hardware, requires model compression.
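These memory figures come from simple arithmetic: parameter count times bytes per parameter. The sketch below is a rough, weights-only estimate. It ignores activations, KV cache, and framework overhead, and it treats 1 GB as 10^9 bytes. The helper name `weight_memory_gb` and the round parameter counts are illustrative assumptions, not part of any library.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight-only footprint in GB (1 GB = 1e9 bytes).

    Hypothetical helper for illustration; real usage is higher because
    activations, KV cache, and runtime overhead are not counted.
    """
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Round parameter counts, used here as assumptions for the estimate.
MODELS = {"7B": 7e9, "70B": 70e9, "175B": 175e9}

for name, n_params in MODELS.items():
    sizes = ", ".join(
        f"{prec}: {weight_memory_gb(n_params, prec):.1f} GB"
        for prec in BYTES_PER_PARAM
    )
    print(f"{name} -> {sizes}")
```

Under these assumptions, each halving of precision halves the weight footprint, which is why INT4 quantization can bring a 7B model within reach of a single consumer GPU.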

This module covers every major compression technique used in production AI systems today: quantization (INT8, INT4, GPTQ, AWQ), knowledge distillation, structured and unstructured pruning, and parameter-efficient fine-tuning with LoRA. You will understand not just how these techniques work, but why they work, when to use each one, and how to benchmark the results honestly.

Module Roadmap

Lessons at a Glance

#	Lesson	Core Techniques	Difficulty
01	Why Compression Matters	Memory wall, cost taxonomy	Beginner
02	Quantization Deep Dive	INT8/INT4, PTQ, QAT, calibration	Intermediate
03	GPTQ	Hessian, OBS, block quantization	Advanced
04	AWQ	Salient channels, per-channel scaling	Advanced
05	Knowledge Distillation	Temperature, feature/response distillation	Intermediate
06	Structured Pruning	Head pruning, layer dropping, importance scores	Intermediate
07	Unstructured Pruning	Magnitude pruning, SparseGPT, 2:4 sparsity	Advanced
08	LoRA and QLoRA	Low-rank adapters, PEFT, DoRA	Intermediate
09	Benchmarking Compressed Models	Perplexity, MMLU, throughput, latency	Intermediate

Prerequisites

Familiarity with transformer architecture (attention heads, MLP layers, weight matrices)
Basic Python and PyTorch
Understanding of GPU memory concepts (VRAM, bandwidth)

What You Will Build

By the end of this module you will be able to:

Quantize a 7B LLM to INT4 and serve it on a single consumer GPU
Run GPTQ and AWQ quantization on any HuggingFace model
Distill a large teacher model into a smaller student
Prune attention heads and measure the accuracy-speed tradeoff
Fine-tune a quantized model with QLoRA for custom tasks
Benchmark compressed models rigorously and report results honestly
