Module 03: Model Serving
Training a model is the beginning, not the end. The harder problem is making that model available to real systems - reliably, at scale, with predictable latency - while keeping the ability to update it without breaking anything.
This module covers every layer of the model serving stack: the protocol you expose, how you batch requests to extract GPU efficiency, how to compress models for faster inference, how to compile computation graphs for peak throughput, how to cache results intelligently, how to serve many models on shared infrastructure, how to scale under load, and how to know when things go wrong before users do.
Module Map
Lessons at a Glance
| # | Lesson | Core Question |
|---|---|---|
| 01 | Serving Architectures | REST, gRPC, or WebSocket - which protocol wins at 10K QPS? |
| 02 | Batching Strategies | Why is my GPU at 20% utilization at peak load, and how do I fix it? |
| 03 | Model Quantization | How do I get 4× speedup with acceptable accuracy loss? |
| 04 | Compilation & Optimization | How does TensorRT turn 8ms inference into 2ms? |
| 05 | Caching for ML | How do I reduce LLM API costs by 70% with semantic caching? |
| 06 | Multi-Model Serving | How do I run 50 fine-tuned models on shared GPUs without chaos? |
| 07 | Inference Scaling | How do I scale from 10 to 200 replicas in 3 minutes during a launch? |
| 08 | Monitoring Serving | How do I catch serving degradation before users notice? |
Key Capabilities You Will Build
By the end of this module you will be able to:
- Choose between REST, gRPC, and WebSocket based on concrete latency and throughput requirements
- Implement dynamic batching to maximize GPU utilization while keeping p99 latency within its SLA
- Quantize models from FP32 to INT8 with calibration datasets and measure accuracy tradeoffs
- Compile inference graphs with TensorRT and torch.compile for production throughput
- Layer result caching, KV caching, and prefix caching to reduce cost and latency
- Design multi-model serving infrastructure with resource isolation and A/B routing
- Configure autoscaling policies for ML workloads using KEDA with custom GPU metrics
- Build monitoring dashboards with latency histograms, GPU utilization, and drift signals
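
To make the dynamic batching item above more concrete, here is a minimal sketch using only Python's standard `asyncio` library. It is not the implementation used in the lessons or in any particular serving framework: `DynamicBatcher`, `fake_model`, `max_batch_size`, and `max_wait_ms` are hypothetical names chosen for illustration. The idea is that a worker waits for the first request, then keeps collecting more until either the batch is full or a short deadline passes, and only then calls the model once for the whole batch.

```python
import asyncio


class DynamicBatcher:
    """Collects single requests into batches, flushing on size or timeout."""

    def __init__(self, predict_batch, max_batch_size=8, max_wait_ms=5.0):
        # predict_batch is a hypothetical callable: list of inputs -> list of outputs.
        self.predict_batch = predict_batch
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_ms / 1000.0
        self.queue = asyncio.Queue()
        self.batch_sizes = []  # recorded so batching behavior can be inspected

    async def submit(self, item):
        """Enqueue one request and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self):
        """Worker loop: build a batch, call the model once, fan results out."""
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request arrives.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            # Keep filling the batch until it is full or the deadline passes.
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            self.batch_sizes.append(len(batch))
            outputs = self.predict_batch([item for item, _ in batch])
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


def fake_model(inputs):
    # Stand-in for a real batched forward pass.
    return [x * 2 for x in inputs]


async def demo():
    batcher = DynamicBatcher(fake_model, max_batch_size=4, max_wait_ms=20.0)
    worker = asyncio.create_task(batcher.run())
    # Ten concurrent callers, each submitting a single input.
    results = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    worker.cancel()
    return results, batcher.batch_sizes
```

Running `asyncio.run(demo())` returns each caller's own result alongside the sizes of the batches the worker actually formed. The `max_wait_ms` knob is the tradeoff: a longer wait fills batches more fully but adds queueing delay to every request, which is why batching has to be tuned against the latency SLA rather than for utilization alone.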
Why Model Serving Is Hard
Most engineers underestimate model serving. They think: wrap the model in FastAPI, add a /predict endpoint, done. That works until:
- Your model grows to 7B parameters and needs GPU batching to be economically viable
- You need to update the model without dropping in-flight requests
- Your p99 latency SLA is 50ms but a single forward pass takes 40ms unoptimized
- You have 200 models and one infrastructure team, not one team per model
- A model update quietly degrades business metrics in ways that only show up after millions of requests
Each lesson in this module addresses one of these failure modes directly, with production-grade solutions.
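
The latency bullet above can be made concrete with a little arithmetic. The sketch below uses the two figures from the text (a 50 ms p99 SLA and a 40 ms unoptimized forward pass); the overhead values and the 10 ms optimized forward pass are illustrative assumptions, not measurements, and `remaining_budget_ms` is a hypothetical helper.

```python
def remaining_budget_ms(sla_ms, forward_pass_ms, overheads_ms):
    """Return the milliseconds left after the forward pass and fixed overheads."""
    return sla_ms - forward_pass_ms - sum(overheads_ms.values())


# Assumed per-request overheads outside the model itself (illustrative only).
overheads = {"network": 3.0, "serialization": 2.0, "queueing": 4.0}

# Figures from the text: 50 ms p99 SLA, 40 ms unoptimized forward pass.
slack_unoptimized = remaining_budget_ms(50.0, 40.0, overheads)

# Hypothetical: the same model after optimization brings the pass down to 10 ms.
slack_optimized = remaining_budget_ms(50.0, 10.0, overheads)
```

Under these assumptions the unoptimized model leaves almost no headroom for retries, batching delay, or load spikes, while the optimized one leaves most of the budget free. That headroom is what the batching, quantization, and compilation lessons are trying to buy.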
:::tip Prerequisites
You should be comfortable with HTTP/2 and network fundamentals and with basic Python async programming, and you should have seen a FastAPI or Flask serving setup before. GPU/CUDA familiarity helps for the optimization lessons but is not required - the concepts are explained from first principles.
:::
