Module 03: Model Serving

Training a model is the beginning, not the end. The harder problem is making that model available to real systems - reliably, at scale, with predictable latency - while keeping the ability to update it without breaking anything.

This module covers every layer of the model serving stack: the protocol you expose, how you batch requests to extract GPU efficiency, how to compress models for faster inference, how to compile computation graphs for peak throughput, how to cache results intelligently, how to serve many models on shared infrastructure, how to scale under load, and how to know when things go wrong before users do.

Lessons at a Glance

| # | Lesson | Core Question |
|---|---|---|
| 01 | Serving Architectures | REST, gRPC, or WebSocket - which protocol wins at 10K QPS? |
| 02 | Batching Strategies | Why is my GPU at 20% utilization at peak load, and how do I fix it? |
| 03 | Model Quantization | How do I get 4× speedup with acceptable accuracy loss? |
| 04 | Compilation & Optimization | How does TensorRT turn 8ms inference into 2ms? |
| 05 | Caching for ML | How do I reduce LLM API costs by 70% with semantic caching? |
| 06 | Multi-Model Serving | How do I run 50 fine-tuned models on shared GPU without chaos? |
| 07 | Inference Scaling | How do I scale from 10 to 200 replicas in 3 minutes during a launch? |
| 08 | Monitoring Serving | How do I catch serving degradation before users notice? |

Key Capabilities You Will Build

By the end of this module you will be able to:

  • Choose between REST, gRPC, and WebSocket based on concrete latency and throughput requirements
  • Implement dynamic batching to maximize GPU utilization while keeping p99 latency within its budget
  • Quantize models from FP32 to INT8 with calibration datasets and measure accuracy tradeoffs
  • Compile inference graphs with TensorRT and torch.compile for production throughput
  • Layer result caching, KV caching, and prefix caching to reduce cost and latency
  • Design multi-model serving infrastructure with resource isolation and A/B routing
  • Configure autoscaling policies for ML workloads using KEDA with custom GPU metrics
  • Build monitoring dashboards with latency histograms, GPU utilization, and drift signals
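
To make the dynamic batching item above concrete, here is a minimal sketch using only Python's standard library. It is an illustration under stated assumptions, not the implementation used in the lessons: `DynamicBatcher`, `batch_fn`, `max_batch_size`, and `max_wait_ms` are hypothetical names, and `batch_fn` stands in for a real batched model call. The loop waits for a first request, then keeps collecting until the batch is full or a short deadline passes.

```python
import asyncio


class DynamicBatcher:
    """Collect requests until the batch is full or a deadline passes."""

    def __init__(self, batch_fn, max_batch_size=8, max_wait_ms=5.0):
        self.batch_fn = batch_fn              # hypothetical: maps a list of inputs to a list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_ms / 1000.0
        self.queue = asyncio.Queue()
        self.batch_sizes = []                 # recorded only so the batching can be inspected

    async def predict(self, item):
        """Enqueue one input and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self):
        """Background loop: form batches and hand them to batch_fn."""
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request arrives, then start the clock.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                outputs = self.batch_fn([item for item, _ in batch])
            except Exception as exc:
                # Fail every request in the batch rather than leaving callers hanging.
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            self.batch_sizes.append(len(batch))
            for (_, future), output in zip(batch, outputs):
                if not future.done():
                    future.set_result(output)


async def demo():
    # Stand-in "model": doubles every input in the batch.
    batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs],
                             max_batch_size=4, max_wait_ms=20)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.predict(i) for i in range(10)))
    worker.cancel()
    return results, batcher.batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The two knobs pull in opposite directions: a larger `max_batch_size` raises GPU efficiency, while a larger `max_wait_ms` adds queueing delay to every request that arrives early in a batch. Production servers expose similar controls, but their exact names and semantics differ from this sketch.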

Why Model Serving is Hard

Most engineers underestimate model serving. They think: wrap the model in FastAPI, add a /predict endpoint, done. That works until:

  • Your model grows to 7B parameters and needs GPU batching to be economically viable
  • You need to update the model without dropping in-flight requests
  • Your p99 latency SLA is 50ms but a single forward pass takes 40ms unoptimized
  • You have 200 models and one infrastructure team, not one team per model
  • A model update quietly degrades business metrics in ways that only show up after millions of requests

Each lesson in this module addresses one of these failure modes directly, with production-grade solutions.
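
The latency bullet above is easiest to appreciate as arithmetic. The sketch below takes the 50ms SLA and 40ms forward pass from the text and subtracts some fixed overheads; the overhead figures and the `remaining_budget_ms` helper are illustrative assumptions, not measurements. The second calculation assumes the 4× speedup named in the quantization lesson's core question applies to the whole forward pass, which is an optimistic simplification.

```python
def remaining_budget_ms(sla_ms, forward_ms, overheads_ms):
    """Return the slack left after the forward pass and fixed overheads."""
    return sla_ms - forward_ms - sum(overheads_ms.values())


# Hypothetical per-request overheads in milliseconds (assumed, not measured).
overheads = {"network": 3.0, "serialization": 2.0, "preprocessing": 2.0}

# From the text: 50 ms p99 SLA, 40 ms unoptimized forward pass.
slack_unoptimized = remaining_budget_ms(50.0, 40.0, overheads)

# Assumption: a 4x speedup brings the forward pass down to 10 ms.
slack_optimized = remaining_budget_ms(50.0, 40.0 / 4, overheads)

if __name__ == "__main__":
    print(f"slack, unoptimized: {slack_unoptimized} ms")
    print(f"slack, optimized:   {slack_optimized} ms")
```

Whatever slack remains is also the upper bound on how long a batching queue can hold a request, which is why the optimization lessons and the batching lesson are closely linked.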

:::tip Prerequisites
You should be comfortable with HTTP/2 and network fundamentals, basic Python async programming, and have seen a FastAPI or Flask serving setup before. GPU/CUDA familiarity helps for the optimization lessons but is not required - the concepts are explained from first principles.
:::

© 2026 EngineersOfAI. All rights reserved.