01 Module 03: Model Serving
Production patterns for serving ML model predictions - from protocol choice and batching to quantization, compilation, caching, and autoscaling.

02 REST vs gRPC for ML Model Serving
A production engineer's guide to choosing between REST and gRPC for ML APIs - protocol mechanics, performance trade-offs, and when each wins.

03 Serving Architectures: REST vs gRPC vs WebSocket
How to choose the right serving protocol for ML models - REST, gRPC, and WebSocket compared across latency, throughput, streaming, and operational complexity.

04 Batching Strategies for Inference
How static, dynamic, and continuous batching work - and how to go from 20% GPU utilization to 85% without increasing latency.

05 Synchronous vs Asynchronous Inference
When to use synchronous versus asynchronous inference patterns for ML systems - queue architectures, streaming, timeout handling, and production trade-offs.

06 Batch Inference Pipelines
Designing efficient batch inference pipelines for scoring millions of examples - architecture, GPU utilization, failure recovery, and production patterns.

07 Model Quantization for Production Inference
How quantization reduces model size and inference latency - from FP32 to INT8 to INT4 - covering PTQ, QAT, GPTQ, AWQ, and GGUF with accuracy tradeoffs.

08 Model Compilation and Optimization
Compiler-level optimizations for ML inference - TensorRT, torch.compile, ONNX export, kernel fusion, layer fusion, XLA, and profiling bottlenecks.

09 Online Feature Computation for Model Serving
How to compute ML features at request time without blowing your latency budget - caching strategies, vectorized computation, and production patterns.

10 Caching for ML Serving
How to reduce ML serving cost and latency with result caching, semantic similarity caching, KV cache for transformers, prefix caching for LLMs, and feature caching with Redis.

11 Shadow Deployment for Safe Model Releases
How to validate new ML models on real production traffic without affecting users - traffic mirroring, prediction comparison, and graduation criteria.

12 Canary and Blue-Green Deployments for ML Models
Safe model rollout strategies - canary deployments for gradual traffic migration, blue-green for instant switch, and automated rollback triggers.

13 Multi-Model Serving
How to serve hundreds of models efficiently - model multiplexing, ensembles in production, A/B testing infrastructure, shadow mode, canary deployments, and multi-tenant GPU resource isolation.

14 Inference Scaling
Horizontal and vertical scaling for ML inference - autoscaling policies, KEDA with custom GPU metrics, spot instances, global load balancing, and handling traffic spikes.

15 Monitoring ML Serving in Production
Production monitoring for ML serving - inference latency histograms, GPU metrics, throughput monitoring, error rates, distributed tracing with OpenTelemetry, and drift detection.

16 Triton Inference Server and TorchServe
Production-grade ML serving frameworks - NVIDIA Triton's dynamic batching and multi-backend support, TorchServe's PyTorch-native serving, and when to use each.