9 docs tagged with "model-serving"

Batch Inference Pipelines

Designing efficient batch inference pipelines for scoring millions of examples - architecture, GPU utilization, failure recovery, and production patterns.

Canary and Blue-Green Deployments for ML Models

Safe model rollout strategies - canary deployments for gradual traffic migration, blue-green deployments for an instant switch, and automated rollback triggers.

Module 03: Model Serving

Production patterns for serving ML model predictions - from protocol choice and batching to quantization, compilation, caching, and autoscaling.

Module 04: Real-Time ML Systems

Architecture patterns for real-time machine learning - from sub-10 ms inference at scale to online learning, streaming inference pipelines, and ultra-low-latency optimization.

Online Feature Computation for Model Serving

How to compute ML features at request time without blowing your latency budget - caching strategies, vectorized computation, and production patterns.

REST vs gRPC for ML Model Serving

A production engineer's guide to choosing between REST and gRPC for ML APIs - protocol mechanics, performance trade-offs, and when each wins.

Shadow Deployment for Safe Model Releases

How to validate new ML models on real production traffic without affecting users - traffic mirroring, prediction comparison, and graduation criteria.

Synchronous vs Asynchronous Inference

When to use synchronous versus asynchronous inference patterns for ML systems - queue architectures, streaming, timeout handling, and production trade-offs.

Triton Inference Server and TorchServe

Production-grade ML serving frameworks - NVIDIA Triton's dynamic batching and multi-backend support, TorchServe's PyTorch-native serving, and when to use each.