9 docs tagged with "model-serving"

Batch Inference Pipelines

Designing efficient batch inference pipelines for scoring millions of examples - architecture, GPU utilization, failure recovery, and production patterns.

Module 03: Model Serving

Production patterns for serving ML model predictions - from protocol choice and batching to quantization, compilation, caching, and autoscaling.

Module 04: Real-Time ML Systems

Architecture patterns for real-time machine learning - from sub-10ms inference at scale to online learning, streaming inference pipelines, and ultra-low-latency optimization.

REST vs gRPC for ML Model Serving

A production engineer's guide to choosing between REST and gRPC for ML APIs - protocol mechanics, performance trade-offs, and when each wins.

Synchronous vs Asynchronous Inference

When to use synchronous versus asynchronous inference patterns for ML systems - queue architectures, streaming, timeout handling, and production trade-offs.

Triton Inference Server and TorchServe

Production-grade ML serving frameworks - NVIDIA Triton's dynamic batching and multi-backend support, TorchServe's PyTorch-native serving, and when to use each.