9 docs tagged with "production-deployment"

Kubernetes and Auto-Scaling for LLMs

Deploy LLMs on Kubernetes with GPU scheduling, HPA and KEDA for autoscaling, MIG partitioning on A100/H100, and Karpenter for on-demand GPU node provisioning.

Load Balancing and Request Routing

Load balancing strategies for LLM serving - prefix-aware routing for KV cache reuse, least-connections for variable-cost requests, model routing, circuit breakers, and building a production gateway.

Model Versioning and Canary Releases

Managing model versions in production LLM serving - semantic versioning for models, canary deployments, A/B testing, shadow mode evaluation, rollback procedures, and blue-green model deployments.

Module 7: Production Deployment of Open Models

vLLM, Text Generation Inference, multi-adapter serving, autoscaling, and cost analysis - deploying open-source models at production scale.

Monitoring LLM Services

Production observability for LLM serving systems - GPU metrics, TTFT, inter-token latency, vLLM Prometheus integration, distributed tracing, alerting, and Grafana dashboards.

Multi-Model Serving Architecture

Serving multiple LLMs from shared infrastructure in production - model routing, MIG partitioning, dynamic loading, LiteLLM proxy, cost optimization through bin-packing, and per-model autoscaling.

Rate Limiting and Cost Control

Controlling costs and preventing abuse in LLM API serving - token-based rate limiting, Redis token buckets, tenant isolation, cost attribution, budget alerts, and abuse detection.

TGI and Alternative Serving Frameworks

Compare Hugging Face TGI, Ollama, LiteLLM, Triton Inference Server, and llama.cpp for LLM deployment - feature analysis, performance benchmarks, and when to use each framework.

vLLM Architecture and Deployment

Deploy open LLMs at production scale using vLLM - PagedAttention, continuous batching, tensor parallelism, and OpenAI-compatible serving for Llama 3 70B and beyond.