Module 7: Production Deployment of Open Models
Running a model locally with Ollama and running it in production at scale are completely different engineering problems. Local inference prioritizes convenience. Production serving prioritizes throughput, reliability, cost efficiency, and observability. The tools and architecture change significantly.
This module covers the production deployment stack for open-source models: from selecting a serving framework, through autoscaling and cost analysis, to monitoring what is actually happening in production.
The Production Serving Stack
A production stack is more than an inference engine: requests pass through a load balancer into a serving framework such as vLLM or TGI, run on autoscaled GPU workers, and emit the metrics that monitoring and cost analysis depend on. The lessons below walk through each layer.
Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | vLLM Architecture and Setup | PagedAttention, continuous batching, configuration (see sketch below) |
| 2 | Text Generation Inference | Hugging Face TGI, Flash Attention 2, tensor parallelism |
| 3 | Serving Multiple LoRA Adapters | Punica, S-LoRA, dynamic adapter loading |
| 4 | Autoscaling Open Model Serving | Kubernetes HPA, custom metrics, scale-to-zero |
| 5 | Cost Comparison: Open vs API | TCO analysis, tokens/dollar at different scales |
| 6 | Monitoring Open Model Deployments | Latency, throughput, queue depth, TTFT, tokens/sec |
| 7 | Ollama in Production | Limitations, when it works, configuration for production |
| 8 | Open Model Deployment Architecture | Reference architecture for production open model serving |
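To make Lesson 1 concrete, here is a minimal sketch of vLLM's offline Python API; the model name and parameter values are illustrative assumptions, not recommendations.

```python
# Minimal vLLM sketch: PagedAttention and continuous batching are
# handled inside the engine; we only submit prompts.
# Model name and sampling values below are illustrative choices.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # any HF model you have access to
    gpu_memory_utilization=0.90,  # fraction of VRAM for weights + KV cache
    max_model_len=4096,           # cap context length to bound KV-cache size
)

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

# Variable-length prompts are batched continuously across decode steps,
# so short requests are not blocked behind long ones.
prompts = [
    "Explain PagedAttention in one paragraph.",
    "Why does continuous batching raise GPU utilization?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```

In production you would more likely run vLLM's OpenAI-compatible HTTP server (`vllm serve <model>` in recent releases) behind a load balancer rather than the offline API shown here.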
Key Concepts You Will Master
- Continuous batching - how vLLM keeps GPU utilization high across variable-length requests (see the vLLM sketch above)
- S-LoRA - serving thousands of LoRA adapters simultaneously over a single base model (sketched below)
- Kubernetes autoscaling for LLMs - why standard CPU-based HPA does not work and custom metrics are required (scaling-rule sketch below)
- TCO calculation - the exact math for comparing open model serving costs against API costs (worked example below)
- Production monitoring - the specific metrics that matter for LLM serving, not just HTTP 200s (metrics sketch below)
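A minimal sketch of the multi-adapter pattern from Lesson 3, using vLLM's LoRA support (which builds on the Punica/S-LoRA kernel ideas); the adapter names and paths are hypothetical.

```python
# Sketch: serving multiple LoRA adapters over one shared base model.
# Adapter names and filesystem paths below are hypothetical.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_lora=True,   # reserve kernels/memory for adapter weights
    max_loras=4,        # adapters resident on the GPU at once
)

params = SamplingParams(max_tokens=128)

# Each request can target a different adapter; base weights are shared,
# so memory grows only with the small per-adapter deltas.
support = LoRARequest("support-tone", 1, "/adapters/support-tone")
legal = LoRARequest("legal-summarizer", 2, "/adapters/legal-summarizer")

llm.generate(["Summarize this contract clause: ..."], params, lora_request=legal)
llm.generate(["Draft a friendly reply: ..."], params, lora_request=support)
```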
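The autoscaling rule itself is simple; what matters is feeding it the right signal. The sketch below applies the standard Kubernetes HPA formula to an assumed queue-depth metric rather than CPU, since a GPU can be saturated while CPU sits nearly idle. The metric values are illustrative.

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """Kubernetes HPA scaling rule: scale proportionally to the ratio
    of the observed metric to its target, then take the ceiling."""
    return math.ceil(current_replicas * current_metric / target_metric)

# CPU-based HPA fails for LLM serving because the bottleneck is the GPU.
# A queue-depth custom metric reflects real load. Illustrative numbers:
print(desired_replicas(current_replicas=3,
                       current_metric=24.0,   # avg requests waiting per replica
                       target_metric=8.0))    # -> 9 replicas
```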
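A back-of-envelope version of the TCO comparison from Lesson 5; every price and throughput figure below is an assumption you should replace with your own measurements.

```python
# Back-of-envelope TCO: cost per million tokens, self-hosted vs API.
# Every number here is an illustrative assumption.

gpu_hourly_cost = 2.50          # $/hr for one GPU instance (assumed)
throughput_tok_per_sec = 2500   # sustained aggregate output tokens/sec (assumed)
utilization = 0.60              # fraction of the day with real traffic (assumed)

tokens_per_hour = throughput_tok_per_sec * 3600 * utilization
self_hosted_per_million = gpu_hourly_cost / tokens_per_hour * 1_000_000

api_per_million = 0.60          # $/1M output tokens for a comparable API (assumed)

print(f"self-hosted: ${self_hosted_per_million:.2f} per 1M tokens")
print(f"API:         ${api_per_million:.2f} per 1M tokens")

# Break-even insight: self-hosting wins only when utilization keeps the
# effective cost below the API price -- idle GPUs still cost money.
```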
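Finally, a sketch of the monitoring side using the `prometheus_client` library; the metric names and the `generate_stream` backend are our own illustrative choices, not a standard.

```python
# Sketch: LLM-specific metrics with the prometheus_client library.
# Counting HTTP 200s is not enough; track TTFT, generated tokens,
# and queue depth. Metric names here are our own choices.
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

TTFT = Histogram("llm_time_to_first_token_seconds",
                 "Time from request arrival to the first generated token")
TOKENS = Counter("llm_generated_tokens_total", "Output tokens generated")
QUEUE_DEPTH = Gauge("llm_queue_depth", "Requests waiting for a batch slot")

def generate_stream(prompt):
    # Stand-in for a real streaming backend (hypothetical).
    for tok in prompt.split():
        time.sleep(0.01)
        yield tok

def handle_request(prompt):
    QUEUE_DEPTH.inc()
    start = time.monotonic()
    QUEUE_DEPTH.dec()  # real servers decrement when a batch slot opens
    first = True
    for token in generate_stream(prompt):
        if first:
            TTFT.observe(time.monotonic() - start)
            first = False
        TOKENS.inc()
        yield token

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrape target on :9100/metrics
    for _ in handle_request("a short demo prompt"):
        pass
```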
Prerequisites
- Running Locally
- LoRA and QLoRA
- Basic Kubernetes / Docker knowledge
- AI Systems Track (helpful context)
© 2026 EngineersOfAI. All rights reserved.
