Module 7: Production Deployment of Open Models
Running a model locally with Ollama and running it in production at scale are completely different engineering problems. Local inference prioritizes convenience. Production serving prioritizes throughput, reliability, cost efficiency, and observability. The tools and architecture change significantly.
This module covers the production deployment stack for open-source models: from selecting a serving framework, through autoscaling and cost analysis, to monitoring what is actually happening in production.
The Production Serving Stack
A production stack is more than an inference engine: requests pass through a load balancer into a serving framework such as vLLM or TGI, run on autoscaled GPU workers, and emit the metrics that monitoring and cost analysis depend on. The lessons below walk through each layer.
Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | vLLM Architecture and Setup | PagedAttention, continuous batching, configuration (see sketch below) |
| 2 | Text Generation Inference | Hugging Face TGI, Flash Attention 2, tensor parallelism |
| 3 | Serving Multiple LoRA Adapters | Punica, S-LoRA, dynamic adapter loading |
| 4 | Autoscaling Open Model Serving | Kubernetes HPA, custom metrics, scale-to-zero |
| 5 | Cost Comparison: Open vs API | TCO analysis, tokens/dollar at different scales |
| 6 | Monitoring Open Model Deployments | Latency, throughput, queue depth, TTFT, tokens/sec |
| 7 | Ollama in Production | Limitations, when it works, configuration for production |
| 8 | Open Model Deployment Architecture | Reference architecture for production open model serving |
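To make Lesson 1 concrete, here is a minimal sketch of vLLM's offline Python API; the model name and parameter values are illustrative assumptions, not recommendations.

```python
# Minimal vLLM sketch: PagedAttention and continuous batching are
# handled inside the engine; we only submit prompts.
# Model name and sampling values below are illustrative choices.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # any HF model you have access to
    gpu_memory_utilization=0.90,  # fraction of VRAM for weights + KV cache
    max_model_len=4096,           # cap context length to bound KV-cache size
)

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

# Variable-length prompts are batched continuously across decode steps,
# so short requests are not blocked behind long ones.
prompts = [
    "Explain PagedAttention in one paragraph.",
    "Why does continuous batching raise GPU utilization?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```

In production you would more likely run vLLM's OpenAI-compatible HTTP server (`vllm serve <model>` in recent releases) behind a load balancer rather than the offline API shown here.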
Key Concepts You Will Master
- Continuous batching - how vLLM keeps GPU utilization high across variable-length requests (see the vLLM sketch above)
- S-LoRA - serving thousands of LoRA adapters simultaneously over a single base model (sketched below)
- Kubernetes autoscaling for LLMs - why standard CPU-based HPA does not work and custom metrics are required (scaling-rule sketch below)
- TCO calculation - the exact math for comparing open model serving costs against API costs (worked example below)
- Production monitoring - the specific metrics that matter for LLM serving, not just HTTP 200s (metrics sketch below)
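A minimal sketch of the multi-adapter pattern from Lesson 3, using vLLM's LoRA support (which builds on the Punica/S-LoRA kernel ideas); the adapter names and paths are hypothetical.

```python
# Sketch: serving multiple LoRA adapters over one shared base model.
# Adapter names and filesystem paths below are hypothetical.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_lora=True,   # reserve kernels/memory for adapter weights
    max_loras=4,        # adapters resident on the GPU at once
)

params = SamplingParams(max_tokens=128)

# Each request can target a different adapter; base weights are shared,
# so memory grows only with the small per-adapter deltas.
support = LoRARequest("support-tone", 1, "/adapters/support-tone")
legal = LoRARequest("legal-summarizer", 2, "/adapters/legal-summarizer")

llm.generate(["Summarize this contract clause: ..."], params, lora_request=legal)
llm.generate(["Draft a friendly reply: ..."], params, lora_request=support)
```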
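The autoscaling rule itself is simple; what matters is feeding it the right signal. The sketch below applies the standard Kubernetes HPA formula to an assumed queue-depth metric rather than CPU, since a GPU can be saturated while CPU sits nearly idle. The metric values are illustrative.

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """Kubernetes HPA scaling rule: scale proportionally to the ratio
    of the observed metric to its target, then take the ceiling."""
    return math.ceil(current_replicas * current_metric / target_metric)

# CPU-based HPA fails for LLM serving because the bottleneck is the GPU.
# A queue-depth custom metric reflects real load. Illustrative numbers:
print(desired_replicas(current_replicas=3,
                       current_metric=24.0,   # avg requests waiting per replica
                       target_metric=8.0))    # -> 9 replicas
```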
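A back-of-envelope version of the TCO comparison from Lesson 5; every price and throughput figure below is an assumption you should replace with your own measurements.

```python
# Back-of-envelope TCO: cost per million tokens, self-hosted vs API.
# Every number here is an illustrative assumption.

gpu_hourly_cost = 2.50          # $/hr for one GPU instance (assumed)
throughput_tok_per_sec = 2500   # sustained aggregate output tokens/sec (assumed)
utilization = 0.60              # fraction of the day with real traffic (assumed)

tokens_per_hour = throughput_tok_per_sec * 3600 * utilization
self_hosted_per_million = gpu_hourly_cost / tokens_per_hour * 1_000_000

api_per_million = 0.60          # $/1M output tokens for a comparable API (assumed)

print(f"self-hosted: ${self_hosted_per_million:.2f} per 1M tokens")
print(f"API:         ${api_per_million:.2f} per 1M tokens")

# Break-even insight: self-hosting wins only when utilization keeps the
# effective cost below the API price -- idle GPUs still cost money.
```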
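Finally, a sketch of the monitoring side using the `prometheus_client` library; the metric names and the `generate_stream` backend are our own illustrative choices, not a standard.

```python
# Sketch: LLM-specific metrics with the prometheus_client library.
# Counting HTTP 200s is not enough; track TTFT, generated tokens,
# and queue depth. Metric names here are our own choices.
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

TTFT = Histogram("llm_time_to_first_token_seconds",
                 "Time from request arrival to the first generated token")
TOKENS = Counter("llm_generated_tokens_total", "Output tokens generated")
QUEUE_DEPTH = Gauge("llm_queue_depth", "Requests waiting for a batch slot")

def generate_stream(prompt):
    # Stand-in for a real streaming backend (hypothetical).
    for tok in prompt.split():
        time.sleep(0.01)
        yield tok

def handle_request(prompt):
    QUEUE_DEPTH.inc()
    start = time.monotonic()
    QUEUE_DEPTH.dec()  # real servers decrement when a batch slot opens
    first = True
    for token in generate_stream(prompt):
        if first:
            TTFT.observe(time.monotonic() - start)
            first = False
        TOKENS.inc()
        yield token

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrape target on :9100/metrics
    for _ in handle_request("a short demo prompt"):
        pass
```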
Prerequisites
- Running Locally
- LoRA and QLoRA
- Basic Kubernetes / Docker knowledge
- AI Systems Track (helpful context)
© 2026 EngineersOfAI. All rights reserved.
