Module 7: Production Deployment of Open Models

Running a model locally with Ollama and running it in production at scale are completely different engineering problems. Local inference prioritizes convenience. Production serving prioritizes throughput, reliability, cost efficiency, and observability. The tools and architecture change significantly.

This module covers the production deployment stack for open source models: selecting a serving framework, autoscaling, cost analysis, and monitoring what is actually happening in production.

The Production Serving Stack

Lessons in This Module

1. vLLM Architecture and Setup - PagedAttention, continuous batching, configuration (see the configuration sketch after this list)
2. Text Generation Inference - Hugging Face TGI, Flash Attention 2, tensor parallelism
3. Serving Multiple LoRA Adapters - Punica, S-LoRA, dynamic adapter loading
4. Autoscaling Open Model Serving - Kubernetes HPA, custom metrics, scale-to-zero
5. Cost Comparison: Open vs API - TCO analysis, tokens/dollar at different scales
6. Monitoring Open Model Deployments - Latency, throughput, queue depth, TTFT, tokens/sec
7. Ollama in Production - Limitations, when it works, configuration for production
8. Open Model Deployment Architecture - Reference architecture for production open model serving
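As a taste of lesson 1, here is a minimal sketch of the configuration knobs covered there (tensor parallelism, GPU memory utilization, context length), shown through vLLM's offline Python engine. The model name and all values are illustrative placeholders, not recommendations.

```python
from vllm import LLM, SamplingParams

# Hypothetical single-GPU configuration; tune every value for your hardware and model.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    tensor_parallel_size=1,                    # number of GPUs to shard the model across
    gpu_memory_utilization=0.90,               # fraction of VRAM reserved for weights and KV cache
    max_model_len=8192,                        # maximum context length to allocate KV cache for
)

params = SamplingParams(temperature=0.7, max_tokens=256)

# PagedAttention and continuous batching are applied across these requests automatically.
outputs = llm.generate(
    [
        "Summarize continuous batching in one sentence.",
        "What does tensor parallelism change about serving?",
    ],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```

In production you would typically expose the same engine behind vLLM's OpenAI-compatible HTTP server rather than calling it offline; lesson 1 walks through that server setup and its flags.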

Key Concepts You Will Master

  • Continuous batching - how vLLM achieves high GPU utilization across variable-length requests
  • S-LoRA - serving thousands of LoRA adapters simultaneously with a single base model
  • Kubernetes autoscaling for LLMs - why standard HPA does not work and custom metrics are required
  • TCO calculation - the exact math for comparing open model serving costs against API costs (see the worked sketch after this list)
  • Production monitoring - the specific metrics that matter for LLM serving (not just HTTP 200)
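To make the TCO comparison concrete, here is a small worked sketch of the math. Every constant (GPU hourly rate, sustained throughput, API price, utilization) is an illustrative assumption, not a measurement; the point is the shape of the comparison: self-hosting has a roughly fixed hourly cost, so its cost per token falls as utilization rises, while API cost scales linearly with token volume.

```python
# Hedged TCO sketch: every constant below is a placeholder assumption.
GPU_HOURLY_COST = 2.50            # assumed $/hour for one rented GPU instance
SUSTAINED_THROUGHPUT_TPS = 2500   # assumed tokens/second the server sustains under load
API_PRICE_PER_M_TOKENS = 5.00     # assumed blended API price, $ per million tokens
UTILIZATION = 0.60                # fraction of each hour the GPU is actually serving traffic

def self_hosted_cost_per_m_tokens(hourly_cost, tps, utilization):
    """Cost of one million tokens on a self-hosted GPU at a given utilization."""
    tokens_per_hour = tps * 3600 * utilization
    return hourly_cost / tokens_per_hour * 1_000_000

def breakeven_tokens_per_hour(hourly_cost, api_price_per_m):
    """Hourly token volume above which self-hosting beats the API on raw serving cost."""
    return hourly_cost / api_price_per_m * 1_000_000

self_hosted = self_hosted_cost_per_m_tokens(GPU_HOURLY_COST, SUSTAINED_THROUGHPUT_TPS, UTILIZATION)
print(f"Self-hosted: ${self_hosted:.2f} per million tokens")
print(f"API:         ${API_PRICE_PER_M_TOKENS:.2f} per million tokens")
print(f"Break-even:  {breakeven_tokens_per_hour(GPU_HOURLY_COST, API_PRICE_PER_M_TOKENS):,.0f} tokens/hour")
```

With these made-up numbers the self-hosted path works out to well under a dollar per million tokens, but only because utilization is assumed to be high; a full TCO also has to account for idle capacity, redundancy, and engineering time, which is what lesson 5 covers.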

Prerequisites

© 2026 EngineersOfAI. All rights reserved.