
TGI and Alternative Serving Frameworks

The Platform Team That Had to Support Everything

The AI platform team at a mid-sized fintech company spent six months building their internal LLM serving layer. They made a clean architectural bet: vLLM for everything. It was a reasonable choice in early 2024. Fast, high throughput, OpenAI-compatible.

Then the use cases multiplied. The fraud detection team needed a model running on a Jetson Orin at the edge, inferring in real time on transaction streams without a network hop to the cloud. vLLM does not run on ARM. The research team needed to quickly evaluate 40 different models from HuggingFace without writing deployment scripts for each one. The mobile team wanted to run a quantized 3B model on-device for a feature that could not send data to any server. The enterprise integration team needed to call commercial APIs (Claude, GPT-4) through the same interface as their internal models so they could A/B test them.

Six months of clean architecture, and now they had four different serving systems running in parallel, each with its own deployment pattern, monitoring setup, and failure modes. The on-call rotation was getting unwieldy.

This is not a story about bad engineering decisions. It is a story about the actual shape of the LLM serving landscape. The truth is that no single framework is the right answer for all deployment scenarios. vLLM is outstanding for GPU-accelerated production throughput. But there are half a dozen other frameworks, each of which solves a specific deployment problem better than vLLM does.

Understanding when to use each framework - and how to build a unified serving layer that spans all of them - is the difference between a brittle point solution and a robust AI infrastructure that can evolve as your use cases grow.

This lesson covers the full landscape: HuggingFace TGI (the most direct vLLM alternative), Ollama (developer experience focus), llama.cpp server (CPU and Apple Silicon), LiteLLM (the unified proxy layer), Triton Inference Server (NVIDIA enterprise), and MLC-LLM (edge deployment). You will finish with a concrete architecture for building a multi-backend serving layer that handles all of these through one interface.


Why This Exists - The Problem With a Single Serving Stack

No Framework Wins Every Dimension

Serving frameworks make tradeoffs across several dimensions simultaneously:

  • Throughput vs latency: Continuous batching maximizes throughput but increases individual request latency
  • Hardware flexibility: GPU-optimized frameworks often do not run on CPU, ARM, or mobile chips
  • Model coverage: Some frameworks only support specific model architectures
  • Ecosystem integration: HuggingFace integration, tokenizer handling, model format support
  • Operational complexity: Docker images, Ray clusters, simple processes
  • Licensing and cost: Some enterprise solutions require paid licenses

No single framework sits at the optimal point on every dimension. vLLM wins on raw GPU throughput for transformer models. But if you need to run on a Mac, serve a developer on their laptop, call GPT-4 through the same API, or deploy on an edge device, vLLM is not the right tool.

The Integration Tax of Multiple Backends

The real cost of running multiple serving frameworks is not the frameworks themselves. It is the integration surface. Every framework has its own:

  • API format (OpenAI-compatible vs proprietary vs REST vs gRPC)
  • Health check endpoint
  • Metrics format
  • Authentication model
  • Configuration language
  • Startup and shutdown behavior

When your application code has to know which framework is serving which model, you have a hard dependency that makes switching frameworks painful. A model upgrade should not require changing application code.

This is why LiteLLM and similar proxy layers exist. They create a stable interface that your applications depend on, while the serving backends behind them can be swapped, upgraded, or migrated without touching application code.


Historical Context - How the Serving Landscape Evolved

TGI Came First (Relative to vLLM)

HuggingFace released Text Generation Inference (TGI) in the second half of 2022, predating the vLLM PagedAttention paper by about six months. TGI was the first production-grade, high-throughput serving framework specifically designed for transformer language models. It introduced continuous batching to the open-source ecosystem (based on the Orca paper) and was the serving infrastructure behind HuggingFace's own Inference Endpoints product.

When vLLM was released in June 2023 with PagedAttention, the throughput benchmarks were strikingly better than TGI for most workloads. This was a genuine architectural advance - the memory management improvements of PagedAttention were not something TGI could easily replicate by tuning. HuggingFace eventually integrated paged attention-style KV cache management into TGI, but vLLM had established a throughput lead for most GPU workloads.

The Developer Tools Gap

What vLLM optimized for (raw GPU throughput, production scale), it deprioritized (developer experience, ease of getting started, model discovery). This created an opening.

Ollama launched in 2023 targeting a completely different user: the developer on a MacBook who wants to run LLaMA 3 locally in two commands. No GPU required. No Docker. No configuration files. Just ollama run llama3 and a model starts downloading and running. Ollama treats the developer laptop as a first-class deployment environment, using llama.cpp under the hood but wrapping it in a polished CLI and API.

LiteLLM emerged around the same time as a proxy layer - not a serving framework itself, but a translation layer. It exposed a single OpenAI-compatible API and routed requests to any backend: vLLM, TGI, Ollama, Bedrock, Azure OpenAI, Anthropic, Cohere. One SDK to rule them all.

Together, these tools addressed the full deployment spectrum from edge to cloud, from laptop to datacenter.


HuggingFace TGI - Deep Dive

Architecture and Core Features

TGI (Text Generation Inference) is a Rust-based serving framework paired with a Python model worker. The Rust core handles the HTTP server, routing, and request scheduling. The Python layer handles model execution via PyTorch. This architecture choice (Rust server, Python model) gives TGI strong performance characteristics on the I/O and scheduling side while maintaining flexibility on the model side.

Core features:

  • Continuous batching (iteration-level scheduling)
  • Tensor parallelism (same concept as vLLM)
  • Flash Attention 2 integration
  • Quantization support: GPTQ, AWQ, bitsandbytes (4-bit and 8-bit), FP8
  • Streaming responses via Server-Sent Events
  • Watermarking (for detecting AI-generated text)
  • Speculative decoding
  • Safetensors format support natively
  • HuggingFace Hub integration (model discovery, version management)

TGI Docker Deployment

# Basic TGI deployment
docker run \
--gpus all \
--shm-size 1g \
-p 8080:80 \
-v /mnt/models:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-3-8B-Instruct \
--num-shard 1 \
--max-total-tokens 4096 \
--max-input-length 3072 \
--max-batch-prefill-tokens 4096

# LLaMA 3 70B on 2 GPUs
docker run \
--gpus '"device=0,1"' \
--shm-size 4g \
-p 8080:80 \
-v /mnt/models:/data \
-e HUGGING_FACE_HUB_TOKEN="${HF_TOKEN}" \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-3-70B-Instruct \
--num-shard 2 \
--max-total-tokens 8192 \
--max-input-length 6144 \
--max-batch-prefill-tokens 8192 \
--quantize bitsandbytes-nf4

# With GPTQ quantization (requires pre-quantized model)
docker run \
--gpus all \
-p 8080:80 \
-v /mnt/models:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id TheBloke/Llama-2-70B-Chat-GPTQ \
--quantize gptq \
--max-total-tokens 4096

TGI Python Client

from huggingface_hub import InferenceClient

# TGI has its own Python client
client = InferenceClient(model="http://localhost:8080")

# Text generation
response = client.text_generation(
    "Explain the transformer architecture in plain English:",
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    stream=False,
)
print(response)

# Streaming
for token in client.text_generation(
    "Write a Python class for a binary search tree:",
    max_new_tokens=1024,
    stream=True,
):
    print(token, end="", flush=True)

# Chat format (TGI also supports an OpenAI-compatible endpoint)
from openai import OpenAI

openai_client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

response = openai_client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "user", "content": "What is gradient descent?"}
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)

TGI Configuration Parameters

# Key TGI flags and what they control

# --max-total-tokens
# Maximum total tokens (input + output) per sequence.
# Set this to your maximum prompt + completion length.
# TGI uses this to pre-allocate KV cache buckets.

# --max-input-length
# Maximum prompt length. Must be < max-total-tokens.
# TGI will error on prompts longer than this.

# --max-batch-prefill-tokens
# Maximum tokens in a single prefill batch.
# Higher = more tokens processed in parallel during prefill.
# Memory-intensive during spikes.

# --max-concurrent-requests
# Hard cap on concurrent requests.
# Requests beyond this get 503 responses.

# --waiting-served-ratio
# Ratio of waiting requests to running requests at which the scheduler
# considers pausing decoding to prefill the waiting requests into the batch.
# Lower = more aggressive batching of new requests.
# Default: 1.2 (considered once there are 20% more requests waiting than running)

# --max-batch-total-tokens
# Maximum tokens across all sequences in a decode batch.
# Primary lever for GPU utilization in TGI.
# Tune this for your GPU memory and throughput targets.
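
A quick way to bound --max-batch-total-tokens is to estimate how many tokens of KV cache fit in the memory left over after the weights are loaded. A rough sketch (the architecture numbers are for LLaMA 3 8B; the free-memory figure and safety margin are assumptions):

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: keys + values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def max_batch_total_tokens_ceiling(free_gpu_gb: float, per_token_bytes: int, safety: float = 0.9) -> int:
    """Upper bound on KV cache tokens that fit in the memory left after weights."""
    return int(free_gpu_gb * 1e9 * safety / per_token_bytes)

# LLaMA 3 8B: 32 layers, 8 KV heads (GQA), head dim 128, fp16 KV cache -> 128 KB per token
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

# Assume an 80 GB GPU holding ~16 GB of fp16 weights, leaving roughly 60 GB for KV cache
print(per_token, max_batch_total_tokens_ceiling(60, per_token))  # ~131072 bytes, ~410k tokens

# This is a memory ceiling, not a starting point: the conservative 32768 suggested in
# Common Mistakes below leaves headroom for activations and prefill spikes.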

TGI vs vLLM - Direct Comparison

The choice between TGI and vLLM is genuinely close for many workloads. Here is an honest comparison.

vLLM advantages:

  • Higher throughput for most workloads (PagedAttention is a real improvement)
  • Better handling of variable-length sequences (memory fragmentation is lower)
  • Larger community, faster feature development in 2023-2024
  • Better speculative decoding implementation
  • Supports more model architectures (Mamba, Mixtral, etc.)

TGI advantages:

  • Native HuggingFace Hub integration (model version management, private repos)
  • Safetensors format as a first-class citizen (safer model loading)
  • Better bitsandbytes integration (dynamic quantization without pre-quantized model)
  • Rust core handles HTTP connection management more efficiently at extreme concurrency
  • Watermarking support (HuggingFace's SynthID-style token watermarking)
  • Longer track record in HuggingFace Inference Endpoints (battle-tested at scale)

Practical guidance: Use vLLM as your default for new deployments. Switch to TGI when:

  • You need tight HuggingFace Hub version management (model registries with Hub as source of truth)
  • The model you need is not yet supported by vLLM
  • You need bitsandbytes dynamic quantization (no pre-quantized model available)
  • You are running on HuggingFace Inference Endpoints (TGI is the native backend)

# Benchmark comparison script
# Run on the same hardware (single A100 80GB) with LLaMA 3 8B

import asyncio
import statistics
import time
from typing import List, Tuple

import httpx

PROMPTS = [
    "Explain the concept of attention in transformers in detail.",
    "Write a merge sort implementation in Python with comments.",
    "What are the key differences between supervised and reinforcement learning?",
    "Describe the architecture of BERT and how it differs from GPT.",
] * 25  # 100 total prompts

async def benchmark_endpoint(
    base_url: str,
    model: str,
    prompts: List[str],
    concurrency: int = 10,
) -> dict:
    """Run concurrent requests and measure throughput + latency."""
    semaphore = asyncio.Semaphore(concurrency)
    start = time.time()

    async def send_request(prompt: str) -> Tuple[float, int]:
        async with semaphore:
            req_start = time.time()
            async with httpx.AsyncClient(timeout=120.0) as client:
                response = await client.post(
                    f"{base_url}/v1/chat/completions",
                    json={
                        "model": model,
                        "messages": [{"role": "user", "content": prompt}],
                        "max_tokens": 256,
                        "temperature": 0.0,
                    },
                )
                response.raise_for_status()
                data = response.json()
                tokens = data["usage"]["completion_tokens"]
                return time.time() - req_start, tokens

    tasks = [send_request(p) for p in prompts]
    results = await asyncio.gather(*tasks)

    total_time = time.time() - start
    latencies = [r[0] for r in results]
    total_tokens = sum(r[1] for r in results)

    return {
        "total_time_s": total_time,
        "throughput_req_s": len(prompts) / total_time,
        "throughput_tok_s": total_tokens / total_time,
        "p50_latency_s": statistics.median(latencies),
        "p95_latency_s": statistics.quantiles(latencies, n=20)[18],
        "p99_latency_s": statistics.quantiles(latencies, n=100)[98],
    }

async def main():
    print("Benchmarking vLLM (port 8000) vs TGI (port 8080)...")

    vllm_results = await benchmark_endpoint(
        "http://localhost:8000",
        "meta-llama/Llama-3-8B-Instruct",
        PROMPTS,
        concurrency=10,
    )

    tgi_results = await benchmark_endpoint(
        "http://localhost:8080",
        "tgi",
        PROMPTS,
        concurrency=10,
    )

    print("\nvLLM Results:")
    for k, v in vllm_results.items():
        print(f"  {k}: {v:.2f}")

    print("\nTGI Results:")
    for k, v in tgi_results.items():
        print(f"  {k}: {v:.2f}")

if __name__ == "__main__":
    asyncio.run(main())

Ollama - Developer Experience First

What Ollama Is Optimizing For

Ollama is not competing with vLLM or TGI. It is solving a completely different problem: making it trivially easy for any developer (regardless of ML background) to run open-source LLMs on their local machine.

The design priorities are:

  1. One-command installation and one-command model startup
  2. Run on hardware you actually have (MacBook Pro, Windows machine, Linux laptop)
  3. Automatic model downloading and caching
  4. OpenAI-compatible API so existing code just works
  5. No configuration required to get started

Under the hood, Ollama uses llama.cpp for inference and handles model format conversion, quantization selection, and hardware detection automatically. When you run ollama run llama3 on an M2 MacBook Pro, it downloads the appropriate Q4_K_M quantized model, detects that Metal is available, and runs inference on the GPU via Metal, without you having to understand any of that.

# Installation
curl -fsSL https://ollama.ai/install.sh | sh

# Download and run a model
ollama run llama3 # LLaMA 3 8B (default)
ollama run llama3:70b # LLaMA 3 70B (requires enough RAM/VRAM)
ollama run mistral # Mistral 7B
ollama run codellama:34b # Code-focused variant
ollama run phi3 # Microsoft Phi-3 mini

# List downloaded models
ollama list

# Show model info
ollama show llama3

# Pull without running
ollama pull llama3:70b

Ollama REST API

Ollama runs a local server (default port 11434) with both its own API and an OpenAI-compatible layer:

# Using Ollama's own API
import httpx

response = httpx.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Why is the sky blue?",
        "stream": False,
    },
)
print(response.json()["response"])

# Using the OpenAI SDK with Ollama (OpenAI-compatible endpoint)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama accepts any non-empty string
)

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "What is machine learning?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Write a haiku about neural networks."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Ollama Modelfile - Custom Models

Ollama supports a Modelfile format for customizing models with system prompts, parameters, and templates:

# Modelfile for a coding assistant
FROM llama3:8b

# System prompt
SYSTEM """
You are an expert software engineer specializing in Python and machine learning.
You write clean, well-documented code with type hints.
You explain your reasoning step by step before writing code.
"""

# Parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192

# Custom template (overrides default)
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""
# Build and run the custom model
ollama create coding-assistant -f Modelfile
ollama run coding-assistant

When to Use Ollama

Ollama is the right choice for:

  • Local development and testing of prompts before production
  • Developer environments where ease of setup is more important than throughput
  • CI/CD pipelines that need a local LLM for testing (no GPU required)
  • Privacy-sensitive use cases where data cannot leave the device
  • Demos and prototyping

Ollama is not the right choice for:

  • Production serving at scale (limited concurrency, not designed for multi-user throughput)
  • Deployments requiring specific quantization methods or precision
  • Multi-GPU setups with tensor parallelism
  • When you need fine-grained control over batching and memory management

llama.cpp Server - CPU and Apple Silicon

What Makes llama.cpp Different

llama.cpp is Georgi Gerganov's C/C++ implementation of LLaMA inference. It runs on CPU, and with Metal/CUDA/OpenCL acceleration, on GPU. The key innovations:

  • GGUF model format (quantized model files, typically 2-8 bits per weight)
  • Aggressive CPU optimization (SIMD, AVX2, AVX-512)
  • Metal acceleration on Apple Silicon (inference runs on the GPU)
  • Quantization methods: Q4_0, Q4_K_M, Q5_K_M, Q8_0, F16, F32
  • Small binary size, minimal dependencies

The performance equation for llama.cpp on Apple Silicon is compelling. An M2 Max with 96GB unified memory can load LLaMA 3 70B in Q4_K_M quantization (~40GB) and achieve 8-15 tokens/second. That is real production throughput for latency-tolerant single-user applications, without a single NVIDIA GPU.
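
The tokens-per-second numbers follow almost directly from memory bandwidth: during decode, each generated token streams the full set of quantized weights through the memory system at least once, so bandwidth divided by model size gives an order-of-magnitude ceiling. A quick sanity check (the ~400 GB/s figure is Apple's published spec for the M2 Max; treat the results as ceilings, not measurements):

def decode_ceiling_tok_s(mem_bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on single-stream decode speed: weights must be read once per token."""
    return mem_bandwidth_gb_s / model_size_gb

# M2 Max (~400 GB/s) with LLaMA 3 70B at Q4_K_M (~40 GB)
print(f"{decode_ceiling_tok_s(400, 40):.0f} tok/s ceiling")   # ~10, the same order as the 8-15 quoted above
# Same machine with an 8B model at Q4_K_M (~4.6 GB)
print(f"{decode_ceiling_tok_s(400, 4.6):.0f} tok/s ceiling")  # ~87; measured numbers are lower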

# Build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# CPU only
make

# With Metal (Apple Silicon)
make LLAMA_METAL=1

# With CUDA
make LLAMA_CUDA=1

# Start the server
./llama-server \
  --model models/llama-3-8b-instruct-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 4096 \
  --n-gpu-layers 40 \
  --threads 8 \
  --parallel 4 \
  --cont-batching
# --n-gpu-layers: offload 40 layers to the GPU (set to 99 for full GPU offload)
# --threads:      CPU threads for layers not on the GPU
# --parallel:     number of concurrent sequences
# --cont-batching: enable continuous batching

GGUF Quantization Levels

| Format | Bits/weight | Quality   | Size (7B model) | Tokens/sec (M2 Max) |
|--------|-------------|-----------|-----------------|---------------------|
| Q2_K   | 2.6         | Low       | 2.7 GB          | 45-60               |
| Q4_0   | 4.0         | Good      | 4.1 GB          | 35-45               |
| Q4_K_M | 4.4         | Very Good | 4.6 GB          | 30-40               |
| Q5_K_M | 5.7         | Near FP16 | 5.9 GB          | 25-35               |
| Q8_0   | 8.0         | Excellent | 8.1 GB          | 18-25               |
| F16    | 16.0        | Reference | 16.0 GB         | 10-15               |

For most use cases, Q4_K_M is the sweet spot: 95%+ of F16 quality at 27% of the memory footprint.
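
The sizes in the table are roughly parameters times bits per weight, plus overhead for tensors kept at higher precision, quantization scales, and metadata. A back-of-envelope estimate (the ~12% overhead factor is an assumption; real files vary by a few percent):

def gguf_size_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.12) -> float:
    """Approximate GGUF file size in GB: params * bits/weight, padded for scales and metadata."""
    return params_billion * bits_per_weight / 8 * overhead

# 7B model at the quantization levels from the table above
for name, bpw in [("Q4_K_M", 4.4), ("Q5_K_M", 5.7), ("Q8_0", 8.0), ("F16", 16.0)]:
    print(f"{name:7s} ~{gguf_size_gb(7, bpw):.1f} GB")  # ~4.3, ~5.6, ~7.8, ~15.7 GB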

# Using the llama.cpp server via its OpenAI-compatible API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="llama.cpp",
)

response = client.chat.completions.create(
    model="local-model",  # llama.cpp accepts any model name
    messages=[
        {"role": "user", "content": "Explain backpropagation in simple terms."}
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)

LiteLLM - The Unified Proxy Layer

Why You Need a Proxy Layer

LiteLLM is not a model serving framework. It is a proxy/gateway that sits in front of all your model serving backends and presents a single OpenAI-compatible API to your applications.

The problem it solves: you have vLLM serving LLaMA 3 70B, TGI serving a fine-tuned Mistral, Ollama for developer testing, and direct calls to OpenAI and Anthropic for tasks where proprietary models are required. Each backend has slightly different API formats, different authentication, different error codes. Your application code is becoming a switch statement of API variations.

LiteLLM abstracts this away. Your application calls one endpoint. LiteLLM routes, translates, retries, and logs.

# Without LiteLLM - application must know about all backends
import openai
import anthropic

def generate(model: str, prompt: str) -> str:
    if model.startswith("gpt"):
        client = openai.OpenAI()
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
    elif model.startswith("claude"):
        client = anthropic.Anthropic()
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text
    elif model.startswith("llama"):
        # vLLM endpoint
        client = openai.OpenAI(base_url="http://vllm:8000/v1", api_key="na")
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
    # etc...

# With LiteLLM - application knows nothing about backends
import litellm

def generate(model: str, prompt: str) -> str:
    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# This works for all of these:
prompt = "Summarize the transformer architecture in one paragraph."
generate("gpt-4o", prompt)
generate("claude-3-5-sonnet-20241022", prompt)
generate("vllm/meta-llama/Llama-3-70B-Instruct", prompt)
generate("ollama/llama3", prompt)
generate("huggingface/mistralai/Mistral-7B-Instruct-v0.2", prompt)

Deploying LiteLLM Proxy Server

# litellm_config.yaml
model_list:
  # Internal vLLM deployment
  - model_name: llama-3-70b
    litellm_params:
      model: openai/meta-llama/Llama-3-70B-Instruct
      api_base: http://vllm-service:8000/v1
      api_key: "na"

  - model_name: llama-3-8b
    litellm_params:
      model: openai/meta-llama/Llama-3-8B-Instruct
      api_base: http://vllm-8b-service:8000/v1
      api_key: "na"

  # TGI deployment for fine-tuned model
  - model_name: finetuned-mistral
    litellm_params:
      model: openai/tgi
      api_base: http://tgi-finetuned:8080/v1
      api_key: "na"

  # Commercial APIs (fallback or A/B test)
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: "os.environ/OPENAI_API_KEY"

  - model_name: claude-3-5-sonnet
    litellm_params:
      model: claude-3-5-sonnet-20241022
      api_key: "os.environ/ANTHROPIC_API_KEY"

  # Load balancing across multiple vLLM instances
  - model_name: llama-3-70b-balanced
    litellm_params:
      model: openai/meta-llama/Llama-3-70B-Instruct
      api_base: http://vllm-1:8000/v1
      api_key: "na"
      weight: 1

  - model_name: llama-3-70b-balanced
    litellm_params:
      model: openai/meta-llama/Llama-3-70B-Instruct
      api_base: http://vllm-2:8000/v1
      api_key: "na"
      weight: 1

# Routing, retries, and fallbacks
router_settings:
  routing_strategy: least-busy
  num_retries: 3
  timeout: 60
  fallbacks: [{"llama-3-70b": ["gpt-4o-mini"]}]  # fallback if internal model fails

litellm_settings:
  success_callback: ["langfuse"]  # observability integration
  failure_callback: ["langfuse"]
  set_verbose: False

# Start LiteLLM proxy
pip install litellm[proxy]

litellm --config litellm_config.yaml \
--port 4000 \
--host 0.0.0.0

# Docker deployment
docker run -d \
-p 4000:4000 \
-v $(pwd)/litellm_config.yaml:/app/config.yaml \
-e OPENAI_API_KEY="${OPENAI_API_KEY}" \
-e ANTHROPIC_API_KEY="${ANTHROPIC_API_KEY}" \
ghcr.io/berriai/litellm:main-latest \
--config /app/config.yaml

LiteLLM for A/B Testing

# Route 10% of traffic to gpt-4o-mini, 90% to LLaMA 3 70B
# Useful for measuring quality difference on real traffic

import random

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "production-llm",
            "litellm_params": {
                "model": "openai/meta-llama/Llama-3-70B-Instruct",
                "api_base": "http://vllm:8000/v1",
                "api_key": "na",
            },
            "model_info": {"id": "llama-70b"},
        },
        {
            "model_name": "production-llm-gpt",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": "sk-...",
            },
            "model_info": {"id": "gpt-4o-mini"},
        },
    ]
)

def route_request(prompt: str) -> tuple[str, str]:
    """Route with a 10/90 split for A/B testing."""
    model = "production-llm-gpt" if random.random() < 0.10 else "production-llm"
    response = router.completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    model_used = response._hidden_params.get("model_id", "unknown")
    return response.choices[0].message.content, model_used

NVIDIA Triton Inference Server - Enterprise Grade

When Triton Is the Right Choice

Triton Inference Server (formerly TensorRT Inference Server) is NVIDIA's enterprise-grade serving solution. It is significantly more complex to set up than vLLM or TGI but offers capabilities that matter in specific enterprise contexts:

  • Model ensemble pipelines: Chain multiple models together (preprocessing model, LLM, postprocessing model) in a single inference graph with low-latency data movement between stages
  • Multiple framework support: TensorRT, PyTorch, ONNX, TensorFlow, Python all in one server
  • Dynamic batching with priority queues: More sophisticated batching policies than vLLM/TGI
  • Model repository management: Standardized directory structure, hot model swapping
  • gRPC and HTTP/REST: Both protocols, useful for high-performance internal service calls

For LLM serving specifically, Triton is used with TensorRT-LLM backend, which compiles the model to a TensorRT engine for maximum GPU efficiency on NVIDIA hardware.

# Triton with TensorRT-LLM backend
# Step 1: Convert model to TensorRT-LLM format
python tensorrt_llm/examples/llama/convert_checkpoint.py \
--model_dir ./llama-3-8b \
--output_dir ./trt-llm-ckpt \
--dtype float16

trtllm-build \
--checkpoint_dir ./trt-llm-ckpt \
--output_dir ./trt-engine \
--gemm_plugin float16 \
--max_batch_size 32 \
--max_input_len 2048 \
--max_output_len 512

# Step 2: Set up Triton model repository
mkdir -p model_repo/llama_3_8b/1
cp ./trt-engine/* model_repo/llama_3_8b/1/
# Write config.pbtxt for the model...

# Step 3: Start Triton server
docker run -it --rm \
--gpus all \
--net host \
-v $(pwd)/model_repo:/models \
nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3 \
tritonserver --model-repository=/models

Triton's complexity is justified when: you are building an enterprise ML platform serving many different model types, you need the absolute maximum GPU efficiency on NVIDIA hardware (TensorRT engines are faster than PyTorch), or you need model ensemble pipelines with GPU-resident intermediate results.

For most LLM-only deployments, vLLM or TGI delivers 90-95% of Triton's performance with a fraction of the operational complexity.
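
If you do go the Triton route, the tritonclient package (pip install "tritonclient[http]") is the usual way to talk to the server from Python. A minimal readiness check, assuming the model repository name llama_3_8b used above:

import tritonclient.http as httpclient

# Connect to Triton's HTTP endpoint (default port 8000)
client = httpclient.InferenceServerClient(url="localhost:8000")

print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready: ", client.is_model_ready("llama_3_8b"))

# Inspect the input/output tensors the backend expects before building inference requests
print(client.get_model_metadata("llama_3_8b"))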


MLC-LLM - Edge and Mobile Deployment

The Edge Inference Problem

Cloud deployment is easy to reason about: big GPU, big memory, network connection. Edge deployment is a different world. A Jetson Orin NX has 16GB unified memory and a 1024-CUDA-core GPU. An iPhone 15 Pro has a Neural Engine that runs specific quantized models fast. A Raspberry Pi 5 has 8GB RAM and runs on ARM Cortex-A76.

No mainstream serving framework targets these environments. MLC-LLM (Machine Learning Compilation for Large Language Models), from the team behind Apache TVM, is built for exactly this gap.

MLC-LLM uses Apache TVM's compilation pipeline to generate device-specific optimized code for the target hardware. The same model can be compiled for:

  • NVIDIA GPUs (CUDA)
  • Apple Silicon (Metal)
  • ARM CPUs (NEON SIMD)
  • Android (OpenCL, Vulkan)
  • iOS (Metal, CoreML)
  • WebGPU (browser inference)

# Using the MLC-LLM Python API
from mlc_llm import MLCEngine

# Load a compiled model
engine = MLCEngine("HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC")

# Chat completion
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is machine learning?"}],
    model="Llama-3-8B-Instruct-q4f16_1",
    max_tokens=256,
)
print(response.choices[0].message.content)

engine.terminate()
# Compile a model for a specific target
# Target: CUDA (for NVIDIA GPU)
mlc_llm compile HF://meta-llama/Llama-3-8B-Instruct \
--quantization q4f16_1 \
--device cuda \
--output ./compiled-models/llama-3-8b-cuda

# Target: Metal (for Apple Silicon)
mlc_llm compile HF://meta-llama/Llama-3-8B-Instruct \
--quantization q4f16_1 \
--device metal \
--output ./compiled-models/llama-3-8b-metal

# Start REST server
mlc_llm serve ./compiled-models/llama-3-8b-cuda \
--host 0.0.0.0 \
--port 8080

For iOS/Android deployment, MLC-LLM provides SDKs that bundle the compiled model and runtime into the mobile app. This enables fully offline inference with no network dependency - the model runs entirely on device.


Building a Multi-Backend Serving Layer

Architecture

The reference stack below puts a LiteLLM proxy in front of three backends - a vLLM service for the primary production model, a TGI service for a fine-tuned model, and Ollama for developer use - with Prometheus and Grafana for monitoring. Applications talk only to the proxy on port 4000; everything behind it can be swapped without touching application code.

Complete Docker Compose Setup

# docker-compose-serving-stack.yml
version: "3.8"

services:
  # LiteLLM proxy - unified interface
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
    command: ["--config", "/app/config.yaml", "--detailed_debug"]
    depends_on:
      vllm:
        condition: service_healthy
      tgi:
        condition: service_healthy

  # vLLM - primary GPU serving (the image's entrypoint already launches the OpenAI server)
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0,1
      - HF_TOKEN=${HF_TOKEN}
    ports:
      - "8000:8000"
    volumes:
      - model-cache:/root/.cache/huggingface
    command: >
      --model meta-llama/Llama-3-70B-Instruct
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.90
      --max-model-len 8192
      --host 0.0.0.0
      --port 8000
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 180s  # 3 minutes for the model to load

  # TGI - fine-tuned model serving
  tgi:
    image: ghcr.io/huggingface/text-generation-inference:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=2
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    ports:
      - "8080:80"
    volumes:
      - model-cache:/data
    command: >
      --model-id my-org/finetuned-mistral-7b
      --num-shard 1
      --max-total-tokens 4096
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s

  # Ollama - developer use, runs on CPU
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-models:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0

  # Prometheus
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  # Grafana
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  model-cache:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /mnt/model-cache
  ollama-models:
  grafana-data:

Framework Selection Decision Tree

When a new serving requirement comes in, apply this decision process:

1. Is the target hardware an NVIDIA GPU?
   YES - continue to step 2
   NO - go to step 6

2. Is this a developer laptop or CI environment?
   YES - use Ollama (no GPU required, two-command setup)
   NO - continue to step 3

3. Is throughput (requests/second) the primary concern?
   YES - use vLLM (PagedAttention advantage), then go to step 5
   NO - continue to step 4

4. Is HuggingFace Hub model versioning critical, or do you need bitsandbytes
   dynamic quantization without a pre-quantized model?
   YES - use TGI, then go to step 5
   NO - use vLLM (simpler, higher throughput for most cases), then go to step 5

5. Do you need to unify multiple backends behind one API?
   YES - add a LiteLLM proxy in front of vLLM/TGI
   NO - serve vLLM/TGI directly

6. (No NVIDIA GPU path)
   Apple Silicon (M1/M2/M3)? - use Ollama or llama.cpp with Metal
   ARM edge device? - use llama.cpp (optimized ARM NEON) or MLC-LLM
   iOS/Android? - use MLC-LLM (compiled, on-device)
   x86 CPU server? - use llama.cpp server or Ollama

At any step: if you need model ensemble pipelines that chain multiple model types
on NVIDIA hardware, Triton Inference Server is the specialist option.
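
The same logic as a small helper, for example inside an internal platform CLI (a sketch; the argument names are invented for illustration):

def pick_framework(
    nvidia_gpu: bool,
    dev_laptop_or_ci: bool = False,
    throughput_critical: bool = False,
    needs_hub_versioning_or_bnb: bool = False,
    needs_ensembles: bool = False,
    apple_silicon: bool = False,
    arm_edge: bool = False,
    mobile: bool = False,
) -> str:
    """Encode the decision tree above; returns a serving framework recommendation."""
    if dev_laptop_or_ci:
        return "ollama"
    if nvidia_gpu:
        if needs_ensembles:
            return "triton"
        if throughput_critical:
            return "vllm"
        return "tgi" if needs_hub_versioning_or_bnb else "vllm"
    if mobile:
        return "mlc-llm"
    if apple_silicon:
        return "ollama or llama.cpp (Metal)"
    if arm_edge:
        return "llama.cpp or mlc-llm"
    return "llama.cpp server or ollama"  # x86 CPU server

print(pick_framework(nvidia_gpu=True, throughput_critical=True))  # vllm
print(pick_framework(nvidia_gpu=False, apple_silicon=True))       # ollama or llama.cpp (Metal)

Whatever the helper returns, step 5 still applies: put a LiteLLM entry in front of it so applications never depend on the choice directly.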

Production Engineering Notes

Health Check Standardization

Each framework has different health check endpoints. When using LiteLLM as a proxy, health checks on the proxy itself are more reliable than checking individual backends (LiteLLM can mark a backend unhealthy and reroute).

# Unified health check script for a multi-backend setup
import asyncio

import httpx

BACKENDS = {
    "litellm-proxy": "http://localhost:4000/health",
    "vllm": "http://localhost:8000/health",
    "tgi": "http://localhost:8080/health",
    "ollama": "http://localhost:11434/api/tags",  # Ollama has no /health endpoint
}

async def check_backend(name: str, url: str) -> tuple[str, bool, str]:
    try:
        async with httpx.AsyncClient(timeout=5.0) as client:
            response = await client.get(url)
            is_healthy = response.status_code == 200
            return name, is_healthy, f"HTTP {response.status_code}"
    except Exception as e:
        return name, False, str(e)

async def health_check_all():
    tasks = [check_backend(name, url) for name, url in BACKENDS.items()]
    results = await asyncio.gather(*tasks)
    for name, healthy, detail in results:
        status = "OK" if healthy else "FAIL"
        print(f"{name:20s} {status:6s} {detail}")

asyncio.run(health_check_all())

Model Format Compatibility Matrix

Not all model formats are supported by all frameworks. This is a common source of frustration.

| Model Format    | vLLM | TGI             | Ollama         | llama.cpp    | Triton         |
|-----------------|------|-----------------|----------------|--------------|----------------|
| Safetensors     | Yes  | Yes (preferred) | Via conversion | No           | Via conversion |
| PyTorch bin     | Yes  | Yes             | Via conversion | No           | Via conversion |
| GGUF            | No   | No              | Yes (native)   | Yes (native) | No             |
| GPTQ            | Yes  | Yes             | Via conversion | Partial      | Via TensorRT   |
| AWQ             | Yes  | Yes             | No             | No           | Via TensorRT   |
| TensorRT engine | No   | No              | No             | No           | Yes (required) |

When you need to serve on multiple frameworks, Safetensors is the most portable source format. Convert to GGUF for llama.cpp/Ollama, and to quantized Safetensors for vLLM/TGI.
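
One way to make that rule concrete is to keep the Hub Safetensors snapshot as the source of truth and derive GGUF from it as a build artifact. A sketch using llama.cpp's conversion tooling (the script and binary names follow current llama.cpp conventions, convert_hf_to_gguf.py and llama-quantize, and may differ between versions):

import subprocess
from huggingface_hub import snapshot_download

# Portable Safetensors checkpoint stays on disk as the source of truth
src_dir = snapshot_download("meta-llama/Llama-3-8B-Instruct")

# 1. Convert the HF checkpoint to an F16 GGUF (requires a local llama.cpp checkout)
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", src_dir, "--outfile", "llama-3-8b-f16.gguf"],
    check=True,
)

# 2. Quantize to Q4_K_M for llama.cpp / Ollama
subprocess.run(
    ["llama.cpp/llama-quantize", "llama-3-8b-f16.gguf", "llama-3-8b-q4_k_m.gguf", "Q4_K_M"],
    check=True,
)
# The GGUF files are derived artifacts; vLLM/TGI keep loading the original Safetensors.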

Latency Budget Breakdown

Understanding where latency comes from helps you optimize the right thing:

total latency = queue wait + TTFT + (output tokens × TPOT)

Where TTFT is time to first token (dominated by prefill compute) and TPOT is time per output token (dominated by KV cache memory bandwidth).

For a 70B model on 2x A100s:

  • TTFT for 512-token prompt: 150-250ms
  • TPOT at low load: 20-35ms per token
  • TPOT at high load (many concurrent sequences): 50-100ms per token

Queue wait is the variable part. At low traffic it is zero. At peak traffic it can be seconds. This is why horizontal scaling matters: adding more instances reduces queue wait, which has the largest impact on tail latency (p95/p99).
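
Plugging the mid-range figures above into the formula shows how output length and load dominate interactive latency (a quick sketch using the numbers quoted for a 70B model on 2x A100s):

def total_latency_s(queue_wait_s: float, ttft_s: float, output_tokens: int, tpot_s: float) -> float:
    """total latency = queue wait + TTFT + output_tokens * TPOT"""
    return queue_wait_s + ttft_s + output_tokens * tpot_s

# 512-token prompt, 256-token answer
low_load = total_latency_s(queue_wait_s=0.0, ttft_s=0.2, output_tokens=256, tpot_s=0.025)
peak_load = total_latency_s(queue_wait_s=2.0, ttft_s=0.2, output_tokens=256, tpot_s=0.075)
print(f"low load: {low_load:.1f}s   peak load: {peak_load:.1f}s")  # ~6.6s vs ~21.4s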


Common Mistakes

:::danger Running TGI without setting --max-batch-total-tokens

TGI's default batch token limit is quite conservative. Without tuning --max-batch-total-tokens, you may see very low GPU utilization (20-30%) even under significant load: the scheduler holds back because the token budget is exhausted before the GPU is. Start at max_batch_total_tokens = 32768 for an 80GB GPU and tune from there with load testing. Monitor GPU utilization with nvidia-smi dmon while running your load test.

:::

:::danger Using Ollama for production multi-user serving

Ollama is built for single-user developer experience. It does not implement continuous batching effectively for multiple concurrent users; under concurrent load, requests queue rather than batch. For any production workload with more than one user, switch to vLLM or TGI. Ollama is for development, CI testing, and single-user deployment only.

:::

:::warning LiteLLM proxy adds latency - measure it

LiteLLM is a Python process with its own HTTP handling. It adds 5-30ms of overhead per request depending on load. For interactive applications where the p95 TTFT target is 300ms, this is significant. Measure the overhead in your environment. If LiteLLM latency is too high, consider direct routing for the hottest paths and using LiteLLM only as a fallback/secondary routing layer.

:::

:::warning llama.cpp model format lock-in

GGUF files are not compatible with vLLM or TGI. If you start serving with llama.cpp and later need to migrate to vLLM (because you added GPU capacity), you will need the original Safetensors or PyTorch checkpoint. Always keep the original model weights in a portable format, and do not delete the source weights after converting to GGUF.

:::

:::warning TGI and vLLM version drift on model support

Both frameworks are moving fast, and model support depends on the exact framework version: new architectures like Mamba, Gemma, or new Mistral variants take time to land, and behavior can change between releases. Always check the release notes when upgrading. Pin framework versions in your Docker images and test model compatibility explicitly before upgrading in production.

:::


Interview Q&A

Q1: When would you choose TGI over vLLM for production deployment?

The key scenarios where TGI wins:

First, when HuggingFace Hub is your model registry. TGI has first-class Hub integration - you can point it at a Hub model ID including private repos, specific commits, and branches. If your team pushes fine-tuned model versions to Hub and wants to deploy specific commits with confidence, TGI's Hub integration is cleaner than vLLM's.

Second, when you need bitsandbytes dynamic quantization. TGI can load a float16 model and quantize it on-the-fly using bitsandbytes (--quantize bitsandbytes-nf4). vLLM requires pre-quantized models in AWQ or GPTQ format. If the model is new and no pre-quantized version exists yet, TGI is faster to get into production.

Third, when you are already running HuggingFace Inference Endpoints. TGI is the native serving backend there, so you benefit from HuggingFace's operational expertise and SLAs.

For most other GPU serving scenarios, vLLM's higher throughput (due to PagedAttention) makes it the better default.

Q2: Explain the role of LiteLLM in a production AI infrastructure.

LiteLLM is a translation and routing layer, not a serving framework. It presents a single OpenAI-compatible API to your applications while routing requests to the appropriate backend: self-hosted vLLM, TGI, Ollama, or commercial APIs like OpenAI or Anthropic.

The business value is decoupling. Your application code takes a dependency on the LiteLLM proxy, not on specific model backends. When you want to migrate from vLLM to TGI, change providers, or add a new model, you update the LiteLLM config, not the application code. This is especially valuable in organizations where ML engineers manage the serving infrastructure and application engineers write the product code - the two teams can evolve independently.

LiteLLM also handles: automatic fallbacks (if vLLM is unhealthy, route to OpenAI), load balancing across multiple vLLM instances, request logging for cost tracking and evaluation, and rate limiting per API key.

Q3: A data scientist on your team wants to experiment with 10 different 7B models over the next week. What serving setup do you recommend?

Ollama is the right answer here. The data scientist can ollama pull any model from the Ollama library (or a GGUF build from HuggingFace), run it locally with ollama run, and call it via the OpenAI-compatible API from their notebook. No infrastructure setup is required, and no GPU is needed if they have a MacBook Pro with 32GB of unified memory (a Q4_K_M 7B model is ~4.6GB, which leaves plenty of RAM).

If they need to run on GPU for speed, or need more than one model running simultaneously, deploy a shared Ollama instance or a small vLLM cluster in a dev environment. The key is separating experimentation infrastructure (low friction, disposable) from production infrastructure (high reliability, controlled).

For production, whichever models pass evaluation get wrapped in a proper vLLM deployment with the standard configuration and monitoring stack.

Q4: How does llama.cpp achieve competitive inference speed on Apple Silicon without dedicated AI accelerators?

Three mechanisms work together:

First, quantization. Q4_K_M quantization reduces a 7B model from 14GB (float16) to ~4.6GB. This means the model fits entirely in Apple's unified memory pool and can be fully accessed by both CPU and GPU without memory copies. The GPU memory bandwidth on M2 Pro (200 GB/s) becomes the throughput ceiling rather than the model size.

Second, Metal GPU integration. llama.cpp compiles Metal kernels for matrix multiplication (the dominant operation in transformer forward passes). The Apple GPU handles these efficiently, and the unified memory architecture means there is no PCIe bandwidth bottleneck between CPU-resident KV cache and GPU-resident weights.

Third, SIMD on CPU. For layers offloaded to CPU (or for full CPU inference), llama.cpp uses ARM NEON and, on x86, AVX2/AVX-512 SIMD instructions to process multiple quantized weights simultaneously. The custom quantization kernels are carefully written to maximize SIMD utilization.

The result: LLaMA 3 8B Q4_K_M on M2 Pro achieves 30-40 tokens/second. That is below A100 performance (150-200 tokens/second) but entirely practical for single-user applications.

Q5: Design a serving architecture for a company that needs: (a) high-throughput production serving for a customer-facing feature, (b) fast experimentation with new models, (c) fallback to commercial APIs when internal models fail.

Three-layer architecture:

Layer 1 - Experimentation: Ollama on developer machines (or a shared GPU server with Ollama) for rapid model evaluation. No production dependencies, easy to spin up new models.

Layer 2 - Production serving: vLLM cluster with tensor parallelism for the primary models. Multiple instances behind a load balancer. Auto-scaling based on num_requests_waiting metric. Private model registry (HuggingFace Hub private org or self-hosted) for versioned model artifacts.

Layer 3 - Unified interface: LiteLLM proxy in front of everything. Routes production traffic to vLLM by default. Has OpenAI/Anthropic as configured fallbacks. When vLLM instances are unhealthy (health check fails), LiteLLM automatically routes to commercial APIs with a configurable fallback rule.

Monitoring: Prometheus scrapes vLLM metrics (throughput, queue depth, KV cache utilization) and LiteLLM metrics (routing decisions, fallback frequency, cost per model). Alert when queue depth grows (scale trigger) or when fallback to commercial APIs is frequent (indicates underprovisioning or model quality issues).

This architecture gives you a stable application-facing interface, maximum throughput for production traffic, and the ability to experiment without touching production infrastructure.

Q6: What is GGUF and why does it matter for LLM deployment?

GGUF (GPT-Generated Unified Format) is the model file format used by llama.cpp and adopted by Ollama, LM Studio, and most CPU/edge inference tools. It is a binary format that bundles model weights, tokenizer vocabulary, and model metadata (architecture, hyperparameters) into a single file.

The key properties: GGUF supports mixed-precision quantization at the tensor level (some tensors quantized more aggressively than others), it is memory-mappable (the OS can page in only the parts of the file currently needed), and it is self-contained (no separate tokenizer files, config files, or code).

For deployment, GGUF matters because it is the standard format for running models on hardware that cannot run PyTorch: CPU servers, Apple Silicon, ARM edge devices. The quantization quality has improved significantly - Q4_K_M (4-bit with mixed precision for critical tensors) achieves 95%+ of float16 quality for most tasks.

The limitation: GGUF is not directly supported by vLLM or TGI. If you have a model in GGUF format and need to serve it with vLLM for high throughput, you need to convert back to a PyTorch/Safetensors checkpoint, which requires the original training checkpoint. This is why the rule of thumb is to keep original weights in Safetensors and generate GGUF as a secondary artifact.


Summary

The LLM serving landscape is not a competition with one winner. It is a spectrum of tools, each optimized for a different deployment context.

vLLM wins on raw GPU throughput and should be your production default for NVIDIA GPU deployments.

TGI wins on HuggingFace ecosystem integration and is the right choice when Hub model management or bitsandbytes dynamic quantization is a priority.

Ollama wins on developer experience and is the correct choice for local development, experimentation, and single-user deployments.

llama.cpp wins on hardware flexibility - CPU, Apple Silicon, ARM, x86 - and is the production standard for edge and on-device inference.

LiteLLM is not a serving framework but a glue layer that makes all of the above interoperable. Deploy it as a proxy and your application code never has to know which framework is running which model.

The architecture that scales: vLLM clusters for production GPU serving, Ollama for developer environments, LiteLLM as the unified interface, commercial APIs as fallback. Monitor with Prometheus at every layer. This setup handles everything from one developer on a laptop to millions of production requests per day.

The next lesson covers Kubernetes autoscaling for LLM workloads - how to scale vLLM deployments horizontally based on request queue depth and KV cache pressure, with zero-downtime rolling updates when you upgrade model versions.
