
Ollama and Local Model Management

The Team That Couldn't Agree on Python Environments

Your team has eight engineers. Three of them successfully ran local LLMs last week. The other five couldn't reproduce it. One hit a CUDA version mismatch between their system drivers and the version of PyTorch they had installed. Another spent two hours debugging the difference between llama-cpp-python built with Metal support versus without it. Two more gave up after their virtual environments conflicted with each other in ways that produced errors nobody had seen before. The eighth engineer quietly spent Friday afternoon rebuilding their machine.

By Monday you have a different problem: the three engineers who got things working are all using different models, different quantization levels, and different prompt formats. The code they're writing doesn't share any infrastructure. One has a Python script that loads the model directly. Another set up llama.cpp compiled from source with a custom shell script. The third discovered something called Ollama, got everything working in fifteen minutes, and quietly said nothing because they didn't want to start another environment debate.

That quiet fifteen-minute success is where Ollama earns its reputation. It solves a problem that isn't really about LLMs at all - it's about making software easy to install, run, and share across a team with different machines, different operating systems, and different levels of patience for debugging build systems.

The mental model for Ollama is Docker for LLMs. Docker solved the "it works on my machine" problem for containerized applications by bundling the application with its entire runtime environment. Ollama solves the "it works on my laptop" problem for language models by bundling the inference engine, model weights, model configuration, and a standardized API into a single managed service. You install Ollama once. You pull a model with one command. You run it with one command. The model works the same way on your laptop as it does on your colleague's laptop as it does on your production server.

This lesson covers Ollama end-to-end: the internals of how it manages models, the Modelfile format for customization, the REST API, Python integration, and how to build a complete local AI stack suitable for a small team or a production internal deployment.


Why This Exists - The Three Problems Before Ollama

Before Ollama (released in mid-2023), running a local LLM involved three distinct problems that each required separate solutions:

Problem 1: Installation complexity. llama.cpp requires compiling from source with the right CMake flags for your hardware. PyTorch-based inference requires matching CUDA versions, installing transformers, handling tokenizer dependencies, and managing Python environment conflicts. A new team member getting started with local LLMs was guaranteed to spend at least an afternoon debugging their setup. This wasn't a one-time cost - system updates, new projects, and new machines all triggered the same problem.

Problem 2: Model management. Once you could run a model, where did you put it? How did you switch between models? How did you share model configurations with your team? There was no standard answer. Models were stored in ~/Downloads, in git repos (problematic given sizes), in S3 buckets with ad-hoc download scripts, or on a shared NFS mount. Prompt templates and system prompts were hardcoded in scripts, not versioned with the model.

Problem 3: API consistency. Every inference library had a different API. llama.cpp's Python bindings had one interface. The Hugging Face transformers pipeline API had another. A custom llama.cpp server had a third. Code written against one couldn't be reused with another. Switching from a 7B model to a 13B model might require changing code, not just changing a model path.

Ollama solves all three at once. It ships as a single binary with an embedded inference engine (llama.cpp under the hood). It has a model registry and local model management system analogous to Docker's image management. And it exposes a consistent OpenAI-compatible REST API regardless of which model you're running, making it trivially easy to switch models without changing application code.

The cost is some flexibility. Ollama abstracts away the fine-grained control that llama.cpp's command line gives you. For production edge cases where you need to tune every inference parameter, you may eventually reach for llama.cpp directly. For everything else - development, prototyping, small-team deployments, internal tools - Ollama's simplicity wins.


History - Becoming the Go-To Local Inference Tool

Ollama was created by Jeffrey Morgan and his co-founders at Ollama Inc. The first public release appeared on GitHub in mid-2023, just a few months after llama.cpp established that local LLM inference was viable.

The founding insight was that llama.cpp had solved the hard technical problem - quantized inference on consumer hardware - but hadn't solved the user experience problem. The llama.cpp CLI was a tool for developers comfortable compiling C++ and wrangling command-line flags. The vast majority of engineers who wanted to experiment with local LLMs needed something with the ergonomics of docker pull and docker run.

Morgan described the design goal in an early Hacker News comment: "we wanted it to feel like Docker, because everyone already knows how Docker works." This analogy turned out to be exactly right. Engineers immediately understood ollama pull llama3, ollama run llama3, and ollama list because they already knew docker pull, docker run, and docker images. The mental model transferred directly.

The project grew faster than almost any other AI tooling project in the same period, accumulating tens of thousands of GitHub stars within months of release and eventually passing 100,000. Open WebUI (a web interface for Ollama) saw comparable adoption, driven almost entirely by Ollama's user base.

The "aha moment" for most early adopters was the Modelfile. The ability to create a custom model with a specific system prompt, a specific parameter set, and a specific base model - all in a five-line text file that could be committed to git and shared with a team - was something nobody had previously made easy. A team could maintain a Modelfile.support-bot that encoded their entire support AI configuration, version it like code, and deploy it with ollama create support-bot -f Modelfile.support-bot.


Core Concepts

How Ollama Manages Models

Ollama stores models locally in a content-addressable store, similar to how Docker stores image layers. On macOS and Linux, the default location is ~/.ollama/models/. The directory structure is:

~/.ollama/models/
  blobs/
    sha256-abc123...     <- model weights (GGUF binary data)
    sha256-def456...     <- another model or layer
  manifests/
    registry.ollama.ai/
      library/
        llama3.2/
          latest         <- JSON manifest referencing blobs
        mistral/
          latest

Each model is a manifest that references one or more blobs. The blobs are content-addressed by SHA256 hash, so if two models share a base (for example, two Modelfile variants both built on llama3.2), they share the same base blob and only store the diff. This is analogous to Docker layer caching.
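You can see this layout for yourself. The sketch below (Python, assuming the default ~/.ollama/models location and that llama3.2 has already been pulled; treat the exact manifest keys as an assumption about Ollama's OCI-style format) lists the cached blobs and prints the layers one manifest references:

# Sketch: peek inside Ollama's content-addressable store. Assumes the
# default ~/.ollama/models layout described above; adjust paths for
# your setup.
import json
from pathlib import Path

store = Path.home() / ".ollama" / "models"

# Blobs are named by their SHA256 digest
for blob in sorted((store / "blobs").glob("sha256-*")):
    print(f"{blob.name[:19]}...  {blob.stat().st_size / 1e9:.2f} GB")

# A manifest is a small JSON file pointing at those blobs
manifest_path = store / "manifests" / "registry.ollama.ai" / "library" / "llama3.2" / "latest"
if manifest_path.exists():
    manifest = json.loads(manifest_path.read_text())
    for layer in manifest.get("layers", []):
        print(layer.get("mediaType"), layer.get("digest"), layer.get("size"))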

When you run ollama pull llama3.2, Ollama:

  1. Fetches the manifest for llama3.2:latest from registry.ollama.ai
  2. Checks which blobs are already cached locally
  3. Downloads only the missing blobs
  4. Stores the manifest pointing to the blobs

When you run ollama run llama3.2, Ollama:

  1. Loads the manifest to find the GGUF file blob
  2. Starts an internal llama.cpp process with appropriate parameters
  3. Exposes an HTTP endpoint for communication
  4. Manages the process lifecycle (keeps it running for a TTL period after last use, then unloads)

The model stays loaded in memory as long as requests are coming in. After a configurable timeout (default: 5 minutes), Ollama unloads the model to free RAM. The next request triggers a reload. This is the key memory management strategy that allows running multiple models on one machine - only the actively used model occupies RAM at any given time.
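The TTL can also be overridden per request with the keep_alive field. A minimal sketch against the local API (the model name is just an example; keep_alive accepts duration strings, 0, or -1):

# Sketch: override the default 5-minute TTL for a single request.
# keep_alive accepts a duration string ("30m", "1h"), 0 (unload right
# after responding), or -1 (keep loaded until explicitly unloaded).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",          # example model
        "prompt": "Reply with one word: ready?",
        "stream": False,
        "keep_alive": "30m",          # keep this model warm for 30 minutes
    },
    timeout=120,
)
print(resp.json()["response"])

# /api/ps reports what is loaded now and when it will expire
print(requests.get("http://localhost:11434/api/ps", timeout=5).json())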

The Modelfile Format

A Modelfile is a plain text file that defines a model configuration. The syntax borrows from Dockerfile:

# Minimum valid Modelfile - just sets a base model
FROM llama3.2

# Add a system prompt
SYSTEM """
You are a senior software engineer at a fintech company.
You specialize in Python backend systems and help teammates
debug code, review architecture decisions, and explain
complex technical concepts clearly.
"""

# Set inference parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_predict 2048
PARAMETER num_ctx 8192

# Set the prompt template (usually not needed - base model template is correct)
# Only override if you know what you're doing

Modelfile instructions:

FROM - required. Specifies the base model. Can be a model name from the Ollama library (FROM llama3.2), a local GGUF file path (FROM ./models/custom-model.gguf), or another Ollama model (FROM my-base-model).

SYSTEM - the system prompt that precedes every conversation. The most commonly customized instruction. Defines the AI's persona, constraints, and knowledge context.

PARAMETER - sets inference hyperparameters. Common ones:

  • temperature (0.0-2.0): controls randomness. 0 = deterministic, 1.0 = default, 2.0 = very random
  • top_p (0-1): nucleus sampling threshold. Only sample from tokens comprising the top p probability mass
  • top_k (1-100): only sample from the top k most probable tokens
  • num_predict (-1 for unlimited, or a positive integer): max tokens to generate per response
  • num_ctx (512-128000): context window size. Larger values use more RAM
  • repeat_penalty (1.0-2.0): penalizes repeated tokens. Default 1.1, increase to 1.3+ for repetitive models
  • stop: a token string where generation stops. Can be specified multiple times for multiple stop tokens

TEMPLATE - the full prompt template including turn markers. Usually inherited from the base model's GGUF metadata. Only override when building models from raw base models or when the default template is wrong.

LICENSE - declares the license for custom model distributions.

MESSAGE - injects pre-set conversation turns. Rarely used, but lets you prime the conversation with example exchanges.
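To make the less common instructions concrete, here is a minimal sketch of a Modelfile that combines stop sequences with MESSAGE priming (the base model, prompts, and strings are illustrative only):

# Sketch: Modelfile combining stop sequences and MESSAGE priming
FROM llama3.2

SYSTEM """
You answer in one short sentence.
"""

PARAMETER temperature 0.2
# stop may be given multiple times, once per stop string
PARAMETER stop "###"
PARAMETER stop "END_OF_ANSWER"

# Pre-seeded example exchange (role followed by content)
MESSAGE user What file format does llama.cpp use for model weights?
MESSAGE assistant llama.cpp uses the GGUF single-file format.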

The REST API

Ollama exposes a REST API at http://localhost:11434. There are two flavors of API: Ollama's native format and the OpenAI-compatible format.

Native Ollama API:

POST /api/generate - raw text generation (non-chat)
POST /api/chat - chat with message history
POST /api/embeddings - generate text embeddings
POST /api/pull - pull a model
POST /api/push - push a model to registry
POST /api/create - create model from Modelfile
DELETE /api/delete - delete a model
GET /api/tags - list local models
POST /api/show - show model info
POST /api/copy - copy a model with a new name

OpenAI-compatible API (added in early 2024):

POST /v1/chat/completions
POST /v1/completions
POST /v1/embeddings
GET /v1/models

The OpenAI-compatible endpoints make Ollama a drop-in replacement for the OpenAI API in existing codebases. Change the base URL and API key, keep everything else the same.
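The difference between the two flavors is easiest to see side by side. A small sketch using the requests library (the model name is illustrative; both endpoints assume a default local install on port 11434):

# Sketch: the same question through both API flavors.
import requests

BASE = "http://localhost:11434"
question = [{"role": "user", "content": "One-sentence summary of GGUF?"}]

# Native API: Ollama-shaped response with extra metadata
native = requests.post(
    f"{BASE}/api/chat",
    json={"model": "llama3.2", "messages": question, "stream": False},
    timeout=120,
).json()
print(native["message"]["content"])
print("tokens generated:", native.get("eval_count"))

# OpenAI-compatible API: same shape as responses from api.openai.com
compat = requests.post(
    f"{BASE}/v1/chat/completions",
    json={"model": "llama3.2", "messages": question},
    timeout=120,
).json()
print(compat["choices"][0]["message"]["content"])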


Setting Up Ollama - Installation and First Run

Installation

# macOS and Linux (single command)
curl -fsSL https://ollama.com/install.sh | sh

# macOS via Homebrew
brew install ollama

# Windows: download installer from https://ollama.com/download
# Linux manual install: the install.sh script handles systemd service setup

# Verify installation
ollama --version

On macOS, Ollama installs as a menu bar application. On Linux, the install script creates a systemd service that starts automatically. You can also run it manually:

# Start Ollama server manually (useful for custom configuration)
OLLAMA_HOST=0.0.0.0:11434 ollama serve

Pulling and Running Models

# Pull a model (downloads to ~/.ollama/models)
ollama pull llama3.2 # 3B parameter model, ~2GB
ollama pull llama3.2:1b # 1B parameter variant
ollama pull llama3.1:8b # 8B parameter, ~5GB
ollama pull mistral:7b # Mistral 7B
ollama pull qwen2.5:14b # Qwen 2.5 14B, ~9GB
ollama pull codellama:13b # Code-specialized model
ollama pull nomic-embed-text # Embedding model, ~274MB
ollama pull mxbai-embed-large # High-quality embedding model

# Run a model interactively in the terminal
ollama run llama3.2
ollama run mistral

# Inside the interactive session:
# /help - show commands
# /bye - exit
# /clear - clear conversation history
# /set parameter temperature 0.5 - change parameter mid-conversation

# Single-shot inference (no interactive session)
ollama run llama3.2 "What is the GGUF file format?"

# Pipe input from file
cat document.txt | ollama run llama3.2 "Summarize this document in 3 bullet points"

# List downloaded models
ollama list

# Remove a model
ollama rm llama3.2:latest

# Show model details (parameters, template, system prompt)
ollama show llama3.2
ollama show --modelfile llama3.2 # show the Modelfile

Creating Custom Models with Modelfile

Basic Custom Model

# Create a Modelfile for a support bot
cat > Modelfile.support << 'EOF'
FROM llama3.2

SYSTEM """
You are a technical support specialist for DataPipeline Pro, a data engineering platform.

Your role:
- Answer questions about DataPipeline Pro features and configuration
- Help users debug pipeline errors using the error messages they provide
- Escalate to human support by saying "ESCALATE:" when the issue requires account access
- Always ask for the relevant error message before attempting to diagnose

You do not:
- Discuss topics unrelated to DataPipeline Pro
- Provide information about competitor products
- Make promises about roadmap items or release dates

Tone: professional, concise, helpful. No jargon unless the user uses it first.
"""

PARAMETER temperature 0.2
PARAMETER num_ctx 4096
PARAMETER num_predict 1024
PARAMETER repeat_penalty 1.1
EOF

# Create the model in Ollama
ollama create support-bot -f Modelfile.support

# Run it
ollama run support-bot

# List to confirm it's there
ollama list

Model from a Local GGUF File

# If you have a fine-tuned model or a GGUF not in the Ollama library
cat > Modelfile.custom << 'EOF'
FROM ./models/my-finetuned-model-Q4_K_M.gguf

SYSTEM """
You are a specialized assistant trained on internal documentation.
"""

PARAMETER temperature 0.1
PARAMETER num_ctx 8192
EOF

ollama create internal-assistant -f Modelfile.custom
ollama run internal-assistant

Code Review Bot

cat > Modelfile.codereview << 'EOF'
FROM qwen2.5-coder:7b

SYSTEM """
You are a senior engineer performing code reviews. When given code:

1. Identify bugs and logic errors (mark as BUG:)
2. Flag security issues (mark as SECURITY:)
3. Note performance concerns (mark as PERF:)
4. Suggest style improvements (mark as STYLE:)
5. Praise good patterns (mark as GOOD:)

Be specific. Reference line numbers when possible. Don't comment on formatting
unless it causes readability issues. Be direct - no filler phrases.
"""

PARAMETER temperature 0.1
PARAMETER num_ctx 16384
PARAMETER num_predict 4096

EOF

ollama create code-reviewer -f Modelfile.codereview

Python Integration

Using the Ollama Python Library

# Install the client library first: pip install ollama
import ollama

# Simple text generation
response = ollama.generate(
    model="llama3.2",
    prompt="Explain gradient descent in one paragraph.",
)
print(response["response"])

# Chat completion
response = ollama.chat(
    model="llama3.2",
    messages=[
        {
            "role": "user",
            "content": "What is the difference between a process and a thread?"
        }
    ]
)
print(response["message"]["content"])

# Multi-turn conversation
messages = [
    {"role": "system", "content": "You are a Python expert."},
    {"role": "user", "content": "How do I reverse a list in Python?"},
]

response = ollama.chat(model="llama3.2", messages=messages)
print(response["message"]["content"])

# Append assistant response to maintain history
messages.append(response["message"])

# Follow-up question
messages.append({"role": "user", "content": "What about reversing a dictionary by its values?"})
response = ollama.chat(model="llama3.2", messages=messages)
print(response["message"]["content"])

Streaming Responses

import ollama

# Stream tokens as they're generated
print("Response: ", end="", flush=True)
for chunk in ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a Python function to parse JSON safely."}],
    stream=True
):
    content = chunk["message"]["content"]
    print(content, end="", flush=True)
print()  # newline after completion

Generating Embeddings

import ollama
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_np = np.array(a)
    b_np = np.array(b)
    return float(np.dot(a_np, b_np) / (np.linalg.norm(a_np) * np.linalg.norm(b_np)))

# Generate embeddings
docs = [
    "Python is a high-level programming language known for readability.",
    "Rust provides memory safety without garbage collection.",
    "The Eiffel Tower is located in Paris, France.",
    "Machine learning models learn patterns from data.",
]

embeddings = []
for doc in docs:
    result = ollama.embeddings(model="nomic-embed-text", prompt=doc)
    embeddings.append(result["embedding"])

# Semantic similarity search
query = "What programming languages are safe from memory errors?"
query_embedding = ollama.embeddings(
    model="nomic-embed-text",
    prompt=query
)["embedding"]

similarities = [
    (docs[i], cosine_similarity(query_embedding, embeddings[i]))
    for i in range(len(docs))
]
similarities.sort(key=lambda x: x[1], reverse=True)

print(f"Query: {query}\n")
print("Ranked by similarity:")
for doc, score in similarities:
    print(f"  [{score:.3f}] {doc}")

Using the OpenAI SDK with Ollama

This is the most important integration pattern because it requires zero changes to existing code that already uses the OpenAI SDK - just change the base_url.

from openai import OpenAI

# Point OpenAI client at local Ollama
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required by the SDK but value is ignored by Ollama
)

# Exact same code as OpenAI API - no other changes needed
response = client.chat.completions.create(
    model="llama3.2",  # Ollama model name instead of "gpt-4"
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is RAG in the context of LLMs?"}
    ],
    max_tokens=500,
    temperature=0.7
)

print(response.choices[0].message.content)

# Streaming also works
stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Count from 1 to 5 with one word descriptions."}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

Building a Simple RAG Pipeline with Ollama

import ollama
import numpy as np

class SimpleRAG:
    """
    Minimal RAG implementation using Ollama for both
    embedding generation and response generation.
    """

    def __init__(
        self,
        embed_model: str = "nomic-embed-text",
        chat_model: str = "llama3.2"
    ):
        self.embed_model = embed_model
        self.chat_model = chat_model
        self.documents: list[str] = []
        self.embeddings: list[list[float]] = []

    def add_document(self, text: str, chunk_size: int = 500):
        """Chunk a document and add to the index."""
        # Simple word-boundary chunking
        words = text.split()
        chunks = []
        current_chunk = []
        current_len = 0

        for word in words:
            current_chunk.append(word)
            current_len += len(word) + 1
            if current_len >= chunk_size:
                chunks.append(" ".join(current_chunk))
                current_chunk = []
                current_len = 0

        if current_chunk:
            chunks.append(" ".join(current_chunk))

        for chunk in chunks:
            result = ollama.embeddings(model=self.embed_model, prompt=chunk)
            self.documents.append(chunk)
            self.embeddings.append(result["embedding"])

        print(f"Added {len(chunks)} chunks from document.")

    def retrieve(self, query: str, top_k: int = 3) -> list[str]:
        """Return top_k most similar chunks to query."""
        q_embed = ollama.embeddings(
            model=self.embed_model,
            prompt=query
        )["embedding"]

        q_np = np.array(q_embed)
        similarities = []
        for i, emb in enumerate(self.embeddings):
            e_np = np.array(emb)
            score = float(np.dot(q_np, e_np) / (np.linalg.norm(q_np) * np.linalg.norm(e_np)))
            similarities.append((score, self.documents[i]))

        similarities.sort(reverse=True)
        return [doc for _, doc in similarities[:top_k]]

    def query(self, question: str) -> str:
        """Retrieve context and generate answer."""
        context_chunks = self.retrieve(question)
        context = "\n\n---\n\n".join(context_chunks)

        prompt = f"""Answer the question using only the provided context.
If the context doesn't contain enough information, say so clearly.

Context:
{context}

Question: {question}

Answer:"""

        response = ollama.generate(
            model=self.chat_model,
            prompt=prompt,
            options={"temperature": 0.1, "num_predict": 512}
        )
        return response["response"]


# Usage
rag = SimpleRAG()

# Index some content
rag.add_document("""
Ollama is a tool for running large language models locally.
It was created by Jeffrey Morgan and released in 2023.
Ollama uses llama.cpp as its inference backend and exposes
an OpenAI-compatible REST API at localhost port 11434.
Models are stored in ~/.ollama/models as GGUF files.
""")

answer = rag.query("Who created Ollama and when?")
print(answer)


Setting Up the Full Local AI Stack: Ollama + Open WebUI

Open WebUI is a browser-based chat interface for Ollama, similar to ChatGPT's UI but running entirely locally. Setting it up for a team is a common production use case.

Docker Compose Stack

# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=your-secret-key-here-change-this
      - ENABLE_SIGNUP=false  # disable public registration
    depends_on:
      ollama:
        condition: service_healthy
    restart: unless-stopped

volumes:
  ollama_data:
  open_webui_data:

# Start the stack
docker compose up -d

# Pull initial models (run after containers start)
docker exec ollama ollama pull llama3.1:8b
docker exec ollama ollama pull nomic-embed-text

# Check logs
docker compose logs -f ollama
docker compose logs -f open-webui

# Access:
# Open WebUI: http://localhost:3000
# Ollama API: http://localhost:11434

Without Docker (Direct Install)

# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# 2. Configure Ollama to accept connections from all interfaces
# (needed when Open WebUI runs on a different container/machine)
sudo systemctl edit ollama.service
# Add under [Service]:
# Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl daemon-reload
sudo systemctl restart ollama

# 3. Install Open WebUI via pip
pip install open-webui

# 4. Start Open WebUI
OLLAMA_BASE_URL=http://localhost:11434 open-webui serve

# Or with uvicorn for production
OLLAMA_BASE_URL=http://localhost:11434 uvicorn open_webui.main:app \
--host 0.0.0.0 \
--port 3000 \
--workers 2

# 5. Pull models
ollama pull llama3.1:8b
ollama pull nomic-embed-text

Nginx Configuration for Team Access

# /etc/nginx/sites-available/ai-internal
server {
    listen 443 ssl;
    server_name ai.internal.yourcompany.com;

    ssl_certificate /etc/ssl/certs/internal.crt;
    ssl_certificate_key /etc/ssl/private/internal.key;

    # Open WebUI - browser interface
    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Required for streaming responses
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;

        # SSE (Server-Sent Events) support
        proxy_set_header Connection '';
        proxy_http_version 1.1;
        chunked_transfer_encoding on;
    }

    # Ollama API - for programmatic access (internal only)
    location /api/ollama/ {
        # Restrict to internal network
        allow 10.0.0.0/8;
        deny all;

        rewrite ^/api/ollama/(.*) /$1 break;
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}

Memory Management and Multi-Model Strategy

Understanding how Ollama manages memory is critical for running a multi-model setup without running out of RAM.

Model Load/Unload Behavior

# Check which models are currently loaded in memory
curl http://localhost:11434/api/ps

# Response:
# {
# "models": [
# {
# "name": "llama3.2:latest",
# "model": "llama3.2:latest",
# "size": 2019993600,
# "digest": "a80c4f17acd5...",
# "details": {...},
# "expires_at": "2024-01-15T14:35:22.123Z",
# "size_vram": 2019993600
# }
# ]
# }

import requests
import json

def check_loaded_models():
    """Check which Ollama models are currently in memory."""
    response = requests.get("http://localhost:11434/api/ps")
    data = response.json()
    models = data.get("models", [])

    if not models:
        print("No models currently loaded in memory.")
        return

    print(f"Models in memory ({len(models)}):")
    for model in models:
        size_gb = model["size"] / (1024**3)
        vram_gb = model["size_vram"] / (1024**3)
        print(f"  {model['name']}")
        print(f"    Total RAM: {size_gb:.1f} GB")
        print(f"    VRAM: {vram_gb:.1f} GB")
        print(f"    Expires: {model['expires_at']}")

def preload_model(model_name: str, keep_alive: str = "10m"):
    """
    Preload a model into memory and keep it there.
    keep_alive: duration string like "10m", "1h", "-1" (never unload)
    """
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model_name,
            "keep_alive": keep_alive,
            "prompt": ""  # empty prompt just loads the model
        }
    )
    print(f"Preloaded {model_name} with keep_alive={keep_alive}")

def unload_model(model_name: str):
    """Force unload a model from memory."""
    requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model_name,
            "keep_alive": "0"  # 0 duration = unload immediately
        }
    )
    print(f"Unloaded {model_name}")

# Practical use: preload your primary model at startup
preload_model("llama3.1:8b", keep_alive="-1")  # never auto-unload

VRAM Planning for Multiple Models

A machine with 16GB unified memory (M2 Pro) can handle:

| Scenario          | Models Loaded                            | RAM Used | Remaining |
|-------------------|------------------------------------------|----------|-----------|
| Single chat       | llama3.1:8b Q4_K_M                       | 5.5 GB   | 10.5 GB   |
| Chat + embeddings | llama3.1:8b + nomic-embed                | 5.8 GB   | 10.2 GB   |
| Chat + code       | llama3.1:8b + codellama:7b               | 10.5 GB  | 5.5 GB    |
| All three         | llama3.1:8b + codellama:7b + nomic-embed | 10.8 GB  | 5.2 GB    |

For a 16GB machine, running three models simultaneously is viable if you keep context windows small (4K). For longer contexts or larger models, use the TTL-based unloading by setting appropriate keep_alive values per request.


Ollama vs. llama.cpp vs. LM Studio - When to Use Each

Ollama

Best for:

  • Teams that need a shared, consistent setup
  • Integrating local LLMs into Python/Node applications via API
  • Production internal deployments with a web UI (Open WebUI)
  • Anyone who wants model management without manual GGUF handling
  • CI/CD pipelines where you need a consistent inference service

Limitations:

  • Less control over llama.cpp inference parameters (not all flags exposed)
  • One model loaded at a time by default (TTL-based switching)
  • No fine-grained GPU layer control (Ollama decides how to use GPU)
  • Windows support (including GPU acceleration) is newer and less mature than on macOS/Linux

llama.cpp (direct)

Best for:

  • Maximum performance tuning (control every inference parameter)
  • Unusual hardware configurations requiring specific flags
  • Converting and quantizing models from Hugging Face format
  • Embedding into applications where you don't want a separate service

Limitations:

  • Requires compilation for your hardware
  • No model management (you handle GGUF files yourself)
  • Steeper learning curve for configuration

LM Studio

Best for:

  • Non-technical users who want a GUI
  • Trying out many different models quickly
  • Windows users who want GPU support (LM Studio's Windows GPU support is mature)
  • One-person setups where a desktop app is preferable to a CLI

Limitations:

  • Not suitable for server or headless deployment
  • Limited scripting and automation compared to a CLI-first tool
  • Closed source (unlike Ollama and llama.cpp)

The decision tree for most engineering teams: start with Ollama. If you hit a performance or control ceiling, switch the specific model to llama.cpp while keeping Ollama for everything else. Use LM Studio only on Windows for exploratory testing by non-technical users.


Production Engineering Notes

Running Ollama as a Systemd Service

# Check the default systemd service (installed automatically on Linux)
systemctl status ollama

# Customize the service
sudo systemctl edit ollama

# Add environment variables under [Service]:
# Environment="OLLAMA_HOST=0.0.0.0:11434"
# Environment="OLLAMA_NUM_PARALLEL=4"
# Environment="OLLAMA_MAX_LOADED_MODELS=3"
# Environment="OLLAMA_KEEP_ALIVE=10m"

sudo systemctl daemon-reload
sudo systemctl restart ollama

Key environment variables:

| Variable                 | Default         | Purpose                                                |
|--------------------------|-----------------|--------------------------------------------------------|
| OLLAMA_HOST              | 127.0.0.1:11434 | Bind address. Set to 0.0.0.0:11434 for network access  |
| OLLAMA_NUM_PARALLEL      | 1               | Max concurrent requests per model                      |
| OLLAMA_MAX_LOADED_MODELS | 1               | Max models in memory simultaneously                    |
| OLLAMA_KEEP_ALIVE        | 5m              | Default model TTL after last request                   |
| OLLAMA_MAX_QUEUE         | 512             | Max queued requests before returning 503               |
| OLLAMA_DEBUG             | false           | Enable debug logging                                   |
| OLLAMA_FLASH_ATTENTION   | 0               | Enable flash attention (experimental)                  |

Health Monitoring and Alerting

#!/usr/bin/env python3
"""
Ollama health monitor - checks model availability and response latency.
Useful for production deployments.
"""

import time
import requests
from dataclasses import dataclass
from typing import Optional

@dataclass
class HealthResult:
    healthy: bool
    response_time_ms: float
    error: Optional[str] = None

def check_ollama_health(
    base_url: str = "http://localhost:11434",
    model: str = "llama3.2",
    timeout: int = 30
) -> HealthResult:
    """Check that Ollama is up and can generate tokens."""
    start = time.time()

    try:
        # First check: API is reachable
        tags_response = requests.get(
            f"{base_url}/api/tags",
            timeout=5
        )
        if tags_response.status_code != 200:
            return HealthResult(
                healthy=False,
                response_time_ms=(time.time() - start) * 1000,
                error=f"Tags endpoint returned {tags_response.status_code}"
            )

        # Second check: model exists
        models = [m["name"] for m in tags_response.json().get("models", [])]
        model_base = model.split(":")[0]
        if not any(m.startswith(model_base) for m in models):
            return HealthResult(
                healthy=False,
                response_time_ms=(time.time() - start) * 1000,
                error=f"Model {model} not found. Available: {models}"
            )

        # Third check: generation works
        gen_response = requests.post(
            f"{base_url}/api/generate",
            json={
                "model": model,
                "prompt": "ping",
                "stream": False,
                "options": {"num_predict": 3}
            },
            timeout=timeout
        )

        if gen_response.status_code != 200:
            return HealthResult(
                healthy=False,
                response_time_ms=(time.time() - start) * 1000,
                error=f"Generate returned {gen_response.status_code}"
            )

        elapsed_ms = (time.time() - start) * 1000
        return HealthResult(healthy=True, response_time_ms=elapsed_ms)

    except requests.exceptions.ConnectionError:
        return HealthResult(
            healthy=False,
            response_time_ms=(time.time() - start) * 1000,
            error="Connection refused - Ollama not running"
        )
    except requests.exceptions.Timeout:
        return HealthResult(
            healthy=False,
            response_time_ms=(time.time() - start) * 1000,
            error=f"Timeout after {timeout}s - model may be loading"
        )

# Run check
result = check_ollama_health()
print(f"Healthy: {result.healthy}")
print(f"Response time: {result.response_time_ms:.0f}ms")
if result.error:
    print(f"Error: {result.error}")

Automated Model Updates

#!/bin/bash
# update-models.sh - Pull latest versions of all installed models
# Run weekly via cron: 0 3 * * 0 /usr/local/bin/update-models.sh

OLLAMA_BIN="/usr/bin/ollama"
LOG_FILE="/var/log/ollama-updates.log"

echo "$(date) - Starting model update" >> "$LOG_FILE"

# Get list of installed models
MODELS=$("$OLLAMA_BIN" list | tail -n +2 | awk '{print $1}')

for model in $MODELS; do
    echo "$(date) - Updating $model" >> "$LOG_FILE"
    "$OLLAMA_BIN" pull "$model" >> "$LOG_FILE" 2>&1
    if [ $? -eq 0 ]; then
        echo "$(date) - Updated $model successfully" >> "$LOG_FILE"
    else
        echo "$(date) - Failed to update $model" >> "$LOG_FILE"
    fi
done

echo "$(date) - Model update complete" >> "$LOG_FILE"

Common Mistakes

:::danger Exposing Ollama to the internet without authentication By default, Ollama binds to 127.0.0.1 - accessible only from localhost. If you set OLLAMA_HOST=0.0.0.0, the API becomes accessible to anyone who can reach that port on your network, including the internet if your firewall isn't configured correctly.

Ollama has no built-in authentication. Anyone who can reach the API can run models, pull new models onto your disk, delete models, and enumerate what models are installed.

Fix: put a reverse proxy with authentication (nginx + basic auth, or Authelia for SSO) in front of Ollama. Never expose port 11434 directly to the internet. If running on a cloud instance, use a firewall rule to block 11434 from external access. :::
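As one concrete illustration, a location block along these lines (the htpasswd path is a placeholder; create the file with the standard htpasswd tool) adds basic auth in front of the Ollama port, in the same style as the nginx config shown earlier:

# Sketch: basic auth in front of the Ollama API (htpasswd path is a
# placeholder; create it with: htpasswd -c /etc/nginx/.htpasswd-ollama youruser)
location /api/ollama/ {
    auth_basic "Internal AI";
    auth_basic_user_file /etc/nginx/.htpasswd-ollama;

    rewrite ^/api/ollama/(.*) /$1 break;
    proxy_pass http://127.0.0.1:11434;
    proxy_buffering off;
    proxy_read_timeout 300s;
}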

:::danger Sending sensitive data to cloud APIs after testing locally A common pattern: develop and test a feature locally with Ollama pointing to localhost:11434, then deploy to production and accidentally point the client at api.openai.com because someone copied the wrong config value.

If your application handles PII, financial records, or confidential business data, a configuration mistake that redirects traffic to the cloud API is a potential data breach.

Fix: make the LLM endpoint URL a required environment variable with no default. Fail loudly at startup if it's not set. Add a check that logs a warning if the configured URL is an external service for deployments marked as "private data". :::
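A minimal sketch of that startup check (the variable names are illustrative, not a convention any framework enforces):

# Sketch: fail fast when the endpoint isn't explicitly configured, and
# warn if a private-data deployment points at an external host.
import os
import sys
from urllib.parse import urlparse

LLM_BASE_URL = os.environ.get("LLM_BASE_URL")          # no default on purpose
PRIVATE_DATA = os.environ.get("PRIVATE_DATA_DEPLOYMENT", "false").lower() == "true"

if not LLM_BASE_URL:
    sys.exit("LLM_BASE_URL is not set - refusing to start without an explicit endpoint")

host = urlparse(LLM_BASE_URL).hostname or ""
if PRIVATE_DATA and host not in {"localhost", "127.0.0.1"} and not host.endswith(".internal"):
    print(f"WARNING: private-data deployment is configured to call external host {host}")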

:::warning Model keep_alive of -1 starves other workloads of memory Setting keep_alive: -1 on a model keeps it in memory indefinitely. If your server also runs databases, application servers, or other memory-intensive workloads, permanently pinning a 5-8GB model in memory can cause OOM conditions for other processes.

Use a reasonable TTL like 30m for infrequently used models, and reserve permanent pinning (-1) only for the model that gets queried continuously. Alternatively, set OLLAMA_MAX_LOADED_MODELS=1 so Ollama automatically unloads the previous model when a different one is requested. :::

:::warning Modelfile SYSTEM prompts are not security boundaries If you create a model with SYSTEM "You are a safe assistant that never discusses X", users can still jailbreak the system prompt through prompt injection in their messages, or by sending API requests with the system field overriding your Modelfile default.

The SYSTEM instruction in a Modelfile sets a default system prompt, but anyone calling the API directly can override it with their own system message. A Modelfile is a convenience tool for default behavior, not an enforcement mechanism.

Fix: if you need to enforce safety constraints, implement them at the application layer: sanitize user inputs before sending to the model, check model outputs before returning to users, and log all requests for audit purposes. :::

:::warning Context window accumulation in chat applications causes slow drift In a chat application, you accumulate conversation history in the messages array and send the full history with each request. As the conversation grows, three things happen: (1) requests get slower because the model re-processes the full history, (2) memory usage grows as the KV cache expands, and (3) the model starts forgetting early instructions due to attention dilution.

Fix: implement a sliding window that keeps the system prompt, the last N turns, and optionally a summarized version of older history. A simple strategy: once the conversation exceeds 80% of num_ctx, summarize the oldest half and replace it with "Previous conversation summary: [summary]". :::
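A rough sketch of that sliding-window strategy, assuming roughly four characters per token and reusing the chat model for summarization (thresholds are heuristics, not tuned values):

# Sketch: sliding-window history trimming with summarization of the
# oldest half once the conversation approaches the context budget.
import ollama

def trim_history(messages: list[dict], num_ctx: int = 8192, model: str = "llama3.2") -> list[dict]:
    budget_chars = int(num_ctx * 4 * 0.8)                 # ~80% of the context window
    if sum(len(m["content"]) for m in messages) <= budget_chars:
        return messages

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    half = len(rest) // 2
    old, recent = rest[:half], rest[half:]

    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = ollama.generate(
        model=model,
        prompt=f"Summarize this conversation in a few sentences:\n\n{transcript}",
        options={"temperature": 0.1, "num_predict": 256},
    )["response"]

    return system + [{"role": "system", "content": f"Previous conversation summary: {summary}"}] + recent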


Interview Q&A

Q1: How does Ollama manage multiple models and what happens to memory when you switch between them?

Ollama uses a time-to-live (TTL) based model lifecycle. When a request comes in for a model, Ollama loads that model into RAM (and GPU VRAM if available) and starts a timer. After the TTL period (default 5 minutes) with no new requests, the model is unloaded - its memory is freed. If a new request comes in for an already-loaded model, the timer resets. This means on a machine with enough RAM for one model, Ollama automatically handles model switching: the old model unloads when idle and the new model loads when needed. The tradeoff is a "cold start" latency every time a model needs to load. For production systems with predictable traffic patterns, you can preload critical models with keep_alive: -1 to pin them permanently in memory. The OLLAMA_MAX_LOADED_MODELS environment variable sets a hard limit on simultaneous loaded models, preventing OOM conditions on constrained hardware.

Q2: What is the difference between Ollama's native API and its OpenAI-compatible API? When would you use each?

Ollama's native API (/api/chat, /api/generate) is designed around Ollama's feature set: it supports streaming via newline-delimited JSON chunks, has Ollama-specific fields like keep_alive and context for manual KV cache reuse, and its response format includes metadata like eval_count, eval_duration, and load_duration that OpenAI's API doesn't expose. The OpenAI-compatible API (/v1/chat/completions) is a translation layer that maps OpenAI's request/response format to Ollama's internals. Use the native API when you're building a new application specifically for Ollama - you get more control and richer metadata. Use the OpenAI-compatible API when you have existing code that uses the OpenAI SDK, when your team already knows the OpenAI API format, or when you want to swap between local Ollama and the real OpenAI API based on an environment variable.

Q3: How would you configure Ollama for a production deployment serving 50 internal users with mixed workloads - some doing long document analysis, some doing quick Q&A?

Several layers of configuration are needed. First, hardware: a server with at least 32GB RAM and an NVIDIA GPU with 24GB VRAM or Apple Silicon Mac with 64GB unified memory. Second, model selection: run at least two models - a capable 13-14B model at Q4_K_M for document analysis (loaded with keep_alive: -1) and a fast 3-7B model for quick Q&A. Third, Ollama configuration: set OLLAMA_NUM_PARALLEL=4 to handle up to 4 concurrent requests per model, OLLAMA_MAX_LOADED_MODELS=2 to keep both models in memory, and OLLAMA_MAX_QUEUE=100 to handle burst traffic. Fourth, context management: for document analysis workloads, set num_ctx to 16384 or higher per request, but for Q&A workloads use 4096. Fifth, put nginx in front for TLS, authentication, and rate limiting. Sixth, add monitoring: a health check endpoint that fires a test inference every 5 minutes, plus alerting when response latency exceeds a threshold. With continuous batching across 4 parallel slots, 50 internal users with typical request patterns (not all simultaneous) is very manageable.

Q4: A colleague says "Ollama just wraps llama.cpp, so it's always slower than running llama.cpp directly." Is this true?

Mostly false, with a narrow exception. Ollama does use llama.cpp as its inference backend, so the token generation speed of a running model is essentially the same - Ollama doesn't add significant overhead on the inference path. The HTTP request/response overhead adds a few milliseconds but is negligible relative to generation time. Where the statement has a grain of truth: Ollama abstracts away some fine-grained control over llama.cpp flags. For example, Ollama's GPU layer allocation is automatic and may not be perfectly optimal for your specific hardware/model combination, whereas llama-cli -ngl 28 lets you manually tune the split. A careful expert running llama.cpp directly might get 5-15% better throughput by tuning thread count, batch size, and GPU layer split for their specific setup. But for 95% of use cases, Ollama and direct llama.cpp perform identically. The management overhead savings of Ollama (no manual model file handling, automatic startup/restart, clean API) far outweigh the potential tiny performance difference.

Q5: How do you share a custom Modelfile configuration with a team so everyone uses the same model settings?

The Modelfile is a plain text file that can be committed to a git repository like any other configuration file. Store it at something like infra/ollama/Modelfile.support-bot. Include a setup script or Makefile target that runs ollama create support-bot -f infra/ollama/Modelfile.support-bot. When the Modelfile changes (new system prompt, adjusted parameters), team members pull the updated file and re-run ollama create - it overwrites the local model with the new configuration. For a CI/CD workflow: when the Modelfile changes, a pipeline stage can deploy the updated model to the shared Ollama server. The key properties that make this work are: Modelfiles are small (usually under 50 lines), human-readable, and deterministic - the same Modelfile always produces the same model configuration. The only non-deterministic element is if the base model version changes (for example if FROM llama3.2 pulls a newer version of llama3.2), which you can pin by specifying a digest: FROM llama3.2@sha256:abc123....
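A minimal setup script for that workflow might look like the sketch below (paths and names follow the example in this answer and are otherwise arbitrary):

#!/bin/bash
# Sketch: team setup script - rebuild the shared model from the versioned
# Modelfile. Paths and names follow the example in this answer.
set -euo pipefail

MODELFILE="infra/ollama/Modelfile.support-bot"
MODEL_NAME="support-bot"

ollama pull llama3.2                          # ensure the base model is present
ollama create "$MODEL_NAME" -f "$MODELFILE"
ollama show --modelfile "$MODEL_NAME"         # confirm the applied configuration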

Q6: Explain the tradeoffs between running Ollama on a shared server vs. running it on each developer's laptop.

Shared server advantages: (1) everyone accesses the same models with the same configuration, eliminating "works on my machine" model version drift; (2) can afford a larger server with more RAM/VRAM, enabling better models than any individual laptop; (3) expensive models (70B) become accessible to everyone; (4) central logging for all queries useful for debugging and compliance. Shared server disadvantages: (1) internet/VPN dependency - if the server is unreachable, no AI tools; (2) latency over the network is higher than localhost, especially for streaming responses; (3) shared resources mean slower responses during peak usage; (4) server maintenance burden on someone. Laptop advantages: (1) works offline and while traveling; (2) no contention with other users; (3) data never leaves the device, satisfying strict data privacy requirements. Laptop disadvantages: (1) limited to models that fit in laptop RAM (typically 7B-13B); (2) setup required per machine; (3) model versions can drift between team members. Practical recommendation: run Ollama on each developer's machine for offline work and privacy-sensitive queries, and also maintain a shared server with larger models for tasks that need better quality or longer contexts.


