
Module 2: Running Models Locally

Local inference has gone from an academic curiosity to a practical workflow tool. llama.cpp runs a quantized Llama 3 8B at roughly 20-30 tokens per second on an Apple Silicon MacBook Pro. Ollama wraps llama.cpp in a clean API that mirrors OpenAI's interface, so most existing client code works with little more than a base-URL change. The engineering friction of running open-source models locally is now genuinely low.
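
As a minimal sketch of that compatibility, the standard OpenAI Python client can be pointed at Ollama's OpenAI-compatible endpoint. This assumes Ollama is running locally on its default port and that a model (here "llama3") has already been pulled; adjust the model name to whatever you have installed.

```python
# Minimal sketch: reuse the OpenAI Python client against a local Ollama server.
# Assumes `ollama serve` is running and `ollama pull llama3` has been done.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
)
print(response.choices[0].message.content)
```

The same script talks to the hosted OpenAI API or to a local model depending only on the base URL, which is what makes the local-for-development, hosted-or-vLLM-for-production workflow low-friction.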

Why run locally? Faster development iteration (no network latency or rate limits), zero marginal cost per token, privacy for sensitive data, and the ability to test model behavior before deploying. Most engineers working with open-source models run locally for development and switch to vLLM or TGI for production serving.

The Local Inference Stack

Lessons in This Module

  1. llama.cpp Architecture and Setup - GGUF format, quantization levels, build and run
  2. Ollama Deep Dive - Modelfile, API, model management, customization
  3. LM Studio and Desktop Inference - GUI workflow, model discovery, local server
  4. Hardware Requirements by Model Size - VRAM/RAM rules of thumb, CPU vs GPU inference
  5. Context Length and Memory Tradeoffs - KV cache sizing locally, context window limits
  6. Prompt Templates by Model Family - Llama, Mistral, ChatML, and why templates matter
  7. Benchmarking Local Inference Speed - Tokens/sec, time-to-first-token, comparing setups
  8. Local Models in Production APIs - Running Ollama as a production API, load balancing

Key Concepts You Will Master

  • GGUF quantization levels - Q4_0, Q4_K_M, Q8_0 - the quality/speed/memory tradeoff at each level
  • Memory calculation for local inference - the formula: model params × bytes/param + KV cache overhead (see the first sketch after this list)
  • Prompt template compliance - why using the wrong prompt format degrades model performance (see the second sketch after this list)
  • CPU inference - when CPU-only inference is acceptable and how to optimize it
  • Ollama API compatibility - how to use Ollama as a drop-in replacement for OpenAI API calls
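
The memory rule of thumb can be made concrete with a back-of-the-envelope calculation. The bytes-per-parameter figures for each quantization level and the Llama-3-8B-style dimensions below (32 layers, 8 KV heads, head dim 128, fp16 KV cache) are rough illustrative assumptions, not exact file sizes.

```python
# Back-of-the-envelope memory estimate for local inference:
# total ~= model params x bytes/param + KV cache overhead.
# All constants below are approximate and for illustration only.

BYTES_PER_PARAM = {      # rough effective bytes per weight for common GGUF quants
    "F16": 2.0,
    "Q8_0": 1.06,
    "Q4_K_M": 0.59,
    "Q4_0": 0.56,
}

def weight_bytes(n_params: float, quant: str) -> float:
    return n_params * BYTES_PER_PARAM[quant]

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: float = 2.0) -> float:
    # 2x for keys and values, one cache entry per layer per token
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

GiB = 1024 ** 3
params = 8e9                                   # ~8B parameter model
weights = weight_bytes(params, "Q4_K_M")
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)

print(f"weights  ~{weights / GiB:.1f} GiB")    # ~4.4 GiB at Q4_K_M
print(f"kv cache ~{kv / GiB:.1f} GiB")         # ~1.0 GiB at 8k context, fp16 cache
print(f"total    ~{(weights + kv) / GiB:.1f} GiB, plus runtime overhead")
```

This is why an 8B model at Q4_K_M fits comfortably in 8 GB of RAM at short contexts but starts to squeeze the same machine once you push the context window up.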
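
The second sketch shows why prompt templates matter: the same user message rendered in two different chat formats. The token strings follow the commonly published Llama 3 and ChatML conventions, but treat them as illustrative; the model card for your specific checkpoint is authoritative, and tools like Ollama apply the correct template for you via the Modelfile.

```python
# Illustration: one user message, two chat template conventions.
user_msg = "List three GGUF quantization levels."

# Llama 3 instruct-style template (sketch)
llama3_prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    f"{user_msg}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

# ChatML-style template, used by several other model families (sketch)
chatml_prompt = (
    f"<|im_start|>user\n{user_msg}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

print(llama3_prompt)
print(chatml_prompt)

# Feeding the ChatML string to a Llama 3 checkpoint (or vice versa) means the
# model never sees the control tokens it was fine-tuned on, which typically
# shows up as rambling answers, missing stop behavior, or role confusion.
```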

Prerequisites

  • macOS, Linux, or Windows with WSL2
  • 8GB RAM minimum (16GB+ recommended for 7B models)
  • Optional: NVIDIA or Apple Silicon GPU for faster inference