Module 2: Running Models Locally
Local inference has gone from an academic curiosity to a practical workflow tool. llama.cpp runs Llama 3 8B at roughly 20-30 tokens/second on an Apple Silicon MacBook Pro. Ollama wraps llama.cpp in a clean local API and also exposes an OpenAI-compatible endpoint, so most existing client code works after changing only the base URL and model name. The engineering friction of running open-source models locally is now genuinely low.
Why run locally? Faster development iteration (no network round-trips), zero marginal cost per token, privacy for sensitive data, and the ability to test model behavior before deploying. Most engineers working with open-source models run locally for development and switch to vLLM or TGI for production serving.
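To make the OpenAI compatibility concrete, here is a minimal sketch of pointing existing OpenAI-client code at a local Ollama server. It assumes Ollama is running on its default port (11434) and that you have already pulled a model tagged `llama3`; the placeholder API key is required by the client library but ignored by Ollama.

```python
# Minimal sketch: reuse existing OpenAI client code against a local Ollama server.
# Assumes Ollama is running locally and `ollama pull llama3` has already been done.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
)
print(response.choices[0].message.content)
```

Because only `base_url` and `model` change, the same code can target a hosted API in production and a local model during development.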
Lessons in This Module
| # | Lesson | Key Concepts |
|---|---|---|
| 1 | llama.cpp Architecture and Setup | GGUF format, quantization levels, build and run |
| 2 | Ollama Deep Dive | Modelfile, API, model management, customization |
| 3 | LM Studio and Desktop Inference | GUI workflow, model discovery, local server |
| 4 | Hardware Requirements by Model Size | VRAM/RAM rules of thumb, CPU vs GPU inference |
| 5 | Context Length and Memory Tradeoffs | KV cache sizing locally, context window limits |
| 6 | Prompt Templates by Model Family | Llama, Mistral, ChatML - why templates matter |
| 7 | Benchmarking Local Inference Speed | Tokens/sec, time-to-first-token, comparing setups |
| 8 | Local Models in Production APIs | Running Ollama as a production API, load balancing |
Key Concepts You Will Master
- GGUF quantization levels - Q4_0, Q4_K_M, Q8_0 - the quality/speed/memory tradeoff at each level
- Memory calculation for local inference - the formula: model params × bytes/param + KV cache overhead (worked through in the first sketch after this list)
- Prompt template compliance - why using the wrong prompt format degrades model performance (see the second sketch after this list)
- CPU inference - when CPU-only inference is acceptable and how to optimize it
- Ollama API compatibility - how to use Ollama as a drop-in replacement for OpenAI API calls
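As a worked example of the memory formula, the sketch below estimates total memory for Llama 3 8B at several GGUF quantization levels. The bytes-per-parameter figures are approximations (quantized GGUF formats carry per-block scale overhead on top of the nominal bit width), and the KV cache math assumes fp16 keys/values; the architecture numbers (32 layers, 8 KV heads under grouped-query attention, head dimension 128) are Llama 3 8B's published configuration.

```python
# Back-of-envelope memory estimate for local inference:
#   total ≈ model params × bytes/param + KV cache.
# Bytes/param values below are approximate effective sizes for GGUF quants,
# including per-block scale overhead.

BYTES_PER_PARAM = {
    "F16": 2.0,      # unquantized half precision
    "Q8_0": 1.06,    # ~8.5 bits/weight
    "Q4_K_M": 0.60,  # ~4.8 bits/weight, a common quality/size sweet spot
    "Q4_0": 0.56,    # ~4.5 bits/weight, simpler legacy 4-bit scheme
}

def model_weights_gb(n_params: float, quant: str) -> float:
    """Approximate in-memory size of the weights in GB."""
    return n_params * BYTES_PER_PARAM[quant] / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) x layers x kv_heads x head_dim x tokens."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_elem / 1e9)

# Llama 3 8B: 32 layers, 8 KV heads (grouped-query attention), head_dim 128.
weights = model_weights_gb(8e9, "Q4_K_M")          # ~4.8 GB
cache = kv_cache_gb(32, 8, 128, context_len=8192)  # ~1.1 GB at 8k context
print(f"~{weights + cache:.1f} GB total")          # ~5.9 GB: fits in 8 GB RAM, barely
```

The same model at F16 needs ~16 GB for weights alone, which is why quantization is the default for local inference.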
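And as a sketch of prompt-template compliance: the functions below build the same chat turn in Llama 3's format and in ChatML (used by several other model families, e.g. Qwen). The special tokens are taken from those families' published chat templates; feeding a Llama-formatted prompt to a ChatML model, or vice versa, is a common cause of rambling or malformed output. Note that the chat endpoints in Ollama and llama.cpp apply the correct template for you; raw completion endpoints do not.

```python
# Two chat-template formats for the same conversation turn.
# Token strings follow the published Llama 3 and ChatML chat formats.

def llama3_prompt(system: str, user: str) -> str:
    """Llama 3 instruct format: header tokens delimit each role's turn."""
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

def chatml_prompt(system: str, user: str) -> str:
    """ChatML format: im_start/im_end tokens wrap each message."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(llama3_prompt("You are concise.", "What is a KV cache?"))
print(chatml_prompt("You are concise.", "What is a KV cache?"))
```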
Prerequisites
- macOS, Linux, or Windows with WSL2
- 8GB RAM minimum (16GB+ recommended for 7B models)
- Optional: NVIDIA or Apple Silicon GPU for faster inference
