
Module 2: Running Models Locally

Local inference has gone from an academic curiosity to a practical workflow tool. llama.cpp runs a quantized Llama 3 8B at roughly 20-30 tokens per second on an Apple Silicon MacBook Pro. Ollama wraps llama.cpp in a clean API that mirrors OpenAI's interface, so most existing client code works with little more than a base-URL change. The engineering friction of running open-source models locally is now genuinely low.
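
As a minimal sketch of that compatibility, the standard OpenAI Python client can be pointed at Ollama's OpenAI-compatible endpoint. This assumes Ollama is running locally on its default port and that a model (here "llama3") has already been pulled; adjust the model name to whatever you have installed.

```python
# Minimal sketch: reuse the OpenAI Python client against a local Ollama server.
# Assumes `ollama serve` is running and `ollama pull llama3` has been done.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
)
print(response.choices[0].message.content)
```

The same script talks to the hosted OpenAI API or to a local model depending only on the base URL, which is what makes the local-for-development, hosted-or-vLLM-for-production workflow low-friction.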

Why run locally? Faster development iteration (no network latency or rate limits), zero marginal cost per token, privacy for sensitive data, and the ability to test model behavior before deploying. Most engineers working with open-source models run locally for development and switch to vLLM or TGI for production serving.

The Local Inference Stack

Lessons in This Module

  1. llama.cpp Architecture and Setup - GGUF format, quantization levels, build and run
  2. Ollama Deep Dive - Modelfile, API, model management, customization
  3. LM Studio and Desktop Inference - GUI workflow, model discovery, local server
  4. Hardware Requirements by Model Size - VRAM/RAM rules of thumb, CPU vs GPU inference
  5. Context Length and Memory Tradeoffs - KV cache sizing locally, context window limits
  6. Prompt Templates by Model Family - Llama, Mistral, ChatML, and why templates matter
  7. Benchmarking Local Inference Speed - Tokens/sec, time-to-first-token, comparing setups
  8. Local Models in Production APIs - Running Ollama as a production API, load balancing

Key Concepts You Will Master

  • GGUF quantization levels - Q4_0, Q4_K_M, Q8_0 - the quality/speed/memory tradeoff at each level
  • Memory calculation for local inference - the formula: model params × bytes/param + KV cache overhead (see the first sketch after this list)
  • Prompt template compliance - why using the wrong prompt format degrades model performance (see the second sketch after this list)
  • CPU inference - when CPU-only inference is acceptable and how to optimize it
  • Ollama API compatibility - how to use Ollama as a drop-in replacement for OpenAI API calls
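
The memory rule of thumb can be made concrete with a back-of-the-envelope calculation. The bytes-per-parameter figures for each quantization level and the Llama-3-8B-style dimensions below (32 layers, 8 KV heads, head dim 128, fp16 KV cache) are rough illustrative assumptions, not exact file sizes.

```python
# Back-of-the-envelope memory estimate for local inference:
# total ~= model params x bytes/param + KV cache overhead.
# All constants below are approximate and for illustration only.

BYTES_PER_PARAM = {      # rough effective bytes per weight for common GGUF quants
    "F16": 2.0,
    "Q8_0": 1.06,
    "Q4_K_M": 0.59,
    "Q4_0": 0.56,
}

def weight_bytes(n_params: float, quant: str) -> float:
    return n_params * BYTES_PER_PARAM[quant]

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: float = 2.0) -> float:
    # 2x for keys and values, one cache entry per layer per token
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

GiB = 1024 ** 3
params = 8e9                                   # ~8B parameter model
weights = weight_bytes(params, "Q4_K_M")
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)

print(f"weights  ~{weights / GiB:.1f} GiB")    # ~4.4 GiB at Q4_K_M
print(f"kv cache ~{kv / GiB:.1f} GiB")         # ~1.0 GiB at 8k context, fp16 cache
print(f"total    ~{(weights + kv) / GiB:.1f} GiB, plus runtime overhead")
```

This is why an 8B model at Q4_K_M fits comfortably in 8 GB of RAM at short contexts but starts to squeeze the same machine once you push the context window up.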
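
The second sketch shows why prompt templates matter: the same user message rendered in two different chat formats. The token strings follow the commonly published Llama 3 and ChatML conventions, but treat them as illustrative; the model card for your specific checkpoint is authoritative, and tools like Ollama apply the correct template for you via the Modelfile.

```python
# Illustration: one user message, two chat template conventions.
user_msg = "List three GGUF quantization levels."

# Llama 3 instruct-style template (sketch)
llama3_prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    f"{user_msg}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

# ChatML-style template, used by several other model families (sketch)
chatml_prompt = (
    f"<|im_start|>user\n{user_msg}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

print(llama3_prompt)
print(chatml_prompt)

# Feeding the ChatML string to a Llama 3 checkpoint (or vice versa) means the
# model never sees the control tokens it was fine-tuned on, which typically
# shows up as rambling answers, missing stop behavior, or role confusion.
```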

Prerequisites

  • macOS, Linux, or Windows with WSL2
  • 8GB RAM minimum (16GB+ recommended for 7B models)
  • Optional: NVIDIA or Apple Silicon GPU for faster inference