Benchmarking Local Model Performance
Measuring local LLM inference speed - tokens per second, time to first token, memory usage, and systematic comparison across quantization levels, models, and hardware configurations.
Measuring local LLM inference speed - tokens per second, time to first token, memory usage, and systematic comparison across quantization levels, models, and hardware configurations.
Running LLMs in Docker containers for reproducibility and deployment portability. NVIDIA Container Toolkit, Ollama and vLLM Docker images, multi-stage builds, and Docker Compose for a full local AI stack.
How to select hardware for running LLMs locally - VRAM and RAM requirements by model size, GPU tier comparison, Apple Silicon analysis, CPU-only inference feasibility, and a practical hardware selection matrix.
llama.cpp - Georgi Gerganov's C++ inference engine that runs quantized LLMs on CPUs and consumer GPUs. GGUF binary format, quantization types, performance tuning, and practical local inference.
LM Studio, Jan.ai, GPT4All, and Open WebUI for running LLMs locally - model discovery, hardware acceleration, local server mode, OpenAI-compatible APIs, and building a complete local AI development workspace.
Apple's MLX framework for running and fine-tuning LLMs on M-series chips - unified memory architecture, lazy evaluation, mlx-lm for inference, LoRA fine-tuning, and benchmarking against llama.cpp.
llama.cpp, Ollama, and LM Studio - run any open source model on your own hardware, understand memory requirements, and set up a local development environment.
Ollama - Docker-like CLI for running and managing local LLMs. Modelfile format, REST API, OpenAI-compatible endpoints, Python integration, and building a complete local AI stack.
Deploying LLMs in air-gapped environments without internet access - pre-downloading models, offline HuggingFace usage, regulatory compliance, and architecture for privacy-critical AI.