9 docs tagged with "running-locally"

Benchmarking Local Model Performance

Measuring local LLM inference speed - tokens per second, time to first token, memory usage, and systematic comparison across quantization levels, models, and hardware configurations.

Docker and Containerized Local Inference

Running LLMs in Docker containers for reproducibility and deployment portability. NVIDIA Container Toolkit, Ollama and vLLM Docker images, multi-stage builds, and Docker Compose for a full local AI stack.

Hardware Requirements and Selection

How to select hardware for running LLMs locally - VRAM and RAM requirements by model size, GPU tier comparison, Apple Silicon analysis, CPU-only inference feasibility, and a practical hardware selection matrix.

llama.cpp and GGUF Format

llama.cpp - Georgi Gerganov's C++ inference engine that runs quantized LLMs on CPUs and consumer GPUs. GGUF binary format, quantization types, performance tuning, and practical local inference.

LM Studio and GUI Tools

LM Studio, Jan.ai, GPT4All, and Open WebUI for running LLMs locally - model discovery, hardware acceleration, local server mode, OpenAI-compatible APIs, and building a complete local AI development workspace.

MLX for Apple Silicon

Apple's MLX framework for running and fine-tuning LLMs on M-series chips - unified memory architecture, lazy evaluation, mlx-lm for inference, LoRA fine-tuning, and benchmarking against llama.cpp.

Module 2: Running Models Locally

llama.cpp, Ollama, and LM Studio - run any open-source model on your own hardware, understand memory requirements, and set up a local development environment.

Ollama and Local Model Management

Ollama - a Docker-like CLI for running and managing local LLMs. Modelfile format, REST API, OpenAI-compatible endpoints, Python integration, and building a complete local AI stack.

Privacy and Air-Gapped Deployment

Deploying LLMs in air-gapped environments without internet access - pre-downloading models, offline Hugging Face usage, regulatory compliance, and architecture for privacy-critical AI.