Skip to main content

64 docs tagged with "open-source-models"

View all tags

Advanced PEFT Methods

Beyond LoRA - Prefix Tuning, Prompt Tuning, IA3, AdaLoRA, VeRA, and LoftQ. When to reach for each method, how they compare on parameter count and quality, and practical implementation with the PEFT library.

AWQ In-Depth

How Activation-aware Weight Quantization protects salient weights to achieve near-lossless INT4 compression, and how to deploy AWQ models with AutoAWQ and vLLM.

Axolotl and TRL Training Frameworks

Using Axolotl and HuggingFace TRL for LoRA and QLoRA fine-tuning - configuration files, SFTTrainer, DPO training, and distributed multi-GPU fine-tuning setups.

Benchmarking Local Model Performance

Measuring local LLM inference speed - tokens per second, time to first token, memory usage, and systematic comparison across quantization levels, models, and hardware configurations.

Building an Evaluation Harness

Building a production evaluation harness for LLMs - lm-evaluation-harness architecture, custom task integration, CI/CD evaluation gates, versioned evaluation datasets, and automated regression detection.

Code and Math Specialized Models

How domain-specific pre-training and fine-tuning on code and math data produces models that outperform general LLMs on programming and reasoning tasks - and when to use them in production.

Code Generation Evaluation

Evaluating LLMs on code generation tasks - HumanEval, MBPP, LiveCodeBench, SWE-bench, pass@k metric, EvalPlus, execution-based evaluation, security testing, and building sandboxed evaluation environments.

Continual Learning and Domain Adaptation

Learn how to adapt open-source language models to specialized domains through continual pre-training, manage catastrophic forgetting with EWC and data mixing, and evaluate domain knowledge gain versus general capability loss.

Deploying Quantized Models in Production

End-to-end guide for production deployment of quantized LLMs - format selection, serving stack configuration, latency SLAs, A/B testing, quality monitoring, and rollback strategy.

Docker and Containerized Local Inference

Running LLMs in Docker containers for reproducibility and deployment portability. NVIDIA Container Toolkit, Ollama and vLLM Docker images, multi-stage builds, and Docker Compose for a full local AI stack.

Evaluating Fine-Tuned Models

Evaluation strategies for fine-tuned LLMs - held-out test sets, LLM-as-judge evaluation, perplexity measurement, task-specific benchmarks, and avoiding evaluation pitfalls.

Factuality and Hallucination Evaluation

Measuring hallucination rates in open-source LLMs - TruthfulQA, FActScore, RAGAs factuality, entity verification, and building automated hallucination detection pipelines for production RAG systems.

Fine-Tuning Cost and ROI Analysis

Making the business case for LLM fine-tuning - calculating GPU compute costs, estimating break-even against API pricing, and deciding when fine-tuning beats prompt engineering on ROI.

Fine-Tuning Hyperparameter Search

Systematic hyperparameter optimization for LLM fine-tuning - learning rate, batch size, epochs, LoRA rank, warmup schedules, and efficient search strategies with Optuna and WandB sweeps.

Full Fine-Tuning vs PEFT

Decision framework for choosing between full fine-tuning and parameter-efficient methods like LoRA and QLoRA - covering compute requirements, quality ceilings, catastrophic forgetting, and when each approach wins.

GPTQ In Depth

A deep technical walkthrough of the GPTQ algorithm - Optimal Brain Surgeon derivation, layer-by-layer quantization, group quantization, actorder, and practical deployment with AutoGPTQ and vLLM.

Hardware Requirements and Selection

How to select hardware for running LLMs locally - VRAM and RAM requirements by model size, GPU tier comparison, Apple Silicon analysis, CPU-only inference feasibility, and a practical hardware selection matrix.

HuggingFace Hub and Model Cards

Master the HuggingFace Hub as your primary interface for finding, evaluating, and deploying open-source models. Learn to read model cards, use the Hub API, and navigate 800k+ models efficiently.

Instruction Tuning at Scale

How to instruction-tune open-source models at production scale - covering the FLAN insight, dataset construction principles, scaling laws for instruction data, multi-node training setup, and a complete pipeline for fine-tuning Llama 3 8B on a 2-node A100 cluster.

Kubernetes and Auto-Scaling for LLMs

Deploy LLMs on Kubernetes with GPU scheduling, HPA and KEDA for autoscaling, MIG partitioning on A100/H100, and Karpenter for on-demand GPU node provisioning.

LLaMA Family Architecture

A deep dive into Meta's LLaMA model family - from LLaMA 1 through LLaMA 3.3 - covering RoPE embeddings, SwiGLU activation, RMSNorm, grouped query attention, and when to choose each variant.

llama.cpp and GGUF Format

llama.cpp - Georgi Gerganov's C++ inference engine that runs quantized LLMs on CPUs and consumer GPUs. GGUF binary format, quantization types, performance tuning, and practical local inference.

LM Studio and GUI Tools

LM Studio, Jan.ai, GPT4All, and Open WebUI for running LLMs locally - model discovery, hardware acceleration, local server mode, OpenAI-compatible APIs, and building a complete local AI development workspace.

Load Balancing and Request Routing

Load balancing strategies for LLM serving - prefix-aware routing for KV cache reuse, least-connections for variable-cost requests, model routing, circuit breakers, and building a production gateway.

Long-Context Evaluation

Evaluating LLM long-context capability - the Needle in a Haystack test, RULER benchmark, lost-in-the-middle phenomenon, and measuring effective context utilization vs claimed context window size.

LoRA Mathematics and Implementation

Learn how LoRA (Low-Rank Adaptation) decomposes weight updates into low-rank matrices, why this works mathematically, and how to implement it from scratch in PyTorch and with HuggingFace PEFT.

Merging and Model Soup Techniques

Combining multiple fine-tuned models without retraining - LoRA adapter merging, SLERP, TIES-merging, DARE, and MergeKit for production model merging that unlocks capabilities no single training run achieves.

Mistral and Mixtral Architecture

Mistral 7B's sliding window attention and grouped query attention innovations, and Mixtral 8x7B's Mixture of Experts design - sparse routing, expert selection, and why MoE delivers 70B quality at 13B active parameter cost.

MLX for Apple Silicon

Apple's MLX framework for running and fine-tuning LLMs on M-series chips - unified memory architecture, lazy evaluation, mlx-lm for inference, LoRA fine-tuning, and benchmarking against llama.cpp.

Model Licensing and Compliance

Open-source model licenses are not all the same. Learn Apache 2.0, LLaMA Community, RAIL, and custom licenses - what you can and cannot do in production, and how to build a compliance workflow.

Model Versioning and Canary Releases

Managing model versions in production LLM serving - semantic versioning for models, canary deployments, A/B testing, shadow mode evaluation, rollback procedures, and blue-green model deployments.

Module 2: Running Models Locally

llama.cpp, Ollama, and LM Studio - run any open source model on your own hardware, understand memory requirements, and set up a local development environment.

Module 3: LoRA and QLoRA Fine-Tuning

Fine-tune any open source model on your data without owning a data center - LoRA theory, QLoRA 4-bit training, hyperparameter selection, and getting a specialized model into production.

Module 4: Quantization in Practice

GGUF, GPTQ, AWQ, and bitsandbytes - compress models to fit your hardware budget while understanding exactly what quality you are trading away and why.

Module 5: Fine-Tuning Pipelines

Production fine-tuning with Axolotl - dataset formatting, multi-GPU training, DPO preference tuning, and managing adapter versions across model releases.

Module 6: Evaluating Open Models

Build eval suites that give real signal - benchmark contamination, domain-specific evaluation, LLM-as-judge for open models, and regression testing after fine-tuning.

Monitoring and Debugging Fine-Tuning

How to monitor LLM fine-tuning runs and debug failures - tracking loss curves, gradient norms, GPU utilization, MFU, and diagnosing NaN loss, overfitting, and OOM errors in LoRA and full fine-tuning.

Monitoring LLM Services

Production observability for LLM serving systems - GPU metrics, TTFT, inter-token latency, vLLM Prometheus integration, distributed tracing, alerting, and Grafana dashboards.

Multi-Model Serving Architecture

Serving multiple LLMs from shared infrastructure - model routing, MIG partitioning, dynamic loading, LiteLLM proxy, cost optimization through bin-packing, and autoscaling per model in production.

Multimodal Open Source Models

How open-source vision-language models work - from CLIP vision encoders and projection layers to LLaVA, InternVL2, and LLaMA 3.2 Vision - and how to deploy them for document understanding, OCR, and visual reasoning in production.

Ollama and Local Model Management

Ollama - Docker-like CLI for running and managing local LLMs. Modelfile format, REST API, OpenAI-compatible endpoints, Python integration, and building a complete local AI stack.

Open LLM Leaderboard and Benchmarks

Understanding the HuggingFace Open LLM Leaderboard, what each benchmark actually measures, how contamination distorts scores, and how to use leaderboard numbers to make real deployment decisions.

Open Source Models

Run, fine-tune, quantize, evaluate, and deploy open source LLMs in production - the complete hands-on guide for engineers who want to own their models.

Phi and Small Language Models

Microsoft Phi model family - textbook quality data hypothesis, how 1-4B models can match much larger ones on reasoning tasks, and the design principles behind efficient small language models.

Post-Training Quantization Methods

A practical guide to PTQ methods for LLMs - GPTQ, AWQ, SmoothQuant, bitsandbytes, GGUF, and HQQ compared by accuracy, speed, memory, and production use case.

Privacy and Air-Gapped Deployment

Deploying LLMs in air-gapped environments without internet access - pre-downloading models, offline HuggingFace usage, regulatory compliance, and architecture for privacy-critical AI.

QLoRA: 4-Bit Fine-Tuning

Learn how QLoRA combines 4-bit NF4 quantization, double quantization, and paged optimizers to fine-tune 65B parameter models on a single GPU - covering the math, implementation, and production engineering.

Quantization Benchmarking

How to rigorously evaluate quantization quality using perplexity, downstream task accuracy, latency, and memory metrics - and build a complete benchmarking pipeline comparing FP16 vs GPTQ vs AWQ vs NF4.

Quantization Error Debugging

How to diagnose and fix quantization quality degradation - symptoms, root causes, diagnostic tools, and systematic fixes for INT4/INT8 quantized LLMs.

Quantization for Vision Models

How to quantize CNN and ViT vision models and vision-language models - handling batch norm sensitivity, attention outliers, and the strategy of quantizing the LLM backbone while keeping the vision encoder in FP16.

Quantization-Aware Training

When post-training quantization is not enough - how QAT simulates quantization noise during training so models learn to be robust to it, covering the straight-through estimator, QLoRA, and BitNet.

Qwen, DeepSeek, and International Models

Alibaba Qwen and DeepSeek architectural innovations - MLA attention, DeepSeekMoE, multi-token prediction, and how Chinese labs are advancing open-source LLM research.

Rate Limiting and Cost Control

Controlling costs and preventing abuse in LLM API serving - token-based rate limiting, Redis token buckets, tenant isolation, cost attribution, budget alerts, and abuse detection.

Reasoning and Math Evaluation

Evaluating LLM mathematical and logical reasoning - GSM8K, MATH, AIME benchmarks, chain-of-thought evaluation, process reward models, self-consistency voting, and measuring multi-step reasoning quality.

RLHF and DPO for Open Models

Learn how to align open-source language models with human preferences using RLHF and the simpler, more stable Direct Preference Optimization (DPO) approach with TRL.

Safety and Bias Evaluation

Evaluating open-source models for safety and bias before production deployment - red-teaming, toxicity measurement, demographic bias benchmarks, jailbreak robustness, and building end-to-end safety evaluation pipelines.

Selecting Target Modules and Rank

Which layers to apply LoRA to and what rank to use - two of the most impactful fine-tuning decisions. Covers attention vs FFN targeting, rank selection from r=4 to r=64, RSLoRA, DoRA, LoRA+, and ablation strategies.

Synthetic Data and Self-Improvement

Generating high-quality synthetic training data with LLMs using Evol-Instruct, Self-Instruct, Constitutional AI, rejection sampling, and self-play techniques to build data flywheels without expensive human annotation.

Task-Specific Evaluation Design

Building evaluation suites tailored to your production use case - test set curation, annotation, metric selection, LLM-as-judge, and automated scoring pipelines that actually predict deployment quality.

TGI and Alternative Serving Frameworks

Compare HuggingFace TGI, Ollama, LiteLLM, Triton Inference Server, and llama.cpp for LLM deployment - feature analysis, performance benchmarks, and when to use each framework.

Training Data Preparation for Fine-Tuning

Building high-quality data pipelines for LoRA fine-tuning - chat templates, instruction masking, deduplication, quality filtering, synthetic data generation, and dataset formats that actually produce good models.

vLLM Architecture and Deployment

Deploy open LLMs at production scale using vLLM - PagedAttention, continuous batching, tensor parallelism, and OpenAI-compatible serving for LLaMA 3 70B and beyond.