21 docs tagged with "evaluation"

Benchmarks: MMLU, HumanEval, and HELM

Navigate the LLM benchmark ecosystem - what each benchmark actually measures, saturation, contamination, and how to build benchmarks that are hard to game.

BLEU, ROUGE, and Generation Metrics

Master reference-based generation metrics - BLEU, ROUGE, BERTScore, BLEURT - and know exactly when each one lies to you.

Building an Evaluation Harness

Building a production evaluation harness for LLMs - lm-evaluation-harness architecture, custom task integration, CI/CD evaluation gates, versioned evaluation datasets, and automated regression detection.

Code Generation Evaluation

Evaluating LLMs on code generation tasks - HumanEval, MBPP, LiveCodeBench, SWE-bench, pass@k metric, EvalPlus, execution-based evaluation, security testing, and building sandboxed evaluation environments.

Cross-Validation

A comprehensive guide to cross-validation - k-fold, stratified, repeated, leave-one-out (LOOCV), group CV, time-series CV, nested CV, and common pitfalls including data leakage.

Evaluation Metrics for Regression

A comprehensive guide to regression evaluation - MAE, MSE, RMSE, R², MAPE, Huber loss, residual diagnostics, business-aligned metrics, and production monitoring patterns.

Factuality and Hallucination Evaluation

Measuring hallucination rates in open-source LLMs - TruthfulQA, FActScore, RAGAS factuality, entity verification, and building automated hallucination detection pipelines for production RAG systems.

Human Evaluation

Design rigorous human evaluation studies for LLMs - from annotation protocols to inter-annotator agreement to Chatbot Arena methodology.

LLM-as-Judge

Use powerful LLMs to automatically evaluate other models - with position bias mitigation, chain-of-thought (CoT) judging, and cost analysis.

Long-Context Evaluation

Evaluating LLM long-context capability - the Needle in a Haystack test, RULER benchmark, lost-in-the-middle phenomenon, and measuring effective context utilization vs claimed context window size.

Module 06: LLM Evaluation

A complete guide to evaluating large language models - from perplexity to production monitoring.

Module 6: Evaluating Open Models

Build eval suites that give real signal - benchmark contamination, domain-specific evaluation, LLM-as-judge for open models, and regression testing after fine-tuning.

Open LLM Leaderboard and Benchmarks

Understanding the Hugging Face Open LLM Leaderboard, what each benchmark actually measures, how contamination distorts scores, and how to use leaderboard numbers to make real deployment decisions.

Perplexity and Language Model Metrics

Understand perplexity, cross-entropy, bits per byte, and when intrinsic metrics mislead you about model quality.

Production Monitoring for LLMs

Build a comprehensive production monitoring stack for LLMs - latency, cost, quality drift, safety, and observability platforms compared.

RAG Evaluation Metrics

Evaluate RAG systems with precision - the RAG triad, RAGAS framework, golden datasets, and retrieval metrics for production pipelines.

Reasoning and Math Evaluation

Evaluating LLM mathematical and logical reasoning - GSM8K, MATH, AIME benchmarks, chain-of-thought evaluation, process reward models, self-consistency voting, and measuring multi-step reasoning quality.

Safety and Bias Evaluation

Evaluate LLMs for harmful outputs, social bias, hallucination, and jailbreak vulnerability - including red teaming methodology and production monitoring.

Safety and Bias Evaluation

Evaluating open-source models for safety and bias before production deployment - red-teaming, toxicity measurement, demographic bias benchmarks, jailbreak robustness, and building end-to-end safety evaluation pipelines.

Task-Specific Evaluation Design

Building evaluation suites tailored to your production use case - test set curation, annotation, metric selection, LLM-as-judge, and automated scoring pipelines that actually predict deployment quality.

Train / Validation / Test Split Strategy

A deep dive into data splitting - why the split matters, how to partition data correctly, data leakage patterns, temporal splits, group splits, and production-grade evaluation design.