Skip to main content

6 docs tagged with "benchmarks"

View all tags

Building an Evaluation Harness

Building a production evaluation harness for LLMs - lm-evaluation-harness architecture, custom task integration, CI/CD evaluation gates, versioned evaluation datasets, and automated regression detection.

Code Generation Evaluation

Evaluating LLMs on code generation tasks - HumanEval, MBPP, LiveCodeBench, SWE-bench, pass@k metric, EvalPlus, execution-based evaluation, security testing, and building sandboxed evaluation environments.

Long-Context Evaluation

Evaluating LLM long-context capability - the Needle in a Haystack test, RULER benchmark, lost-in-the-middle phenomenon, and measuring effective context utilization vs claimed context window size.

Open LLM Leaderboard and Benchmarks

Understanding the HuggingFace Open LLM Leaderboard, what each benchmark actually measures, how contamination distorts scores, and how to use leaderboard numbers to make real deployment decisions.

Reasoning and Math Evaluation

Evaluating LLM mathematical and logical reasoning - GSM8K, MATH, AIME benchmarks, chain-of-thought evaluation, process reward models, self-consistency voting, and measuring multi-step reasoning quality.

Task-Specific Evaluation Design

Building evaluation suites tailored to your production use case - test set curation, annotation, metric selection, LLM-as-judge, and automated scoring pipelines that actually predict deployment quality.