
Open LLM Leaderboard and Benchmarks

The Monday Morning Meeting

It is Monday morning. Your team has been asked to pick an open-source model to replace a proprietary API that costs $40,000 per month. You have three weeks and a budget for two A100s. Your manager pulls up the HuggingFace Open LLM Leaderboard and points at a model ranked third overall. "This one looks good," he says. "Strong MMLU, solid ARC. Ship it by Friday."

You smile. You nod. And then you spend the next four days debugging why a model with a 72.4 MMLU score produces answers that make no sense for your document processing pipeline. The model aces college-level biology trivia. It cannot reliably extract a vendor name from a purchase order.

This scenario plays out dozens of times a week at companies deploying open-source models. Engineers treat leaderboard scores the way non-engineers treat university rankings: as a proxy for quality that is easy to cite but only loosely correlated with what actually matters. The leaderboard is real. The benchmarks are real. The gap between benchmark performance and production performance is also real, and ignoring it is expensive.

The goal of this lesson is not to dismiss benchmarks. They encode genuine research effort and they do measure something meaningful. The goal is to give you a precise understanding of what each benchmark measures, what its failure modes are, and how to combine leaderboard information with your own evaluation signal to make a deployment decision you can defend.

By the end of this lesson you will be able to reproduce leaderboard scores locally using lm-evaluation-harness, explain exactly why a model can score 75% on MMLU while failing your production task, and know when to trust a leaderboard number and when to throw it out entirely.


Why This Exists

The Problem Before Standard Benchmarks

Before 2021, comparing open-source language models was chaos. Every research paper used different test sets, different prompting styles, different evaluation code. The same model could appear to score 60% on one team's version of a dataset and 70% on another team's version. There was no shared infrastructure. Reproducing a paper's numbers often required reading two paragraphs of footnotes and then writing evaluation code from scratch.

The practical consequence was that practitioners had no signal. If you wanted to deploy a model, you had to evaluate it yourself. For small teams this meant either spending weeks building evaluation infrastructure or skipping evaluation entirely and hoping the model worked. Most teams skipped evaluation.

The research consequence was equally bad. Paper authors could cherry-pick evaluation setups to make their model look best. This is not malice - it is incentive. If your evaluation code is slightly different from the baseline's evaluation code, your model looks better. Publish or perish applies to benchmark numbers too.

What the Leaderboard Solves

The HuggingFace Open LLM Leaderboard, launched in mid-2023, solved the reproducibility problem by standardizing evaluation infrastructure. Every model on the leaderboard is evaluated with the same codebase (lm-evaluation-harness), the same prompts, the same normalization, and the same hardware. If model A scores 72.4 on MMLU and model B scores 71.9, those numbers mean the same thing in the same conditions.

This is genuinely valuable. Before the leaderboard, a 0.5 point difference in MMLU scores between two papers was meaningless - it could be evaluation noise. On the leaderboard, with standardized evaluation, that difference is at least consistently measured noise.

The leaderboard also created a shared language. When you say "this model scores 75 on MMLU," every ML engineer in the world knows exactly what protocol you mean. That shared language accelerates decision making, even when the underlying signal is imperfect.

Why It Also Created New Problems

Standardized benchmarks create standardized optimization targets. Once the leaderboard became the primary signal for "which model is best," model developers started optimizing for leaderboard scores directly. Some of this optimization is legitimate - better training produces better benchmark scores because the model is genuinely smarter. Some of it is not legitimate: training on data that overlaps with test sets, selecting checkpoints that maximize leaderboard scores rather than general capability, and running hundreds of fine-tuning experiments until one happens to score well on the specific benchmark distribution.

This is the benchmark contamination and gaming problem. It is not hypothetical. It is systematic. And it is the reason why you cannot use leaderboard scores alone to make a deployment decision.


Historical Context

The Benchmark Lineage

MMLU (Massive Multitask Language Understanding) was introduced by Dan Hendrycks and colleagues at UC Berkeley in 2020. The "aha moment" came when Hendrycks realized that existing NLP benchmarks tested narrow linguistic skills but not the kind of knowledge a capable assistant would actually need. He assembled 57 subjects - from abstract algebra to world religions - by collecting questions from real academic exams and professional licensing tests. The result was a benchmark that tested whether models had internalized the kind of structured knowledge found in textbooks.

HellaSwag was published by Zellers et al. in 2019. The insight was that existing commonsense benchmarks were too easy for the models of that era - BERT was already scoring near human performance. Zellers et al. used a technique called adversarial filtering: generate candidate wrong endings with a language model, then filter out the ones a discriminator model can easily distinguish from the real continuation. The wrong answers that survive are the ones that look plausible to models but do not actually make sense to humans - exactly the failure mode you care about in production.

ARC (AI2 Reasoning Challenge) was published by Clark et al. at the Allen Institute for AI in 2018. The Challenge Set specifically contains questions that simple retrieval methods and statistical shortcuts cannot answer correctly. It was designed to test genuine multi-step reasoning rather than pattern matching.

WinoGrande was published by Sakaguchi et al. in 2019, scaling up the original Winograd Schema Challenge. Winograd schemas are pronoun disambiguation problems that require commonsense knowledge about the world to resolve. "The trophy didn't fit in the suitcase because it was too big" - what does "it" refer to? Humans answer instantly. Models that rely on surface statistics struggle.

GSM8K (Grade School Math 8K) was published by Cobbe et al. at OpenAI in 2021. It contains 8,500 grade school math word problems requiring multi-step arithmetic reasoning. The key insight was that these problems require the model to maintain state across reasoning steps, not just recall a fact.

HumanEval was published by Chen et al. at OpenAI in 2021, alongside the Codex paper. It contains 164 Python programming problems with unit tests. It evaluates pass@k: the probability that at least one of k generated solutions passes the unit tests.

TruthfulQA was published by Lin et al. in 2021. The key insight was that larger models are often less truthful because they are better at generating plausible-sounding falsehoods that match the statistical patterns of human writing. TruthfulQA specifically targets questions where humans are commonly wrong due to misconceptions - "What happens if you swallow a watermelon seed?" - to test whether models parrot misconceptions or state correct information.


Core Concepts: What Each Benchmark Actually Measures

MMLU: Knowledge Breadth, Not Intelligence

MMLU presents 4-choice multiple choice questions across 57 subjects. The evaluation protocol is zero-shot or 5-shot. The model must output the probability of tokens "A", "B", "C", or "D" after the question prompt, and the highest-probability token is taken as the answer.

What MMLU actually measures: the breadth of factual and conceptual knowledge encoded in the model's weights. A model that was trained on a large fraction of textbooks and Wikipedia will score well. A model that was not will score poorly, regardless of reasoning ability.

What MMLU does not measure: reasoning under uncertainty, instruction following, generation quality, factual consistency in generation, or anything requiring multi-turn interaction.

The mathematical formulation is simple. For a question with gold answer $a^* \in \{A, B, C, D\}$, the model is scored as correct if

$$\hat{a} = \arg\max_{a \in \{A,B,C,D\}} P_\theta(\text{token}(a) \mid \text{prompt})$$

and $\hat{a} = a^*$. The final score is the percentage of questions answered correctly across all 57 subjects.
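
A minimal sketch of this protocol using HuggingFace transformers is below. It scores one already-formatted question by comparing the next-token logits of the four answer letters; the leading-space tokenization and the example model name are assumptions, and the real harness handles tokenizer edge cases more carefully.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def score_mmlu_question(model, tokenizer, prompt: str) -> str:
    """Pick the answer letter whose token gets the highest probability
    after the prompt - the same argmax-over-options rule as above."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits          # (1, seq_len, vocab_size)
    next_token_logits = logits[0, -1]            # distribution over the next token
    option_letters = ["A", "B", "C", "D"]
    # Assumes each option tokenizes to a single leading-space token
    option_ids = [tokenizer.encode(" " + o, add_special_tokens=False)[0]
                  for o in option_letters]
    scores = {o: next_token_logits[i].item() for o, i in zip(option_letters, option_ids)}
    return max(scores, key=scores.get)

# Hypothetical usage (model name and prompt formatting are placeholders):
# tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype="auto")
# prediction = score_mmlu_question(model, tokenizer, formatted_question)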

Limitation 1: Multiple choice is not generation. Real tasks require generating text, not selecting from four options. A model can score 75% on MMLU while producing hallucinated outputs in generation mode.

Limitation 2: Subject coverage is uneven. 57 subjects sounds comprehensive but some are densely represented (medicine, law) and others barely present. If your use case is in a sparse subject, MMLU tells you almost nothing.

Limitation 3: The 5-shot format leaks information. The 5 in-context examples teach the model the answer format. Models that are better at following this specific format score higher independent of actual knowledge.

HellaSwag: Commonsense Completion

HellaSwag presents a partial description of an activity and four possible completions. The model must pick the most plausible continuation. For example: "A man is washing a car. He rinses the soap off. He..." - which completion makes sense?

What HellaSwag actually measures: the model's learned understanding of how physical and social activities unfold in time. It captures the statistical patterns of everyday human activity.

What HellaSwag does not measure: abstract reasoning, domain-specific knowledge, or anything outside common everyday activities. A model that scores 90% on HellaSwag can still fail completely on technical commonsense.

Limitation: Adversarial filtering trained on older models. The wrong answers were designed to fool BERT-era models. Modern large language models learned these specific patterns. Scores above 85% on HellaSwag are likely saturating on the benchmark's specific adversarial construction rather than improving at genuine commonsense.

ARC-Challenge: Multi-Step Science Reasoning

ARC-Challenge contains 1,172 multiple choice science questions at the grade 3-9 level that retrieval-based and statistical methods fail to answer correctly.

What ARC actually measures: the ability to apply scientific principles to novel situations that require chaining facts together. It tests whether the model can use knowledge it has, not just retrieve it.

Limitation: Grade school scope. These are questions a well-trained 12-year-old can answer. High ARC-Challenge scores do not predict performance on graduate-level scientific reasoning. The ceiling is low.

WinoGrande: Pronoun Disambiguation Under Commonsense Constraints

WinoGrande presents sentences with ambiguous pronouns that require world knowledge to resolve. The evaluation uses partial scoring: the model must assign higher probability to the sentence with the correct referent than to the sentence with the incorrect referent.

What WinoGrande actually measures: lexical and commonsense binding constraints - whether the model understands which physical or social facts constrain pronoun reference.

Limitation: Narrow linguistic format. Real text does not present pronoun disambiguation problems in this structured format. High WinoGrande scores mean the model is good at this specific task, not that it has robust commonsense reasoning in general.

GSM8K: Grade School Math With Chain of Thought

GSM8K evaluates pass rates on word problems like "Janet's ducks lay 16 eggs per day. She eats three for breakfast and bakes four into muffins every day. She sells the remainder for $2 each. How much does she make per day at the farmers market?"

What GSM8K actually measures: the ability to decompose a multi-step arithmetic problem into a sequence of correct intermediate calculations. Chain of thought prompting is standard for this benchmark.

The pass@1 metric is simply the fraction of problems answered correctly: $\text{pass@1} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}[\text{answer}_i = \text{gold}_i]$.

Limitation: "Grade school" is misleading. GSM8K requires multi-step reasoning that many models fail. But it does not test symbolic manipulation, proof construction, or any advanced math. A model scoring 80% on GSM8K can still fail simple algebra word problems with unusual framing.

Limitation: Answer extraction is fragile. Evaluation usually extracts the last number from the generated chain of thought. Models can produce correct reasoning but format the final answer incorrectly and score 0 for that example.
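
A sketch of that extraction step is below; the regex is illustrative, not the harness's exact pattern. It shows how correct reasoning with an unexpected final format still scores zero.

import re

def extract_final_answer(generation: str) -> str | None:
    """Take the last number in the generated chain of thought as the answer.
    This is the fragile step: correct reasoning with an unusual final
    format produces no match and the example scores 0."""
    cleaned = generation.replace("$", "").replace(",", "")
    matches = re.findall(r"-?\d+\.?\d*", cleaned)
    return matches[-1].rstrip(".") if matches else None

print(extract_final_answer("16 - 3 - 4 = 9 eggs. 9 * 2 = 18. The answer is 18."))  # -> 18
print(extract_final_answer("She makes eighteen dollars per day."))                  # -> None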

HumanEval: Functional Correctness for Code

HumanEval uses 164 Python programming problems. Each problem provides a function signature and docstring. The model generates a completion. The evaluation runs unit tests against the generated code.

The canonical metric is pass@k, estimated via:

$$\text{pass@k} = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

where $n$ is the number of samples generated per problem and $c$ is the number that pass the unit tests.
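
This is straightforward to compute per problem. The sketch below uses the numerically stable product form suggested in the HumanEval paper rather than raw binomial coefficients; the sample counts are made up.

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c passed the tests.
    Evaluates 1 - C(n-c, k) / C(n, k) without forming huge binomial coefficients."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: every size-k subset contains a pass
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Benchmark-level pass@1: average over problems (the (n, c) pairs are hypothetical)
samples = [(20, 3), (20, 0), (20, 12)]
print(np.mean([pass_at_k(n, c, k=1) for n, c in samples]))  # 0.15, 0.0, 0.6 -> 0.25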

Limitation: 164 problems is a small test set. Random variance is high. A 2% difference in HumanEval scores between two models is not statistically significant.

Limitation: Python-only and introductory-level. HumanEval does not test multi-file projects, debugging, or any language other than Python. Production code tasks are much harder.

TruthfulQA: Resistance to Plausible Falsehoods

TruthfulQA presents 817 questions designed to elicit falsehoods that many humans believe. It is evaluated in two modes: MC (multiple choice) and generation (where a classifier judges truthfulness and informativeness).

What TruthfulQA actually measures: whether models parrot common misconceptions versus stating accurate information when the accurate information conflicts with popular belief.

Limitation: The misconceptions are curated for English-speaking Western audiences. Domain and cultural coverage is narrow.

Limitation: The generation evaluation requires a classifier that may itself be unreliable. TruthfulQA generation scores vary significantly based on which judge model you use.


The Benchmark Contamination Problem

How Contamination Happens

A model trained on data scraped from the internet has a significant probability of having seen benchmark test questions. Common Crawl and other large web corpora contain GitHub repositories, Stack Overflow answers, blog posts, and educational sites - many of which contain MMLU questions, ARC questions, or even direct copies of benchmark test sets.

If the model's training data contains the answer to question 47 of the MMLU abstract algebra test, the model is not demonstrating algebra knowledge when it answers that question correctly - it is demonstrating memorization. The benchmark score is inflated.

Contamination is not always intentional. Data pipelines that scrape the web broadly will pick up benchmark data. Some model developers do decontaminate (remove known benchmark examples from training data), but decontamination is imperfect. Near-duplicate detection catches exact copies but not paraphrases.

Measuring Contamination

One way to detect contamination is to compare a model's performance on known benchmark questions versus perturbed versions of the same questions. If the model drops significantly when question wording is changed slightly, that suggests memorization rather than reasoning.
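
A sketch of that perturbation check is below. The scoring function and the paired original/paraphrased question lists are assumptions - you would plug in something like the MMLU scorer shown earlier and paraphrases written by hand or by another model.

def contamination_gap(score_fn, originals, paraphrases) -> float:
    """Accuracy on original benchmark wording vs. paraphrased wording of the
    same questions. A large positive gap is a memorization signal, not proof.

    Assumes score_fn(question) -> bool says whether the model answered the
    question correctly, and that originals[i] and paraphrases[i] are the
    same underlying question."""
    acc_original = sum(score_fn(q) for q in originals) / len(originals)
    acc_paraphrased = sum(score_fn(q) for q in paraphrases) / len(paraphrases)
    gap = acc_original - acc_paraphrased
    print(f"original: {acc_original:.1%}  paraphrased: {acc_paraphrased:.1%}  gap: {gap:+.1%}")
    return gap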

Another approach is to evaluate on held-out benchmark variants that were not public during training. The Open LLM Leaderboard v2 introduced harder benchmarks partly for this reason: material that circulated less widely before models were trained is less likely to have leaked into their training data.

The Gaming Problem

Beyond accidental contamination, there is deliberate benchmark optimization. Since the leaderboard uses a fixed evaluation protocol, developers can:

  1. Run thousands of fine-tuning experiments and select the checkpoint that scores highest on the leaderboard benchmarks.
  2. Fine-tune on examples that are stylistically similar to benchmark questions even if not identical.
  3. Optimize chat templates and prompting to maximize performance on the specific 0-shot or 5-shot format used by lm-evaluation-harness.

None of these actions improve the model's general capability. They specifically improve leaderboard scores. The result is models that rank highly on the leaderboard but underperform on deployment tasks that differ from the benchmark distribution.


Open LLM Leaderboard v2: Harder Benchmarks

HuggingFace launched a substantially revised leaderboard in 2024. The v2 benchmarks were chosen to be harder, less contaminated, and more predictive of genuine capability.

GPQA (Graduate-Level Google-Proof Q&A): 448 expert-authored questions in biology, physics, and chemistry. PhDs in the relevant field answer them correctly about 65% of the time, while highly skilled non-experts with unrestricted internet access score around 34%. Most current models score around 30-50% depending on scale. This benchmark is deliberately hard to contaminate because the questions require expert reasoning rather than fact recall.

MUSR (Multi-Step Soft Reasoning): Tests multi-step reasoning chains in long-context settings - murder mysteries, object placement, team allocation. Requires maintaining complex state across long context.

MATH (Hendrycks Math): 12,500 competition math problems spanning 7 subject areas, each graded on 5 difficulty levels (the v2 leaderboard uses only the hardest, level-5 subset). Problems require symbolic manipulation, not just arithmetic. In the original paper, a computer science PhD student scored around 40% while an IMO gold medalist scored around 90%; current frontier models score 50-80% depending on scale and fine-tuning.

IFEval (Instruction Following Evaluation): Tests whether models follow explicit formatting and constraint instructions - "write a response in at least 200 words that does not include the word 'however'". Measures instruction adherence directly.

BBH (BIG-Bench Hard): 23 tasks from BIG-Bench, selected because language models at the time failed to beat the average human rater on them. Tests algorithmic reasoning, formal logic, and tasks with multi-step structure.

The v2 leaderboard correlates better with human preference evaluations (like LMSYS Chatbot Arena) than v1 did. But the gap between leaderboard performance and production performance remains significant.


How to Actually Use Leaderboard Scores

The Correlation Question

Before treating any leaderboard score as a signal for your use case, ask: what is the correlation between this benchmark and my task?

MMLU correlates with general knowledge retrieval tasks. If you are building a question-answering system over a knowledge base, MMLU is a reasonable proxy. If you are building a code assistant, MMLU tells you almost nothing.

GSM8K correlates with multi-step reasoning tasks. If your application involves any kind of numerical computation or step-by-step problem solving, GSM8K is a useful signal.

HumanEval correlates with code generation tasks. If you are deploying a code assistant, HumanEval is directly relevant, though 164 problems is a small sample.

Domain-Specific Interpretation

The leaderboard averages across tasks. Two models with the same average score can have very different profiles. Model A might score 80% on MMLU and 40% on HumanEval. Model B might score 55% on MMLU and 75% on HumanEval. They average the same but are suited for completely different use cases.

Always look at per-benchmark scores, not just the aggregate. Then map each benchmark to your use case domain.

The Floor and Ceiling Problem

Benchmarks have floors and ceilings. For 4-choice multiple choice, random guessing achieves 25%. A model scoring 30% has almost no useful signal above chance. For HumanEval, a 20% score means only 1 in 5 simple Python problems are solved correctly - that model is not suitable for a code generation application.

Conversely, benchmarks saturate. Models scoring above 85% on HellaSwag are at or near saturation. The benchmark can no longer differentiate between models at that capability level.
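
One quick sanity check is the binomial standard error implied by the benchmark's test-set size. The sketch below is a rough approximation (it treats questions as independent), but it is enough to show why a 2-point HumanEval gap or a 0.2-point HellaSwag gap is indistinguishable from noise.

import math

def score_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy estimate over n questions."""
    return math.sqrt(accuracy * (1 - accuracy) / n_questions)

# HumanEval has 164 problems: two models at ~46% vs ~48% differ by less than one SE
print(f"HumanEval SE:  +/-{score_standard_error(0.46, 164):.1%}")    # ~3.9%
# HellaSwag has roughly 10,000 validation examples: tiny SE, but the benchmark is saturated
print(f"HellaSwag SE:  +/-{score_standard_error(0.86, 10000):.2%}")  # ~0.35%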

What the Leaderboard Cannot Tell You

The leaderboard cannot tell you:

  • How the model performs on your specific data distribution
  • How the model handles long documents (benchmarks use short contexts)
  • Whether the model follows your specific instruction format
  • How latency and throughput compare at your serving scale
  • Whether the model's outputs are safe for your application domain
  • How the model degrades on out-of-distribution inputs

These gaps are not failures of the leaderboard. They are inherent limitations of any standardized benchmark. The leaderboard tells you one thing: how the model performs on these specific tasks under these specific conditions. Everything else requires your own evaluation.


Mermaid Diagrams

(Diagrams not reproduced here: a benchmark coverage map, a contamination and gaming pipeline, and a decision flow for interpreting a leaderboard score.)


Code: Running lm-evaluation-harness Locally

Installation

# Clone the evaluation harness
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness

# Install with the vLLM backend (optional, faster for large models):
pip install -e ".[vllm]"
# or for the basic HuggingFace transformers backend:
pip install -e .

Running a Single Benchmark

# Run MMLU on a local model - this reproduces the leaderboard setup
lm_eval \
--model hf \
--model_args pretrained=mistralai/Mistral-7B-v0.1,dtype=float16 \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 8 \
--output_path ./results/mistral-7b-mmlu \
--device cuda:0

# Run multiple benchmarks in one pass. Note: --num_fewshot takes a single
# integer applied to every task; to reproduce the leaderboard's per-task
# few-shot counts (MMLU=5, ARC=25, HellaSwag=10, WinoGrande=5, GSM8K=5,
# TruthfulQA=0), run each task in its own pass with the right value.
lm_eval \
--model hf \
--model_args pretrained=mistralai/Mistral-7B-v0.1,dtype=float16 \
--tasks mmlu,arc_challenge,hellaswag,winogrande,gsm8k,truthfulqa_mc1 \
--batch_size 4 \
--output_path ./results/mistral-7b-full \
--device cuda:0

Running with vLLM for Speed

# vLLM backend is 3-5x faster for large models
lm_eval \
--model vllm \
--model_args pretrained=mistralai/Mistral-7B-v0.1,dtype=float16,tensor_parallel_size=1 \
--tasks mmlu \
--num_fewshot 5 \
--batch_size auto \
--output_path ./results/mistral-vllm \
--device cuda

Parsing and Comparing Results

import json
from pathlib import Path

def load_results(results_dir: str) -> dict:
    """Load lm-evaluation-harness results from a directory."""
    results_path = Path(results_dir)
    result_files = list(results_path.glob("*.json"))

    if not result_files:
        raise FileNotFoundError(f"No result files found in {results_dir}")

    # Use the most recent file
    result_file = sorted(result_files)[-1]

    with open(result_file) as f:
        data = json.load(f)

    return data

def extract_scores(results: dict) -> dict:
    """Extract key scores from the lm-eval results structure."""
    scores = {}

    for task_name, task_results in results.get("results", {}).items():
        # Metric keys vary by task: prefer acc, then acc_norm, then pass@1
        if "acc,none" in task_results:
            scores[task_name] = task_results["acc,none"]
        elif "acc_norm,none" in task_results:
            scores[task_name] = task_results["acc_norm,none"]
        elif "pass@1,none" in task_results:
            scores[task_name] = task_results["pass@1,none"]

    return scores

def compare_models(model_dirs: dict) -> None:
    """Compare multiple model evaluation results side by side."""
    all_scores = {}

    for model_name, results_dir in model_dirs.items():
        try:
            results = load_results(results_dir)
            scores = extract_scores(results)
            all_scores[model_name] = scores
        except FileNotFoundError as e:
            print(f"Warning: {e}")
            continue

    # Get union of all tasks
    all_tasks = set()
    for scores in all_scores.values():
        all_tasks.update(scores.keys())

    # Print comparison table
    task_list = sorted(all_tasks)
    model_list = list(all_scores.keys())

    header = f"{'Task':<40}" + "".join(f"{m:<15}" for m in model_list)
    print(header)
    print("-" * len(header))

    for task in task_list:
        row = f"{task:<40}"
        for model in model_list:
            score = all_scores.get(model, {}).get(task, None)
            if score is not None:
                row += f"{score*100:<15.1f}"
            else:
                row += f"{'N/A':<15}"
        print(row)

# Example usage
model_dirs = {
    "Mistral-7B": "./results/mistral-7b-full",
    "Llama3-8B": "./results/llama3-8b-full",
}
compare_models(model_dirs)

Custom Task Evaluation

# You can write custom tasks in lm-eval format
# This is useful for domain-specific evaluation

from pathlib import Path

# Define a custom YAML task config. A raw string keeps the \n escapes literal
# in the written file, where YAML interprets them inside double-quoted scalars.
custom_task_yaml = r"""
task: my_document_extraction_task
dataset_path: json
dataset_kwargs:
  data_files:
    test: ./data/extraction_test.jsonl
doc_to_text: "Extract the vendor name from this invoice:\n{{document}}\nVendor name:"
doc_to_target: "{{vendor_name}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
output_type: generate_until
generation_kwargs:
  until:
    - "\n"
    - "."
  max_gen_toks: 50
"""

# Save the custom task config
Path("./custom_tasks").mkdir(exist_ok=True)
with open("./custom_tasks/my_extraction_task.yaml", "w") as f:
    f.write(custom_task_yaml)

# Run evaluation with the custom task
# lm_eval --model hf \
#   --model_args pretrained=your-model \
#   --tasks my_document_extraction_task \
#   --include_path ./custom_tasks \
#   --output_path ./results/custom_eval

Computing Benchmark Correlation With Your Task

from scipy.stats import spearmanr

def compute_benchmark_task_correlation(
    benchmark_scores: dict,  # {model_name: benchmark_score}
    task_scores: dict,       # {model_name: your_task_score}
) -> dict:
    """
    Compute correlation between leaderboard benchmarks and your task.

    Use this to determine which benchmarks are predictive for your use case.
    If a benchmark has low correlation, it is not a useful proxy for your task.
    """
    models = list(set(benchmark_scores.keys()) & set(task_scores.keys()))

    if len(models) < 5:
        print("Warning: fewer than 5 models - correlation estimate unreliable")

    b_scores = [benchmark_scores[m] for m in models]
    t_scores = [task_scores[m] for m in models]

    rho, p_value = spearmanr(b_scores, t_scores)

    return {
        "spearman_rho": rho,
        "p_value": p_value,
        "n_models": len(models),
        "is_significant": p_value < 0.05,
        "interpretation": (
            "strong predictor" if abs(rho) > 0.7 else
            "moderate predictor" if abs(rho) > 0.4 else
            "weak predictor - do not rely on this benchmark"
        )
    }

# Example: you evaluated 10 models on MMLU and on your extraction task
mmlu_scores = {
    "Mistral-7B": 0.624, "Llama3-8B": 0.684, "Phi-3-mini": 0.688,
    "Gemma-7B": 0.643, "Qwen2-7B": 0.707, "Yi-6B": 0.642,
    "Falcon-7B": 0.555, "MPT-7B": 0.570, "StableLM-7B": 0.430,
    "OpenLlama-7B": 0.420
}

your_task_scores = {
    "Mistral-7B": 0.82, "Llama3-8B": 0.87, "Phi-3-mini": 0.79,
    "Gemma-7B": 0.81, "Qwen2-7B": 0.89, "Yi-6B": 0.80,
    "Falcon-7B": 0.71, "MPT-7B": 0.68, "StableLM-7B": 0.55,
    "OpenLlama-7B": 0.52
}

result = compute_benchmark_task_correlation(mmlu_scores, your_task_scores)
print(f"MMLU vs your task: rho={result['spearman_rho']:.2f}, {result['interpretation']}")

Production Engineering Notes

GPU Memory Requirements for Evaluation

Running lm-evaluation-harness requires loading the full model. For a 7B parameter model in float16, you need approximately 14GB of GPU memory. For a 13B model, approximately 26GB. Use the following rough formula:

$$\text{VRAM (GB)} \approx \frac{\text{parameters} \times \text{bytes per parameter}}{10^9}$$

For float16 (2 bytes): a 7B model needs $7 \times 10^9 \times 2 / 10^9 = 14$ GB. Add a roughly 20% buffer for activations and KV cache during evaluation.
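
A small helper that applies this rule of thumb (the 20% overhead factor is the heuristic from the text, not a measured constant):

def estimate_vram_gb(n_params_billion: float, bytes_per_param: float = 2.0,
                     overhead: float = 0.20) -> float:
    """Rough VRAM needed to load a model for evaluation.
    bytes_per_param: 2.0 for float16/bfloat16, 1.0 for int8, 0.5 for 4-bit."""
    weights_gb = n_params_billion * bytes_per_param
    return weights_gb * (1 + overhead)

for size in (7, 13, 70):
    print(f"{size}B fp16: ~{estimate_vram_gb(size):.0f} GB")
# 7B ~17 GB, 13B ~31 GB, 70B ~168 GB - a 70B model needs multiple GPUs or quantization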

Batch Size and Throughput

Evaluation time scales inversely with batch size up to the memory limit. A reasonable batch size for 7B models on an A100 (80GB) is 16-32 for MMLU. For GSM8K (which requires generation), batch size must be smaller because the generated sequences consume additional memory.

Use --batch_size auto to let the harness find the maximum batch size that fits in memory without OOM errors.

Reproducing Leaderboard Numbers

The leaderboard uses specific prompt formats, few-shot counts, and normalization settings. Small changes in any of these can shift scores by 1-3 points. To reproduce leaderboard numbers exactly:

  1. Use the same version of lm-evaluation-harness that the leaderboard used (check the model card for the commit hash)
  2. Use the same few-shot counts: MMLU=5, ARC-Challenge=25, HellaSwag=10, WinoGrande=5, GSM8K=5, TruthfulQA=0
  3. Use the same normalization: some tasks use acc_norm (length-normalized log-probabilities) rather than acc
  4. Apply any chat template that was used during fine-tuning - instruct models require the correct template to reproduce their reported scores

Evaluation on Limited Hardware

If you do not have 80GB GPUs, you can still run evaluations:

# 4-bit quantization reduces memory roughly 4x
lm_eval \
--model hf \
--model_args pretrained=mistralai/Mistral-7B-v0.1,load_in_4bit=True \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 4 \
--device cuda:0

Note that quantization affects scores. Scores under 4-bit quantization are typically 0.5-2 points lower than fp16 scores. Keep this in mind when comparing your local results to leaderboard numbers.

Evaluating Multiple Models in a CI Pipeline

A practical pattern is to run a subset of benchmarks on every model candidate before doing a full evaluation. The fast filter approach runs MMLU (5-shot) and ARC-Challenge (25-shot) first because they complete in 30-60 minutes on a single A100. If a model fails to exceed a minimum bar on these fast benchmarks, it is eliminated before the expensive full suite.

#!/bin/bash
# fast_filter.sh — run the fast benchmark subset first

MODEL=$1
OUTPUT_DIR="./results/${MODEL//\//-}"
THRESHOLD_MMLU=0.60 # drop models below 60% MMLU
THRESHOLD_ARC=0.55 # drop models below 55% ARC-Challenge

# --num_fewshot takes a single integer, so each fast benchmark runs in its
# own pass with its leaderboard few-shot count; both write to the same directory.
lm_eval \
--model hf \
--model_args "pretrained=${MODEL},dtype=float16" \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 8 \
--output_path "${OUTPUT_DIR}/fast_filter" \
--device cuda:0

lm_eval \
--model hf \
--model_args "pretrained=${MODEL},dtype=float16" \
--tasks arc_challenge \
--num_fewshot 25 \
--batch_size 8 \
--output_path "${OUTPUT_DIR}/fast_filter" \
--device cuda:0

python scripts/check_threshold.py \
--results "${OUTPUT_DIR}/fast_filter" \
--mmlu-threshold $THRESHOLD_MMLU \
--arc-threshold $THRESHOLD_ARC \
--exit-on-fail # exits with code 1 if model fails threshold

# check_threshold.py
import json
import sys
import argparse
from pathlib import Path

def check_threshold(results_dir: str, mmlu_threshold: float, arc_threshold: float, exit_on_fail: bool) -> bool:
    result_files = sorted(Path(results_dir).glob("*.json"))
    if not result_files:
        print("No result files found")
        return False

    # Merge results from every run in the directory (MMLU and ARC-Challenge
    # may have been produced by separate lm_eval invocations)
    results = {}
    for result_file in result_files:
        with open(result_file) as f:
            results.update(json.load(f).get("results", {}))

    # Average MMLU across all subtasks
    mmlu_scores = [v.get("acc_norm,none", v.get("acc,none", 0))
                   for k, v in results.items() if k.startswith("mmlu_")]
    avg_mmlu = sum(mmlu_scores) / len(mmlu_scores) if mmlu_scores else 0

    arc_score = results.get("arc_challenge", {}).get("acc_norm,none", 0)

    print(f"MMLU average: {avg_mmlu*100:.1f}% (threshold: {mmlu_threshold*100:.0f}%)")
    print(f"ARC-Challenge: {arc_score*100:.1f}% (threshold: {arc_threshold*100:.0f}%)")

    passes = avg_mmlu >= mmlu_threshold and arc_score >= arc_threshold

    if not passes:
        print("FAIL: model does not meet minimum thresholds")
        if exit_on_fail:
            sys.exit(1)
    else:
        print("PASS: model meets minimum thresholds, proceed to full evaluation")

    return passes

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--results", required=True)
    parser.add_argument("--mmlu-threshold", type=float, default=0.60)
    parser.add_argument("--arc-threshold", type=float, default=0.55)
    parser.add_argument("--exit-on-fail", action="store_true")
    args = parser.parse_args()
    check_threshold(args.results, args.mmlu_threshold, args.arc_threshold, args.exit_on_fail)

Tracking Benchmark Scores Over Time

Benchmark scores should be version-controlled along with model checkpoints. A minimal tracking pattern stores scores as JSON in a dedicated directory committed to the repository, so every model version has a persistent score record:

import json
import datetime
from pathlib import Path

def save_benchmark_record(
    model_name: str,
    scores: dict,
    eval_harness_version: str,
    records_dir: str = "./eval_records"
) -> None:
    """Save benchmark scores to a versioned JSON record."""
    record = {
        "model": model_name,
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "eval_harness_version": eval_harness_version,
        "scores": scores,
    }

    Path(records_dir).mkdir(exist_ok=True)
    slug = model_name.replace("/", "-")
    filename = f"{records_dir}/{slug}_{datetime.date.today()}.json"

    with open(filename, "w") as f:
        json.dump(record, f, indent=2)

    print(f"Saved benchmark record to {filename}")

Common Mistakes

:::danger Do not use aggregate scores to make deployment decisions

The leaderboard shows an average score across multiple benchmarks. A model averaging 68% might score 80% on MMLU and 56% on GSM8K. If your application requires math reasoning, this model is poorly suited regardless of its aggregate rank. Always decompose the aggregate into per-benchmark scores and map each benchmark to your task domain before drawing conclusions.

:::

:::danger Do not assume a high MMLU score means the model is smart

MMLU is a multiple-choice knowledge retrieval test. A model with 76% MMLU can still generate factually incorrect long-form text, fail to follow multi-step instructions, and hallucinate document details. MMLU measures what the model knows, not whether it uses knowledge correctly in generation mode. These are different capabilities.

:::

:::warning Benchmark numbers from different evaluation setups are not comparable

If one paper evaluates a model with 0-shot MMLU and another uses 5-shot, those numbers are not comparable. 5-shot MMLU scores are typically 3-8 points higher than 0-shot for the same model. Always check the evaluation setup before comparing numbers from different sources.

:::

:::warning Chat template matters for instruct models

Instruct-tuned models (Mistral-Instruct, Llama-3-Instruct, etc.) expect input formatted with their specific system/user/assistant template. If you run lm-evaluation-harness without specifying the correct chat template, you may get scores 5-15 points below the model's actual capability. Check the model card and set --apply_chat_template when evaluating instruct variants.

:::

:::warning Benchmark saturation invalidates comparisons at the top of the distribution

When many models cluster above 85% on HellaSwag or above 90% on WinoGrande, differences of 0.1-0.3 points are within noise. These benchmarks have saturated for frontier models. At saturation, the benchmark is no longer measuring what you think it is measuring - it is measuring small numerical artifacts. Use v2 benchmarks (GPQA, MUSR, MATH) for discriminating between strong models.

:::


Interview Q&A

Q1: What does a high MMLU score actually tell you about a model, and what does it not tell you?

A high MMLU score tells you that the model has broad factual and conceptual knowledge encoded in its weights across a wide range of academic subjects. The model was exposed to sufficient training data covering medicine, law, mathematics, history, science, and other domains that it can answer 4-choice multiple choice questions correctly. This correlates moderately with general intelligence and knowledge breadth.

What MMLU does not tell you: how the model performs in generation mode (as opposed to scoring a single answer token), whether it can reason about unfamiliar combinations of facts, how it handles long documents, whether it follows instruction formats, or whether its generated text is factually consistent. A 75% MMLU model can still produce heavily hallucinated outputs in open-ended generation. The test format - selecting the highest-probability token from four options - is fundamentally different from generating coherent, accurate text.


Q2: Explain benchmark contamination. Why is it a problem and how do you detect it?

Benchmark contamination occurs when examples from a benchmark's test set appear in a model's training data. Because common web corpora contain educational websites, GitHub repositories, and forum posts that quote benchmark questions and answers, models trained on broad internet data have a non-trivial probability of having memorized specific test examples. When the model is evaluated on those examples, it is retrieving memorized answers rather than demonstrating reasoning, which inflates the measured score.

Detection approaches: (1) Check whether the model performs significantly better on the exact benchmark phrasing than on semantically equivalent paraphrases. A large gap suggests memorization. (2) Evaluate on held-out benchmark splits or newer benchmarks that post-date the model's training data cutoff. (3) Check if the model's logit distribution is unusually sharp on benchmark questions compared to similar out-of-distribution questions - memorized examples produce high-confidence predictions. (4) Some developers publish decontamination reports that list how many n-grams from the benchmark appeared in training data.


Q3: The Open LLM Leaderboard introduced v2 benchmarks (GPQA, MUSR, MATH, IFEval, BBH). Why were the original benchmarks replaced?

The original v1 benchmarks (MMLU, ARC, HellaSwag, WinoGrande, GSM8K, TruthfulQA) were replaced for three reasons.

First, saturation: by 2024, many models were scoring 80-90% on HellaSwag and WinoGrande. At that level the benchmarks no longer discriminate between models - the variance is noise. You cannot use a saturated benchmark to choose between two strong models.

Second, contamination: the v1 benchmarks had been public for years and were almost certainly present in the training data of most models evaluated. The v2 benchmarks, particularly GPQA (created more recently with expert questions designed not to be findable via web search), are harder to contaminate.

Third, predictive validity: analysis showed that v1 scores did not correlate as well with human preference evaluations (like Chatbot Arena ELO ratings) as expected. The v2 benchmarks, particularly GPQA and IFEval, correlate better with the kinds of capabilities humans care about in practice.


Q4: How would you use lm-evaluation-harness to decide between three candidate models for a document summarization application?

I would run a two-stage evaluation. In the first stage, I would use lm-evaluation-harness to narrow the field by running benchmarks that correlate with summarization-relevant capabilities: MMLU (knowledge breadth, since good summaries require understanding the document domain), TruthfulQA (factual reliability, since summaries must not introduce false claims), and BBH (instruction following and reasoning). I would look for models that score consistently well on these three rather than strong on one and weak on others.

In the second stage, I would build a domain-specific evaluation set of 150-300 documents representative of the actual production distribution - same length, same domain, same format. I would generate summaries from all three models and evaluate using a combination of automated metrics (BERTScore for semantic similarity to reference summaries, factual consistency using an NLI model against the source document) and LLM-as-judge scoring on a rubric covering conciseness, completeness, and factual accuracy.

The leaderboard scores serve as a filter to eliminate obviously unsuitable models, not as a final decision. The domain-specific evaluation drives the final choice.


Q5: A model has a 71% MMLU score but only 42% on TruthfulQA. What does this combination tell you, and for which applications would you consider versus avoid this model?

This combination suggests a model that has broad knowledge encoded in its weights but is prone to reproducing common misconceptions and plausible-sounding falsehoods rather than correcting them. High MMLU means the model knows a lot. Low TruthfulQA means the model does not always choose accuracy over plausibility when the two conflict.

Applications where I would consider this model: tasks where the input fully constrains the output (code generation, structured extraction, classification, format conversion) - these tasks require knowledge and instruction following but do not require the model to adjudicate between accurate and plausible when they differ. Also useful as a retrieval-augmented generation reader where the retrieved context grounds the output.

Applications where I would avoid this model: open-domain question answering without retrieval (where the model must rely on parametric memory and choose accurate over plausible answers), medical or legal information provision (where wrong plausible answers cause harm), fact-checking tasks, or any application where the model's outputs are consumed without human review. The 42% TruthfulQA score is a signal that this model actively generates confident falsehoods at a meaningful rate.


Q6: What is the difference between acc and acc_norm in lm-evaluation-harness output, and when does the choice matter?

acc (accuracy) selects the answer option whose total log-probability is highest, without any normalization. acc_norm (length-normalized accuracy) divides each option's total log-probability by its length before selecting the highest.

The choice matters when answer options have very different lengths. If option A is "Yes" and option B is "No, because the treaty of 1842 specified that..." then option B has a lower total log-probability simply because it is longer - more tokens means more probability mass multiplied together, and multiplication of probabilities less than 1 always produces a smaller result. Without normalization, the model appears to prefer short answers even when it actually assigns higher per-token probability to the long correct answer.

For benchmarks whose answer options vary in length (ARC-Challenge, HellaSwag), the evaluation harness and the leaderboard report acc_norm precisely because of this length effect; MMLU, where the scored options are the single letters A through D, uses plain acc. Using the wrong metric will not match leaderboard numbers, and the choice can shift scores by 2-5 points - large enough to change your model ranking.
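
A toy illustration of the difference: the per-token log-probabilities below are invented and the normalization is by token count purely for illustration; the point is only that normalizing by length can flip which option wins.

options = {
    # hypothetical per-token log-probabilities for two answer options
    "A (short)": [-1.2, -0.9],                          # "Yes" -> 2 tokens
    "B (long)":  [-0.3, -0.4, -0.5, -0.4, -0.3, -0.5],  # longer, but higher per-token probability
}

def pick(scores: dict, normalize: bool) -> str:
    key = lambda o: (sum(scores[o]) / len(scores[o])) if normalize else sum(scores[o])
    return max(scores, key=key)

print("acc-style pick:     ", pick(options, normalize=False))  # short answer wins on raw sum
print("acc_norm-style pick:", pick(options, normalize=True))   # long answer wins after normalization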


Summary

The HuggingFace Open LLM Leaderboard solved the reproducibility problem that plagued LLM evaluation before 2023. Every score on the leaderboard was produced with the same code, the same prompts, and the same hardware. That standardization has real value.

But standardized benchmarks create standardized optimization targets, and the leaderboard has been gamed - through accidental contamination, deliberate fine-tuning to benchmark distributions, and checkpoint selection. The v2 benchmarks are harder to game and correlate better with real capability, but the fundamental limitation remains: any benchmark measures performance on that specific benchmark, not on your specific task.

The right way to use leaderboard scores is as a filter: narrow down to a short list of models that score well on benchmarks correlated with your use case, then use your own domain-specific evaluation to make the final decision. The leaderboard is a starting point, not a finish line.

In the next lesson, we will build the domain-specific evaluation infrastructure that turns that starting point into a reliable production decision.
