
Module 6: Evaluating Open Models

Benchmarks are lying to you. MMLU scores measure knowledge recall, not reasoning. HumanEval is heavily contaminated by training data, so its scores run inflated. MT-Bench scores are sensitive to the choice of judge model. If you pick a model based on leaderboard rankings without understanding what each benchmark measures and how it can be gamed, you will ship a model that looks great on paper and underperforms in production.

The only eval that matters is the one you build on your own data, for your own task, measuring what your users actually experience. This module teaches you how to build that eval, how to use LLM-as-judge to scale it, and how to run regression tests after every fine-tuning run.
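To make that concrete, here is a minimal sketch of a task-specific eval loop. Everything in it is an illustrative assumption rather than a prescribed API: a golden set stored as JSONL with `prompt` and `expected` fields, an exact-match scorer, and a `generate` callable that wraps whatever model client you actually use (local pipeline, vLLM server, hosted endpoint).

```python
import json

def load_golden_set(path):
    """Load prompt/expected pairs from a JSONL file (one test case per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def exact_match(output, expected):
    """Simplest possible scorer; swap in regex, rubric, or judge-based scoring."""
    return output.strip().lower() == expected.strip().lower()

def run_eval(generate, golden_path):
    """Run every golden case through the model and report the pass rate.

    `generate` is any callable that takes a prompt string and returns
    the model's completion as a string.
    """
    cases = load_golden_set(golden_path)
    passed = 0
    failures = []
    for case in cases:
        output = generate(case["prompt"])
        if exact_match(output, case["expected"]):
            passed += 1
        else:
            failures.append({"prompt": case["prompt"], "got": output})
    return passed / len(cases), failures
```

The scorer matters less than the shape: a versioned test set, a deterministic loop, and a single number you can track from run to run.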

The Eval Hierarchy

Lessons in This Module

#  Lesson                                  Key Concept
1  Standard Benchmarks Overview            MMLU, HumanEval, GSM8K, MT-Bench - what each measures
2  Benchmark Contamination Problem         Data leakage into training, decontamination methods
3  Building Your Own Eval Suite            Golden dataset creation, test case design
4  LLM-as-Judge for Open Models            Prometheus, Llama 3 as judge, inter-rater reliability
5  Domain-Specific Evaluation              Vertical benchmarks, custom rubrics
6  Comparing Open vs Closed Models         Fair comparison methodology, cost-adjusted metrics
7  Regression Testing After Fine-Tuning    Automated regression gates, CI/CD for model quality
8  Eval-Driven Model Selection             Using evals to make deployment decisions

Key Concepts You Will Master

  • Benchmark contamination detection - checking whether a model's training data includes your benchmark (sketched below)
  • LLM-as-judge calibration - measuring agreement between your judge model and human raters (sketched below)
  • Golden dataset construction - building a test set that covers failure modes, not just easy cases
  • A/B evaluation methodology - comparing two models fairly on the same prompt set
  • Regression test gates - automating evaluation so fine-tuning runs cannot degrade production quality (sketched below)
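A hedged sketch of the contamination check above: word n-gram overlap between your eval prompts and sampled training text is a common heuristic, with 13-gram windows a frequently used choice. The `corpus_docs` iterable and the window size are assumptions here, not the API of any particular decontamination tool.

```python
def ngrams(text, n=13):
    """Lowercased word n-grams of a text; 13 is a commonly used window size."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(eval_prompts, corpus_docs, n=13):
    """Fraction of eval prompts sharing at least one n-gram with the corpus.

    `corpus_docs` is any iterable of training-corpus text chunks you can
    sample; a nonzero rate means those prompts may have been seen in training.
    """
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = sum(1 for p in eval_prompts if ngrams(p, n) & corpus_grams)
    return flagged / len(eval_prompts) if eval_prompts else 0.0
```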
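For judge calibration, Cohen's kappa is one standard way to quantify agreement between judge-model verdicts and human verdicts while correcting for chance. This is a self-contained sketch of the textbook formula; the pass/fail label names and the interpretation thresholds in the comments are conventions, not hard rules.

```python
def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between judge and human categorical verdicts.

    Values above ~0.6 are usually read as substantial agreement; below ~0.4,
    revise the judge prompt or rubric before trusting its scores.
    """
    assert judge_labels and len(judge_labels) == len(human_labels)
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    labels = set(judge_labels) | set(human_labels)
    expected = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Example: judge agrees with humans on 4 of 5 verdicts.
# cohens_kappa(["pass", "pass", "fail", "pass", "fail"],
#              ["pass", "fail", "fail", "pass", "fail"])  # ~0.62
```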
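Finally, a sketch of a regression test gate as a pytest-style check that fails CI when a fine-tuned candidate drops below the production baseline. It reuses the `run_eval` helper from the harness sketch near the top of this page; the `eval_harness` module name, the baseline score, the tolerance, and the `candidate_generate` stub are all placeholders for whatever your pipeline provides.

```python
# test_regression_gate.py - run by CI after every fine-tuning job.
# run_eval() is the harness sketch above, assumed to live in a local module.
from eval_harness import run_eval

BASELINE_PASS_RATE = 0.87   # current production model's pass rate (illustrative)
TOLERANCE = 0.02            # largest drop the gate will tolerate

def candidate_generate(prompt: str) -> str:
    # Wrap the newly fine-tuned checkpoint's inference call here.
    raise NotImplementedError

def test_candidate_does_not_regress():
    score, failures = run_eval(candidate_generate, "golden.jsonl")
    floor = BASELINE_PASS_RATE - TOLERANCE
    assert score >= floor, (
        f"candidate pass rate {score:.3f} fell below the gated floor {floor:.3f}; "
        f"sample failures: {failures[:3]}"
    )
```

Wire the gate into the same pipeline step that would otherwise promote the checkpoint, so a failing eval blocks deployment instead of just logging a warning.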

Prerequisites
