Module 06: LLM Evaluation

Evaluation is the discipline that separates research demos from production systems. A model that scores 90% on MMLU might hallucinate on every third user query. A model with low perplexity on Wikipedia might refuse to answer legitimate questions. Evaluation is not a checkbox - it is an ongoing engineering practice.

This module covers the full evaluation stack: from fundamental metrics like perplexity, through generation quality metrics, human evaluation methodology, automated LLM judging, benchmark ecosystems, safety evaluation, RAG-specific metrics, and finally production monitoring. Each layer builds on the last.

Why Evaluation Is Hard

The core challenge is that language is high-dimensional and context-dependent. There is no single number that captures whether a model is "good." A model might excel at factual recall but fail at reasoning. It might be accurate but unsafe. It might perform well on benchmarks that leaked into its training data. Good evaluation requires a portfolio of complementary techniques.

Module Map

Lessons in This Module

#	Lesson	Core Concept	Key Tools
01	Perplexity and Language Model Metrics	How surprised is the model?	HuggingFace Transformers
02	BLEU, ROUGE, and Generation Metrics	Reference-based quality	`evaluate`, BERTScore
03	Human Evaluation	Ground truth from humans	Annotation platforms, Elo ratings
04	LLM-as-Judge	Automated evaluation with LLMs	OpenAI, Anthropic, Prometheus
05	Benchmarks: MMLU, HumanEval, HELM	Standardized capability measurement	`lm-eval-harness`
06	Safety and Bias Evaluation	Harmful outputs, stereotypes, jailbreaks	ToxiGen, HarmBench
07	RAG Evaluation Metrics	Faithfulness, relevance, groundedness	RAGAS, TruLens
08	Production Monitoring for LLMs	Latency, drift, cost, safety	LangSmith, Langfuse

Prerequisites

Familiarity with LLM architecture and training (Modules 01–05)
Basic statistics: probability distributions, expected value, variance
Python proficiency: working with HuggingFace and API clients

Key Concepts Glossary

Perplexity - The exponential of the average negative log-likelihood per token on a test set, equivalently the inverse of the geometric mean of the per-token probabilities; lower is better.
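To make the perplexity definition concrete, here is a minimal sketch that computes it from a list of per-token probabilities. The function name `perplexity` and the input format are assumptions for illustration; in practice the probabilities would come from a model's output distribution rather than a hand-written list.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each observed token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token (natural log).
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiating the average NLL gives the inverse geometric mean probability.
    return math.exp(avg_nll)

# A model that spreads probability evenly over 4 choices has perplexity of about 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that is confident about each token has perplexity close to 1.
confident = perplexity([0.9, 0.8, 0.95])
```

One way to read the result: a perplexity of about 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens at each step.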

BLEU - Bilingual Evaluation Understudy; n-gram precision metric originally designed for machine translation.
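The n-gram precision at the heart of BLEU can be sketched in a few lines. This is a simplified illustration, assuming a single reference and pre-tokenized input; it omits the brevity penalty and the geometric mean over n-gram orders that full BLEU uses. The helper names `ngrams` and `clipped_precision` are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Clipping: an n-gram is credited at most as often as it appears in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()
unigram_precision = clipped_precision(candidate, reference, 1)  # 2 of 7 credited
```

Clipping is what stops a degenerate candidate that repeats one common word from scoring perfect precision.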

ROUGE - Recall-Oriented Understudy for Gisting Evaluation; recall-based metric for summarization.

BERTScore - Semantic similarity metric using contextual embeddings from BERT.

LLM-as-Judge - Using a capable LLM (e.g., GPT-4) to score outputs of other models.

MT-Bench - Multi-turn benchmark for evaluating LLM conversation quality using GPT-4 as judge.

MMLU - Massive Multitask Language Understanding; 57-subject multiple-choice benchmark.

HumanEval - 164 Python programming problems evaluated by running unit tests.

HELM - Holistic Evaluation of Language Models; multi-metric evaluation framework from Stanford.

RAGAS - RAG Assessment framework for evaluating faithfulness, relevance, and context metrics.

RAG Triad - Answer Relevance + Context Relevance + Groundedness (TruLens framework).

Elo Rating - Chess-derived rating system applied to pairwise human preference comparisons between models.
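As a rough illustration of how a chess-style rating turns pairwise preferences into a ranking, the sketch below applies the standard Elo update after one comparison. The K-factor of 32 and the starting rating of 1000 are assumptions chosen for the example, not values prescribed by this module.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, score_a, k=32):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta

# Two models start level; an annotator prefers model A's answer.
model_a, model_b = update_elo(1000, 1000, score_a=1.0)
```

An upset (a low-rated model beating a high-rated one) moves ratings more than an expected result, because the update scales with the gap between the actual and expected score.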

Inter-Annotator Agreement (IAA) - Statistical measure of how consistently human annotators label the same data, commonly reported as Cohen's kappa.
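Cohen's kappa corrects raw agreement for the agreement two annotators would reach by chance. The sketch below is a minimal two-annotator version written from the standard formula; the function name and the handling of the degenerate case (both annotators using one identical label throughout) are assumptions made for this example.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label independently,
    # based on each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Degenerate case: kappa is undefined; treated as perfect agreement here.
        return 1.0
    return (observed - expected) / (1 - expected)

annotator_1 = ["good", "bad", "good", "bad"]
annotator_2 = ["good", "good", "bad", "bad"]
kappa = cohens_kappa(annotator_1, annotator_2)  # 50% raw agreement, all of it chance
```

In this example the annotators agree on half the items, but with two balanced labels that is exactly what chance predicts, so kappa is zero.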

Benchmark Contamination - When test data appears in training data, inflating benchmark scores.
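One crude way to look for contamination is to check how many of a test item's word n-grams also occur verbatim in a training document. The sketch below is a heuristic under stated assumptions: whitespace tokenization, lowercase matching, and a default window of 8 words are all illustrative choices, and the function names are hypothetical. It will miss paraphrased or reformatted leaks.

```python
def ngram_set(text, n):
    """Set of lowercase word n-grams in a text, using whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(test_item, training_doc, n=8):
    """Fraction of the test item's n-grams that appear verbatim in the training doc."""
    test_grams = ngram_set(test_item, n)
    if not test_grams:
        return 0.0  # test item is shorter than n words
    return len(test_grams & ngram_set(training_doc, n)) / len(test_grams)

question = "What is the capital of France"
scraped_page = "Quiz answers: what is the capital of France ? Paris."
leak_score = overlap_fraction(question, scraped_page, n=3)
```

A high overlap fraction is a signal to investigate, not proof of contamination, since common phrases can overlap by coincidence when n is small.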

Red Teaming - Adversarial evaluation where humans or automated systems attempt to elicit harmful outputs.

TTFT - Time to First Token; key latency metric for streaming LLM responses.
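TTFT can be measured around any token stream by timestamping the first item it yields. The sketch below assumes the stream is a plain Python iterator; `fake_stream` is a hypothetical stand-in for a real streaming API client, with a sleep simulating model and network latency.

```python
import time

def measure_ttft(stream):
    """Consume a token stream; return (ttft_seconds, total_seconds, tokens).

    ttft_seconds is None if the stream yields nothing.
    """
    start = time.perf_counter()
    first_token_at = None
    tokens = []
    for token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token has arrived
        tokens.append(token)
    end = time.perf_counter()
    ttft = None if first_token_at is None else first_token_at - start
    return ttft, end - start, tokens

def fake_stream(delay_s=0.05, tokens=("Hello", ",", " world")):
    """Hypothetical stand-in for a streaming response: waits, then yields tokens."""
    time.sleep(delay_s)
    for token in tokens:
        yield token

ttft, total, tokens = measure_ttft(fake_stream())
```

Starting the timer before the first iteration matters: with a generator, no work happens until the first item is requested, so the initial delay is correctly counted toward TTFT.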

How to Use This Module

Each lesson can be read on its own, though later lessons build on earlier ones. If you are building a RAG system today, jump to Lesson 07. If you are deploying a model to production, Lesson 08 is critical. If you are hiring ML engineers or interviewing for such a role, the interview sections in every lesson cover the questions that actually get asked.

The code examples use real libraries - `evaluate`, `ragas`, `langsmith`, `langchain`, and the `anthropic` SDK. All examples are designed to run with minimal setup.
