Module 06: LLM Evaluation
Evaluation is the discipline that separates research demos from production systems. A model that scores 90% on MMLU might hallucinate on every third user query. A model with low perplexity on Wikipedia might refuse to answer legitimate questions. Evaluation is not a checkbox - it is an ongoing engineering practice.
This module covers the full evaluation stack. It starts with fundamental metrics like perplexity, then moves through generation quality metrics, human evaluation methodology, automated LLM judging, benchmark ecosystems, safety evaluation, and RAG-specific metrics, and ends with production monitoring. Each layer builds on the last.
Why Evaluation Is Hard
The core challenge is that language is high-dimensional and context-dependent. There is no single number that captures whether a model is "good." A model might excel at factual recall but fail at reasoning. It might be accurate but unsafe. It might perform well on benchmarks that leaked into its training data. Good evaluation requires a portfolio of complementary techniques.
Lessons in This Module
| # | Lesson | Core Concept | Key Tools |
|---|---|---|---|
| 01 | Perplexity and Language Model Metrics | How surprised is the model? | HuggingFace Transformers |
| 02 | BLEU, ROUGE, and Generation Metrics | Reference-based quality | evaluate, BERTScore |
| 03 | Human Evaluation | Ground truth from humans | Annotation platforms, Elo |
| 04 | LLM-as-Judge | Automated evaluation with LLMs | OpenAI, Anthropic, Prometheus |
| 05 | Benchmarks: MMLU, HumanEval, HELM | Standardized capability measurement | lm-eval-harness |
| 06 | Safety and Bias Evaluation | Harmful outputs, stereotypes, jailbreaks | ToxiGen, HarmBench |
| 07 | RAG Evaluation Metrics | Faithfulness, relevance, groundedness | RAGAS, TruLens |
| 08 | Production Monitoring for LLMs | Latency, drift, cost, safety | LangSmith, Langfuse |
Prerequisites
- Familiarity with LLM architecture and training (Modules 01–05)
- Basic statistics: probability distributions, expected value, variance
- Python proficiency: working with HuggingFace and API clients
Key Concepts Glossary
Perplexity - The exponentiated average negative log-likelihood per token on a test set, equivalently the inverse of the geometric mean of the per-token probabilities; lower is better.
BLEU - Bilingual Evaluation Understudy; n-gram precision metric originally designed for machine translation.
ROUGE - Recall-Oriented Understudy for Gisting Evaluation; recall-based metric for summarization.
BERTScore - Semantic similarity metric using contextual embeddings from BERT.
LLM-as-Judge - Using a capable LLM (e.g., GPT-4) to score outputs of other models.
MT-Bench - Multi-turn benchmark for evaluating LLM conversation quality using GPT-4 as judge.
MMLU - Massive Multitask Language Understanding; 57-subject multiple-choice benchmark.
HumanEval - 164 Python programming problems evaluated by running unit tests.
HELM - Holistic Evaluation of Language Models; multi-metric evaluation framework from Stanford.
RAGAS - RAG Assessment framework for evaluating faithfulness, relevance, and context metrics.
RAG Triad - Answer Relevance + Context Relevance + Groundedness (TruLens framework).
Elo Rating - Chess-derived rating system, applied here to rank models from pairwise human preference comparisons.
Inter-Annotator Agreement (IAA) - Statistical measure of how consistently human annotators label the same data, for example Cohen's kappa for two annotators.
Benchmark Contamination - When test data appears in training data, inflating benchmark scores.
Red Teaming - Adversarial evaluation where humans or automated systems attempt to elicit harmful outputs.
TTFT - Time to First Token; key latency metric for streaming LLM responses.
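Three of the glossary entries above (perplexity, inter-annotator agreement, and Elo rating) reduce to short formulas. The following is a minimal pure-Python sketch of each, not the implementation used in the lessons. The function names are hypothetical, the K-factor of 32 is an assumed conventional default, and the inputs are toy values rather than real model outputs.

```python
import math
from collections import Counter


def perplexity(token_probs):
    """Exponentiated mean negative log-probability per token.

    token_probs: probabilities the model assigned to each observed token.
    """
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement.

    Undefined (division by zero) when chance agreement is exactly 1,
    e.g. when both annotators use a single identical label throughout.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


def elo_update(rating_a, rating_b, score_a, k=32):
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    score_a is 1 if A was preferred, 0 if B was preferred, 0.5 for a tie.
    k=32 is an assumed K-factor; the 400-point scale follows chess Elo.
    """
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token is as uncertain
    # as a uniform choice among four options.
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
    print(cohens_kappa(["a", "a", "b", "b"], ["a", "a", "b", "a"]))
    print(elo_update(1000, 1000, score_a=1))
```

In the toy inputs, the uniform 0.25 probabilities give a perplexity of about 4, which matches the intuition that perplexity is an effective branching factor. The two annotators agree on three of four items, but half of that agreement is expected by chance, so kappa comes out at 0.5 rather than 0.75.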
How to Use This Module
Each lesson is self-contained but builds on the previous one. If you are building a RAG system today, jump to Lesson 07. If you are deploying a model to production, Lesson 08 is critical. If you are interviewing ML engineers or preparing for ML engineering interviews, the interview sections in every lesson cover the questions that actually get asked.
The code examples use real libraries - evaluate, ragas, langsmith, langchain, and the anthropic SDK. All examples are designed to run with minimal setup.
