AI Evaluation - Module Overview
Evaluation is the part of AI engineering that software engineers consistently underestimate - and that AI engineers spend the most time on once they are in production.
In traditional software, a function is correct or it is not. You write a test, run it, and get a deterministic pass or fail. In AI systems, the same input can produce a different output on each run, there is often no oracle that definitively determines correctness, and the ways your system can fail are effectively unbounded. You cannot enumerate all the edge cases. You cannot write a unit test for "was this response actually helpful?"
This module is a systematic treatment of AI evaluation: what makes it hard, how practitioners approach it, and how to build evaluation infrastructure that actually scales to production.
Why Evaluation Is the Hardest Part
Three properties of language models make evaluation fundamentally different from software testing:
Non-determinism. The same prompt can produce a hundred valid responses, all slightly different. You cannot assert output == expected. You have to assert that the output is in the space of acceptable outputs - and defining that space is most of the work.
No oracle. In software, you can implement a reference solution and compare. In language model evaluation, there is often no programmatic oracle for "is this a good explanation of quantum entanglement?" Judgment requires judgment.
Emergent failures. Individual components may work correctly while their composition fails. A retriever that finds the right document, a model that summarizes accurately, and a prompt that asks the right question can together produce a response that hallucinates - because the failure lives in the interaction, not in any component.
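To make the non-determinism point concrete, the sketch below replaces an equality assertion with a small set of property checks that several differently worded responses can all satisfy. It is a minimal illustration only: the function name `is_acceptable`, the required facts, and the word budget are hypothetical choices, not part of any library or of this module's later code.

```python
def is_acceptable(response: str, required_facts: list[str], max_words: int = 120) -> bool:
    """Return True if the response satisfies every property check.

    Hypothetical checks: the response is non-empty, stays under a word
    budget, and mentions each required fact (case-insensitive substring).
    """
    text = response.strip()
    if not text:
        return False
    if len(text.split()) > max_words:
        return False
    lowered = text.lower()
    return all(fact.lower() in lowered for fact in required_facts)


# Two differently worded responses to the same prompt.
response_a = "Your order shipped on Monday and should arrive within 3 days."
response_b = "Good news: it shipped Monday. Expect delivery within 3 days."

facts = ["shipped", "monday", "3 days"]
results = [is_acceptable(r, facts) for r in (response_a, response_b)]
```

Both responses pass even though `response_a == response_b` is false. Checks like these only cover the cheap, surface-level part of the acceptable space; whether a response is actually helpful still needs the semantic methods covered later in the module.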
Module Structure
Lessons in This Module
| # | Lesson | Core Skill |
|---|---|---|
| 01 | Why AI Evaluation Is Hard | Understand the fundamental gap between software testing and AI evaluation |
| 02 | Offline vs. Online Evaluation | Design an evaluation strategy that bridges static datasets and production signals |
| 03 | LLM-as-Judge | Build calibrated, bias-corrected LLM judges that approximate human judgment |
| 04 | Building Golden Datasets | Curate datasets that actually represent production distribution |
| 05 | RAG-Specific Evaluation | Measure retrieval quality, faithfulness, and answer grounding |
| 06 | Regression Testing for Prompts | Catch prompt regressions before they reach production |
| 07 | Continuous Eval in CI/CD | Automate evaluation as a quality gate in your deployment pipeline |
The Central Challenge
The deepest problem in AI evaluation is not technical - it is epistemic. You are trying to measure something that requires judgment, using tools that have imperfect judgment, to make decisions about systems that are non-deterministic.
The practical answer is to use multiple signals at multiple layers: fast automated checks for obvious failures, LLM judges for semantic quality, human review for ambiguous cases, and online metrics for ground truth. No single method is sufficient. The goal is triangulation.
Every lesson in this module builds toward that layered evaluation infrastructure.
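The layering described above can be sketched as a simple router: cheap rules run first, a judge scores what survives, and ambiguous scores are escalated to a human. Everything here is an assumption made for illustration: the `EvalResult` type, the `layered_eval` function, the thresholds, and the stub layers. A real judge would call a model, and online metrics would sit outside this function entirely.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    verdict: str  # "pass", "fail", or "needs_human_review"
    layer: str    # which layer produced the verdict


def layered_eval(
    response: str,
    rule_check: Callable[[str], bool],
    judge_score: Callable[[str], float],
    pass_threshold: float = 0.8,
    fail_threshold: float = 0.4,
) -> EvalResult:
    # Layer 1: fast deterministic checks catch obvious failures.
    if not rule_check(response):
        return EvalResult("fail", "rules")
    # Layer 2: a judge scores semantic quality on a 0..1 scale.
    score = judge_score(response)
    if score >= pass_threshold:
        return EvalResult("pass", "judge")
    if score <= fail_threshold:
        return EvalResult("fail", "judge")
    # Layer 3: scores between the thresholds go to human review.
    return EvalResult("needs_human_review", "judge")


# Stub layers for illustration only.
def non_empty(response: str) -> bool:
    return bool(response.strip())


def fixed_judge(score: float) -> Callable[[str], float]:
    # Returns a placeholder judge that always gives the same score.
    return lambda response: score


confident = layered_eval("A clear answer.", non_empty, fixed_judge(0.9))
ambiguous = layered_eval("A vague answer.", non_empty, fixed_judge(0.6))
```

The design choice worth noting is the gap between the two thresholds: it defines how much traffic reaches human reviewers, so it is a cost dial as much as a quality dial.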
Prerequisites
This module assumes familiarity with:
- Basic prompt engineering (Module 10)
- RAG system architecture (Module 11)
- Python and the Anthropic SDK
If you are new to AI engineering, read Modules 10 and 11 first.
What You Will Be Able to Do After This Module
By the end of this module, you will be able to:
- Design a multi-layered evaluation strategy appropriate for your system's risk level
- Build an offline evaluation suite that catches regressions before deployment
- Instrument online metrics that provide production ground truth
- Implement calibrated LLM judges with position and verbosity bias correction
- Curate evaluation datasets that represent real production input distributions
- Measure RAG-specific quality dimensions: retrieval precision, faithfulness, and answer grounding
- Build prompt regression tests that run in CI/CD
- Set up production monitoring dashboards that alert on quality degradation before users notice
Evaluation is not a one-time exercise - it is an ongoing infrastructure investment. The teams that build it well ship better products, find problems faster, and spend less time firefighting production incidents.
The Cost of Not Evaluating
Teams that skip structured evaluation pay a different cost: they find out about failures from users, not from tests. They ship regressions because they changed a prompt without knowing what they broke. They make product decisions based on vibes ("it feels better") rather than measurements. They burn engineering time in incident response for problems that a 30-minute evaluation run would have caught.
The irony is that evaluation feels expensive until you compare it to the cost of not evaluating. A well-instrumented evaluation pipeline - offline tests in CI, an LLM judge sampling production traffic, online metrics on a dashboard - costs roughly 2-5% of engineering time to maintain. The firefighting it prevents costs ten times more.
How to Use This Module
On a first pass, read the lessons in order - each one builds on the previous. Lesson 01 establishes the conceptual foundation. Lessons 02-03 cover the two most important evaluation strategies. Lessons 04-07 go deeper on dataset construction, metrics, and automation. The code examples in every lesson are production-ready - they use the Anthropic SDK and are designed to be adapted directly into your stack. If you are returning to the module with a specific problem, start with the lesson that maps to it, then work outward.
The Three Laws of AI Evaluation
Three rules, derived from hard experience in production AI systems, apply across every methodology in this module:
Law 1: Every metric is a proxy. There is no metric that perfectly captures "is this AI response good?" Every evaluation you build is measuring a proxy - something correlated with quality, not quality itself. The moment you forget this and start treating your metric as the thing, Goodhart's Law activates and your system starts optimizing the proxy at the expense of the goal.
Law 2: No single evaluation method is sufficient. Rule-based checks miss semantic failures. LLM judges have calibration biases. Human evaluation does not scale. Online signals lag. Each method has a blind spot. Robust evaluation requires triangulation: multiple methods measuring overlapping dimensions, with discrepancies treated as information rather than noise.
Law 3: Evaluation is a system, not a gate. The teams that treat evaluation as a pre-deployment gate - something you run before shipping and then ignore - build systems that degrade in production. Evaluation must be continuous: running in CI, sampling production traffic, alerting on drift, feeding failures back into the offline dataset. The goal is not a green dashboard at deploy time. The goal is a system that gets measurably better with each release cycle.
Every lesson in this module teaches one part of building that system.
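As one small piece of the continuous system described in Law 3, a CI quality gate can be sketched as a comparison between a candidate's pass rate and a stored baseline. The names `pass_rate` and `quality_gate`, the baseline value, and the 2-point tolerance are all hypothetical; they stand in for whatever your pipeline records.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of evaluation cases that passed."""
    if not results:
        raise ValueError("Cannot compute a pass rate from zero results")
    return sum(results) / len(results)


def quality_gate(candidate: list[bool], baseline_rate: float, max_drop: float = 0.02) -> bool:
    """Return True if the candidate may ship.

    The candidate is blocked when its pass rate falls more than
    `max_drop` below the baseline recorded for the current release.
    """
    return pass_rate(candidate) >= baseline_rate - max_drop


# Hypothetical run: 20 eval cases, baseline pass rate of 0.95.
regressed = [True] * 18 + [False] * 2   # 0.90 pass rate
steady = [True] * 19 + [False] * 1      # 0.95 pass rate

gate_regressed = quality_gate(regressed, baseline_rate=0.95)
gate_steady = quality_gate(steady, baseline_rate=0.95)
```

In a pipeline, a `False` result would typically translate into a non-zero exit code so the deploy step does not run. A fixed tolerance like this ignores sampling noise, so with small or non-deterministic eval sets a single failing case can flip the gate; treat the threshold as something to tune, not a given.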
Key Terms in This Module
| Term | Definition |
|---|---|
| Offline evaluation | Running evaluation on a static curated dataset before deployment |
| Online evaluation | Measuring quality on live production traffic after deployment |
| LLM-as-judge | Using a language model to score the outputs of another LLM |
| Calibration | Validating that a judge's scores align with human judgment |
| Position bias | A judge's tendency to prefer a response because of where it appears in a pairwise comparison (often the first-presented one) rather than because of its quality |
| Verbosity bias | A judge's tendency to score longer responses higher regardless of quality |
| Evaluation flywheel | The feedback loop that turns production failures into offline test cases |
| Goodhart's Law | When a measure becomes a target, it ceases to be a good measure |
| Inter-annotator agreement | The degree to which different human annotators give the same score, usually reported with a chance-corrected statistic such as Cohen's kappa rather than raw percent agreement |
| Distribution shift | The difference between your test dataset's input distribution and production's |
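The inter-annotator agreement entry mentions Cohen's kappa, which differs from raw percent agreement because it subtracts the agreement two annotators would reach by chance. The sketch below computes it for two annotators using only the standard library; the function name and the example labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled at random
    # according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used one identical label throughout.
    return (observed - expected) / (1 - expected)


# Illustrative labels only, not real annotation data.
annotator_1 = ["good", "good", "bad", "good", "bad", "bad", "good", "good"]
annotator_2 = ["good", "bad", "bad", "good", "bad", "good", "good", "good"]

raw_agreement = sum(a == b for a, b in zip(annotator_1, annotator_2)) / len(annotator_1)
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this example the annotators agree on 6 of 8 items (75%), yet kappa is only about 0.47, because with "good" as the majority label a fair amount of that agreement would occur by chance.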
Common Evaluation Mistakes (and How This Module Fixes Them)
Most teams starting with AI evaluation make at least one of five common mistakes. This module addresses all of them directly:
Mistake 1: String matching as the primary evaluator. Teams write assert "tracking number" in response and think they've evaluated quality. Lesson 01 shows exactly why this fails and what to use instead.
Mistake 2: Evaluating only on happy-path examples. Test datasets built from expected queries miss the long tail. Lessons 02 and 04 cover dataset diversity, adversarial cases, and how to use production data to close the gap.
Mistake 3: Using a judge without calibrating it. An uncalibrated LLM judge is a black box with unknown reliability. Lesson 03 covers the full calibration pipeline: human labels, Spearman correlation, bias testing, and ensemble techniques.
Mistake 4: Treating offline evaluation as the only source of truth. Offline evaluation is a necessary but not sufficient signal. Lesson 02 covers the complete online evaluation toolkit: A/B testing, shadow evaluation, implicit signals, and the evaluation flywheel.
Mistake 5: Evaluating once before launch and never again. AI systems degrade over time as production inputs shift. Lesson 07 covers continuous evaluation in CI/CD: automated quality gates, drift detection, and the monitoring infrastructure that keeps your system honest after deployment.
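The calibration step named under Mistake 3 relies on Spearman correlation between human labels and judge scores. As a rough sketch of what that measures, the code below ranks both score lists (averaging ranks for ties) and takes the Pearson correlation of the ranks. The helper names and the score lists are hypothetical; in practice a statistics library would usually do this for you.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs: list[float], ys: list[float]) -> float:
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("Need two equal-length lists with at least two items")
    return pearson(average_ranks(xs), average_ranks(ys))


# Illustrative 1-5 quality scores for the same seven responses.
human_scores = [1, 2, 3, 4, 5, 2, 4]
judge_scores = [2, 2, 3, 5, 5, 1, 4]
rho = spearman(human_scores, judge_scores)
```

A value near 1 means the judge orders responses much as the human labels do, even if its absolute scores differ. What counts as high enough depends on the risk level of the system and is a decision for the team, not something the statistic settles.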
