01 AI Evaluation - Module Overview
Why evaluation is the hardest unsolved problem in AI engineering - and how to approach it systematically.

02 Why AI Evaluation Is Hard
Understanding the fundamental gap between software testing and AI evaluation - non-determinism, no oracle, emergent failures, and how to build a multi-layered evaluation strategy.

03 Offline vs. Online Evaluation
Design an evaluation strategy that bridges static datasets and production signals - A/B testing, shadow evaluation, implicit signals, and the evaluation flywheel.

04 LLM-as-Judge
Build calibrated, bias-corrected LLM judges that approximate human judgment at scale - pointwise scoring, pairwise comparison, bias mitigation, and ensemble techniques.

05 Building Golden Datasets
Learn how to construct, annotate, validate, and maintain golden datasets that serve as the ground truth foundation for all AI system evaluation - covering annotation guidelines, inter-annotator agreement, adversarial generation, dataset versioning, and drift detection.

06 RAG-Specific Evaluation
Master the full evaluation stack for Retrieval-Augmented Generation systems - covering RAGAS metrics, hallucination type classification, citation accuracy, retrieval precision/recall/nDCG, and production-grade benchmarking with complete Python implementations.

07 Regression Testing for Prompts
Build a production-grade regression testing system for LLM prompts - covering test case design, LLM-as-judge pass/fail evaluation, flaky test detection, caching, differential testing, and CI gates that block regressions before they reach users.

08 Continuous Eval in CI/CD
Design and implement a full CI/CD pipeline for AI systems - covering PR-level linting, merge-level regression, pre-deployment evaluation gates, production monitoring with statistical process control, anomaly detection, automated rollback, and observability tracing from query to feedback.