8 docs tagged with "ai-evaluation"

Building Golden Datasets

Learn how to construct, annotate, validate, and maintain golden datasets that serve as the ground truth for AI system evaluation - covering annotation guidelines, inter-annotator agreement, adversarial generation, dataset versioning, and drift detection.

Continuous Eval in CI/CD

Design and implement a full CI/CD pipeline for AI systems - covering PR-level linting, merge-level regression, pre-deployment evaluation gates, production monitoring with statistical process control, anomaly detection, automated rollback, and observability tracing from query to feedback.

LLM-as-Judge

Build calibrated, bias-corrected LLM judges that approximate human judgment at scale - pointwise scoring, pairwise comparison, bias mitigation, and ensemble techniques.

Offline vs. Online Evaluation

Design an evaluation strategy that bridges static datasets and production signals - A/B testing, shadow evaluation, implicit signals, and the evaluation flywheel.

RAG-Specific Evaluation

Master the full evaluation stack for Retrieval-Augmented Generation systems - covering RAGAS metrics, hallucination type classification, citation accuracy, retrieval precision/recall/nDCG, and production-grade benchmarking with complete Python implementations.

Regression Testing for Prompts

Build a production-grade regression testing system for LLM prompts - covering test case design, LLM-as-judge pass/fail evaluation, flaky test detection, caching, differential testing, and CI gates that block regressions before they reach users.

Why AI Evaluation Is Hard

Understand the fundamental gap between software testing and AI evaluation - non-determinism, the lack of a test oracle, and emergent failures - and learn how to build a multi-layered evaluation strategy.