Counterfactual Evaluation
Evaluate new ML policies using logged data from an old policy - inverse propensity scoring, doubly robust estimators, and offline policy evaluation for when A/B tests are too expensive.
Build and operate ML experimentation infrastructure - assignment services, metric computation pipelines, analysis tools, and the engineering required to scale from 3 to 30 experiments per month.
Use interleaving to compare ranking models with 10-25x the sensitivity of A/B tests - the technique behind fast iteration at search and recommendation companies.
Learn how to design, run, and analyze experiments for ML systems - from statistical foundations to production experimentation platforms.
Use multi-armed bandit algorithms to adaptively allocate traffic during experiments - learning faster than A/B tests while reducing regret.
Design valid ML experiments by choosing the right randomization unit, handling network effects, detecting novelty effects, and managing holdout sets.
Run new ML models against live production traffic without affecting users - catching silent failures, latency regressions, and behavioral differences before go-live.
Learn the statistical machinery behind A/B testing - null hypotheses, p-values, power, sample size calculation, and the mistakes that invalidate ML experiments.