Artifact Management & Experiment Organization
Managing ML artifacts at scale - naming conventions, tagging, parent-child relationships, archival policies, and finding the one model that reached production among 2000 runs.
Build fully automated trigger-based model retraining pipelines - from drift detection through training to production deployment, with human-in-the-loop approval.
Understand why standard software CI/CD is insufficient for ML and what additional stages you need to catch real failures.
Systematic model comparison and selection - metric design, statistical significance testing, champion-challenger frameworks, and making defensible production promotion decisions.
Manage ML container images in CI/CD pipelines - registry choices, image tagging, multi-architecture builds, Trivy scanning, and environment promotion workflows.
Design continuous training systems that safely update models every few hours - covering CT maturity levels, warm-starting, failure modes, and monitoring.
Enforcing data quality agreements between producers and consumers - schema contracts with Pandera and Great Expectations, statistical contracts, SLA contracts, CI integration, and violation alerting.
Tracking dataset provenance, preventing train/val/test leakage, stratified splitting, dataset registries, and discovering the CV team's 12% accuracy inflation from augmentation leakage.
Delta Lake as ML data infrastructure - ACID transactions, time travel, schema evolution, Delta + MLflow integration, OPTIMIZE/Z-ordering, and handling schema changes without breaking pipelines.
Build a complete local ML development environment with Docker Compose - training, serving, feature store, and monitoring all running with a single command.
Learn Docker fundamentals from an ML perspective - why containers matter, how to write effective Dockerfiles, and how to manage ML model files in containers.
DVC in production - pointer files, remote storage, pipeline definitions (dvc.yaml), caching, dvc repro, CI/CD integration, and versioning 500GB datasets without bloating git.
Build a complete ML CI pipeline in GitHub Actions that triggers training only when training data or model code changes - not on every commit.
Build an enterprise-grade ML CI/CD pipeline in GitLab CI - from data commit to production deployment with DAG pipelines, GPU runners, and environments.
Build and run GPU-enabled containers for ML - covering NVIDIA Container Toolkit, CUDA compatibility, Kubernetes GPU scheduling, and debugging GPU access.
Systematic HPO - grid search, random search, Bayesian optimization with Optuna, Hyperband/ASHA pruning, and multi-objective optimization for production ML.
Why ML teams need Infrastructure as Code - reproducible environments, audit trails, cost control, and eliminating the manual infrastructure chaos that breaks ML at scale.
Production MLflow setup for teams - tracking server architecture, autologging, custom logging, model registry, nested runs for HPO, and scaling to 500+ experiments per week.
Learn how to use the MLflow Model Registry to manage model versions, stages, approval workflows, and webhooks for production ML teams.
Design automated model quality gates that block promotion when a model fails on demographic subgroups - not just on aggregate metrics.
Understand what a model registry is, why it exists, and how it brings order to the chaos of managing ML models in production.
Design versioning schemes for ML models that support safe rollbacks, A/B testing, champion/challenger management, and backward compatibility.
Systematic tracking of ML experiments - hyperparameters, metrics, artifacts, and models - so your team can reproduce results, compare runs, and ship better models faster.
Versioning datasets as first-class artifacts - DVC, Delta Lake, dataset lineage, data contracts, and managing ML datasets at scale.
Build CI/CD pipelines that catch ML-specific failures - not just broken code, but broken models.
Master Docker and containers for ML - from Dockerfiles to GPU containers, image optimization, and Docker Compose for reproducible ML development environments.
Understand what MLOps is, why it exists, and how to think about operationalizing machine learning systems in production.
Master Infrastructure as Code for ML systems - Terraform, Pulumi, GitOps, secret management, and cost optimization through declarative infrastructure.
Master the model registry - the system that brings order, traceability, and governance to every model your team ships to production.
Reduce ML Docker images from 8GB to under 1.5GB using multi-stage builds, slim bases, BuildKit cache mounts, and image scanning.
Learn the four layers of ML reproducibility - environment, data, code, and model - and how to achieve each in practice with Docker, DVC, MLflow, and seed management.
Use Istio service mesh to manage traffic routing across multiple ML model versions - canary deployments, A/B testing, circuit breakers, and telemetry.
Build a practical ML test suite from zero - covering the full pyramid from unit tests through model validation, without trying to test everything.
Understand the end-to-end MLOps lifecycle, maturity levels 0–3, the nine components of production ML, and why ML deployment is categorically different from software deployment.
W&B for production ML teams - run tracking, sweeps, artifact versioning, collaborative reports, alerts, and how it compares to MLflow.
The case for treating datasets as first-class versioned artifacts - regulatory requirements, reproducibility, drift detection, and the approaches to versioning (full copy, delta, pointer).
The business and technical case for tracking every ML experiment - what to track, why it matters, and what happens when you don't.