37 docs tagged with "devops"

Artifact Management & Experiment Organization

Managing ML artifacts at scale - naming conventions, tagging, parent-child relationships, archival policies, and finding the one model that reached production among 2,000 runs.

Automated Retraining Pipelines

Build automated, trigger-based model retraining pipelines - from drift detection through training to production deployment, with a human-in-the-loop approval step.

CI/CD for ML vs Software

Understand why standard software CI/CD is insufficient for ML and what additional stages you need to catch real failures.

Comparing and Selecting Models

Systematic model comparison and selection - metric design, statistical significance testing, champion-challenger frameworks, and making defensible production promotion decisions.

Container Registry and CI

Manage ML container images in CI/CD pipelines - registry choices, image tagging, multi-architecture builds, Trivy scanning, and environment promotion workflows.

Continuous Training

Design continuous training systems that safely update models every few hours - covering CT maturity levels, warm-starting, failure modes, and monitoring.

Data Contracts

Enforcing data quality agreements between producers and consumers - schema contracts with Pandera and Great Expectations, statistical contracts, SLA contracts, CI integration, and violation alerting.

Dataset Lineage and Management

Tracking dataset provenance, preventing train/val/test leakage, stratified splitting, dataset registries, and how a CV team discovered a 12% accuracy inflation caused by augmentation leakage.

Delta Lake and Iceberg for ML

Delta Lake as ML data infrastructure - ACID transactions, time travel, schema evolution, Delta + MLflow integration, OPTIMIZE/Z-ordering, and handling schema changes without breaking pipelines.

Docker Compose for ML Development

Build a complete local ML development environment with Docker Compose - training, serving, feature store, and monitoring all running with a single command.

Docker for ML

Learn Docker fundamentals from an ML perspective - why containers matter, how to write effective Dockerfiles, and how to manage ML model files in containers.

DVC: Data Version Control

DVC in production - pointer files, remote storage, pipeline definitions (dvc.yaml), caching, dvc repro, CI/CD integration, and versioning 500GB datasets without bloating git.

GitHub Actions for ML

Build a complete ML CI pipeline in GitHub Actions that triggers training only when training data or model code changes - not on every commit.

GitLab CI for ML

Build an enterprise-grade ML CI/CD pipeline in GitLab CI - from data commit to production deployment with DAG pipelines, GPU runners, and environments.

GPU Containers

Build and run GPU-enabled containers for ML - covering NVIDIA Container Toolkit, CUDA compatibility, Kubernetes GPU scheduling, and debugging GPU access.

Hyperparameter Optimization

Systematic HPO - grid search, random search, Bayesian optimization with Optuna, Hyperband/ASHA pruning, and multi-objective optimization for production ML.

IaC for ML Teams

Why ML teams need Infrastructure as Code - reproducible environments, audit trails, cost control, and eliminating the manual infrastructure chaos that breaks ML at scale.

MLflow Deep Dive

Production MLflow setup for teams - tracking server architecture, autologging, custom logging, model registry, nested runs for HPO, and scaling to 500+ experiments per week.

MLflow Model Registry in Production

Learn how to use the MLflow Model Registry to manage model versions, stages, approval workflows, and webhooks for production ML teams.

Model Evaluation Gates

Design automated model quality gates that block promotion when a model fails on demographic subgroups - not just on aggregate metrics.

Model Registry Concepts

Understand what a model registry is, why it exists, and how it brings order to the chaos of managing ML models in production.

Model Versioning Strategies

Design versioning schemes for ML models that support safe rollbacks, A/B testing, champion/challenger management, and backward compatibility.

Module 02: Experiment Tracking

Systematic tracking of ML experiments - hyperparameters, metrics, artifacts, and models - so your team can reproduce results, compare runs, and ship better models faster.

Module 03: Data Versioning

Versioning datasets as first-class artifacts - DVC, Delta Lake, dataset lineage, data contracts, and managing ML datasets at scale.

Module 05: CI/CD for ML

Build CI/CD pipelines that catch ML-specific failures - not just broken code, but broken models.

Module 06: Containerization

Master Docker and containers for ML - from Dockerfiles to GPU containers, image optimization, and Docker Compose for reproducible ML development environments.

Module 1 - MLOps Foundations

Understand what MLOps is, why it exists, and how to think about operationalizing machine learning systems in production.

Module 13 - Infrastructure as Code for ML

Master Infrastructure as Code for ML systems - Terraform, Pulumi, GitOps, secret management, and cost optimization through declarative infrastructure.

Module 4 - Model Registry and Lifecycle

Master the model registry - the system that brings order, traceability, and governance to every model your team ships to production.

Optimizing ML Docker Images

Reduce ML Docker images from 8GB to under 1.5GB using multi-stage builds, slim bases, BuildKit cache mounts, and image scanning.

Reproducibility in ML

Learn the four layers of ML reproducibility - environment, data, code, and model - and how to achieve each in practice with Docker, DVC, MLflow, and seed management.

Service Mesh for ML Serving

Use Istio service mesh to manage traffic routing across multiple ML model versions - canary deployments, A/B testing, circuit breakers, and telemetry.

Testing ML Code

Build a practical ML test suite from zero - covering the full test pyramid from unit tests through model validation, without trying to test everything.

The MLOps Lifecycle

Understand the end-to-end MLOps lifecycle, maturity levels 0–3, the nine components of production ML, and why ML deployment is categorically different from software deployment.

Weights & Biases Deep Dive

W&B for production ML teams - run tracking, sweeps, artifact versioning, collaborative reports, alerts, and how it compares to MLflow.

Why Data Versioning

The case for treating datasets as first-class versioned artifacts - regulatory requirements, reproducibility, drift detection, and the approaches to versioning (full copy, delta, pointer).

Why Experiment Tracking

The business and technical case for tracking every ML experiment - what to track, why it matters, and what happens when you don't.