110 docs tagged with "mlops"

Alerting and Incident Response for ML

ML-specific alerting design, alert taxonomy, routing with PagerDuty and OpsGenie, on-call runbook design for ML models, post-mortem templates, and reducing MTTD and MTTR for ML incidents.

Apache Airflow for ML

Learn how to use Apache Airflow to orchestrate production ML pipelines - DAG authoring, executors, XCom patterns, and avoiding the most common Airflow pitfalls.

Automated Retraining Pipelines

Build fully automated trigger-based model retraining pipelines - from drift detection through training to production deployment, with human-in-the-loop approval.

Autoscaling ML Workloads

Horizontal Pod Autoscaler, KEDA event-driven autoscaling for GPU metrics, zero-downtime rolling updates with readiness gates, and autoscaling patterns for production ML serving.

AWS SageMaker for MLOps

Master the complete AWS SageMaker ecosystem for end-to-end ML workflows - training jobs, pipelines, model registry, feature store, and production inference at scale.

Azure ML for MLOps

Master the Azure Machine Learning platform for enterprise ML workflows - workspaces, component-based pipelines, managed endpoints, MLflow integration, and responsible AI.

Build vs. Buy Economics for ML Tools

Economic analysis for ML tooling decisions - TCO framework, self-hosted vs. managed analysis, hidden costs of self-hosting, and a full financial case for W&B vs. MLflow.

Choosing an Orchestrator

A decision framework for selecting the right ML pipeline orchestrator - comparing Airflow, Prefect, Kubeflow Pipelines, Metaflow, ZenML, and Dagster across team size, maturity, and infrastructure requirements.

CI/CD for ML vs Software

Understand why standard software CI/CD is insufficient for ML and what additional stages you need to catch real failures.

Cloud FinOps for ML

Financial operations for ML cloud spend - FinOps maturity model, reserved instances, spot strategy, multi-account cost attribution, and ML budget forecasting.

Cloud ML Cost Optimization

Master cloud cost management for ML workloads - spot instance strategies, storage optimization, inference cost reduction, FinOps tooling, and real-world cost reduction from $80K to $31K/month.

Comparing and Selecting Models

Systematic model comparison and selection - metric design, statistical significance testing, champion-challenger frameworks, and making defensible production promotion decisions.

Container Registry and CI

Manage ML container images in CI/CD pipelines - registry choices, image tagging, multi-architecture builds, Trivy scanning, and environment promotion workflows.

Continuous Training

Design continuous training systems that safely update models every few hours - covering CT maturity levels, warm-starting, failure modes, and monitoring.

Cost Attribution and Accountability

Making ML teams own their costs - tagging strategy, per-model cost dashboards, chargeback model design, cost anomaly detection, and engineering incentives for cost efficiency.

Counterfactual Evaluation

Evaluate new ML policies using logged data from an old policy - inverse propensity scoring, doubly robust estimators, and offline policy evaluation for when A/B tests are too expensive.

Data Contracts

Enforcing data quality agreements between producers and consumers - schema contracts with Pandera and Great Expectations, statistical contracts, SLA contracts, CI integration, and violation alerting.

Data Drift Detection

Detecting when input data distributions change in production - KS test, PSI, chi-squared, Wasserstein distance, MMD, univariate vs. multivariate drift, reference window selection, and EvidentlyAI.

Databricks for MLOps

Master the Databricks Lakehouse platform for ML - Delta Lake, Unity Catalog, Feature Store, MLflow Model Registry, Model Serving, and Spark-scale feature pipelines for production ML.

Dataset Lineage and Management

Tracking dataset provenance, preventing train/val/test leakage, stratified splitting, dataset registries, and discovering the CV team's 12% accuracy inflation from augmentation leakage.

Delta Lake and Iceberg for ML

Delta Lake as ML data infrastructure - ACID transactions, time travel, schema evolution, Delta + MLflow integration, OPTIMIZE/Z-ordering, and handling schema changes without breaking pipelines.

Docker Compose for ML Development

Build a complete local ML development environment with Docker Compose - training, serving, feature store, and monitoring all running with a single command.

Docker for ML

Learn Docker fundamentals from an ML perspective - why containers matter, how to write effective Dockerfiles, and how to manage ML model files in containers.

DVC: Data Version Control

DVC in production - pointer files, remote storage, pipeline definitions (dvc.yaml), caching, dvc repro, CI/CD integration, and versioning 500GB datasets without bloating git.

Environment Parity

Solve the dev/staging/prod parity problem for ML - feature skew, infrastructure differences, data drift, and environment promotion pipelines that prevent production surprises.

Experimentation Platforms

Build and operate ML experimentation infrastructure - assignment services, metric computation pipelines, analysis tools, and the engineering required to scale from 3 to 30 experiments per month.

Explainability in Production

Serving model explanations alongside predictions - SHAP for production, Anchors for rule-based explanations, explanation as a service, debugging production failures with explanations, and regulatory compliance.

Feature Monitoring in Production

Monitoring features after deployment - PSI, KS tests, freshness monitoring, completeness tracking, and proving to a regulator that no feature's PSI exceeded 0.10.

Feature Selection and Importance

Reducing 500 features to 50 without losing model performance - filter, wrapper, and embedded methods, SHAP-based selection, and leakage detection.

Feature Stores in Production

Architecture and operations of feature stores - offline and online layers, point-in-time joins, and avoiding the training-serving skew that costs you accuracy.

Feature Validation and Testing

Ensuring feature quality through schema validation, unit tests, integration tests, and monitoring - catching the NaN bug before it degrades your model for 3 weeks.

Fine-Tuning Ops

Operationalize LLM fine-tuning at scale - data pipelines, LoRA adapter management, adapter registries, and serving 50 customer-specific adapters efficiently.

GitHub Actions for ML

Build a complete ML CI pipeline in GitHub Actions that triggers training only when training data or model code changes - not on every commit.

GitLab CI for ML

Build an enterprise-grade ML CI/CD pipeline in GitLab CI - from data commit to production deployment with DAG pipelines, GPU runners, and environments.

GitOps for ML

Apply GitOps principles to ML infrastructure - Flux CD, ArgoCD, image update automation, secrets management, and PR-gated model deployments with Argo Rollouts.

Google Vertex AI for MLOps

Master the complete Google Vertex AI platform for end-to-end ML workflows - Pipelines, Training, Prediction, Feature Store, Model Registry, Experiments, and production deployment on GCP.

GPU Containers

Build and run GPU-enabled containers for ML - covering NVIDIA Container Toolkit, CUDA compatibility, Kubernetes GPU scheduling, and debugging GPU access.

GPU Scheduling in Kubernetes

GPU resource management in Kubernetes - NVIDIA device plugin, MIG, time-slicing, node affinity, GPU quotas per namespace, and DCGM monitoring for ML clusters.

Helm for ML Deployments

Helm charts for ML applications - chart anatomy, parameterizing ML deployments, environment values files, lifecycle hooks for model validation, and umbrella charts for multi-component stacks.

Hyperparameter Optimization

Systematic HPO - grid search, random search, Bayesian optimization with Optuna, Hyperband/ASHA pruning, and multi-objective optimization for production ML.

IaC for ML Teams

Why ML teams need Infrastructure as Code - reproducible environments, audit trails, cost control, and eliminating the manual infrastructure chaos that breaks ML at scale.

IaC Patterns for ML Platforms

Production IaC patterns for ML platform engineering - golden paths, blue-green infrastructure, self-destructing experiment environments, OPA policies, GPU quota management, and the internal developer platform model.

Inference Cost Optimization

Reducing ML serving costs at scale - quantization ROI, batching economics, instance right-sizing, caching strategies, and LLM cost-per-token analysis.

Infrastructure Monitoring for ML Systems

Monitoring the infrastructure layer of ML systems - CPU, GPU, memory, latency, the four monitoring layers, custom ML metrics with Prometheus, and building the observability foundation for model quality monitoring.

Interleaving Experiments

Use interleaving to compare ranking models with 10-25x better sensitivity than A/B tests - the technique behind fast iteration at search and recommendation companies.

KServe and Kubernetes ML Operators

Custom Kubernetes operators for ML workflows - what operators enable, KServe for standardized model serving, Seldon Core, the Kubeflow Training Operator, Argo Workflows, and when to build vs. use existing operators.

Kubeflow Pipelines

Building, compiling, and running production ML pipelines on Kubernetes using Kubeflow Pipelines v2 with MLMD metadata tracking and automatic retraining triggers.

Kubernetes Fundamentals for ML Engineers

The minimum Kubernetes knowledge every ML engineer needs to be productive - pods, deployments, services, resource requests, GPU allocation, probes, and persistent volumes.

LLM Evaluation Pipelines

Build automated evaluation pipelines for LLM systems - LLM-as-judge, RAGAS for RAG systems, trajectory evaluation for agents, regression testing, and eval dataset curation.

Logging for ML Systems

Structured logging for ML systems - prediction logging for delayed evaluation, structured JSON logs, audit logs for regulated models, log aggregation with Loki and Elasticsearch, and tracing individual prediction failures.

Metaflow

Building scalable, reproducible ML workflows with Netflix's Metaflow - the flow-step model, cloud compute with @batch and @kubernetes, and Cards for documentation.

ML Infrastructure Cost Model

Understanding what drives ML costs - building a cost-per-request model for your ML system from scratch, and computing unit economics the CTO will believe.

ML Pipeline Orchestration Concepts

Understand the fundamental concepts behind ML pipeline orchestration - DAGs, dependency management, idempotency, and why cron jobs are a silent disaster for production ML.

MLflow Deep Dive

Production MLflow setup for teams - tracking server architecture, autologging, custom logging, model registry, nested runs for HPO, and scaling to 500+ experiments per week.

MLOps and Production - Engineering Track

A structured, production-grade MLOps curriculum - experiment tracking, CI/CD for ML, Kubernetes, monitoring, LLMOps, infrastructure as code, and cost management.

MLOps vs DevOps

How MLOps extends DevOps principles to handle the unique challenges of data, model quality, and concept drift that traditional software CI/CD cannot address.

Model Cards and Documentation

How to write, automate, and maintain model cards that document model capabilities, limitations, training data, fairness evaluations, and regulatory compliance.

Model Evaluation Gates

Design automated model quality gates that block promotion when a model fails on demographic subgroups - not just on aggregate metrics.

Model Performance Monitoring

Monitoring model quality in production - the ground truth delay problem, proxy metrics, shadow evaluation, cohort-based monitoring, SLOs for model quality, and detecting degradation before it hurts the business.

Model Registry Concepts

Understand what a model registry is, why it exists, and how it brings order to the chaos of managing ML models in production.

Model Rollback Strategies

Designing fast, reliable model rollback procedures for when production models degrade - covering registry-based rollback, infrastructure rollback, and automated rollback controllers.

Model Staging and Promotion

How to safely gate model promotion through staging, production, and archiving with automated checks and human approval workflows.

Model Versioning Strategies

Design versioning schemes for ML models that support safe rollbacks, A/B testing, champion/challenger management, and backward compatibility.

Module 02: Experiment Tracking

Systematic tracking of ML experiments - hyperparameters, metrics, artifacts, and models - so your team can reproduce results, compare runs, and ship better models faster.

Module 03: Data Versioning

Versioning datasets as first-class artifacts - DVC, Delta Lake, dataset lineage, data contracts, and managing ML datasets at scale.

Module 05: CI/CD for ML

Build CI/CD pipelines that catch ML-specific failures - not just broken code, but broken models.

Module 06: Containerization

Master Docker and containers for ML - from Dockerfiles to GPU containers, image optimization, and Docker Compose for reproducible ML development environments.

Module 1 - MLOps Foundations

Understand what MLOps is, why it exists, and how to think about operationalizing machine learning systems in production.

Module 10: Cloud ML Platforms

Master AWS SageMaker, Google Vertex AI, Azure ML, Databricks, and cloud cost optimization strategies for production ML systems.

Module 12 - LLMOps Pipelines

Operationalize LLM-based systems - prompt management, evaluation pipelines, observability, RAG operations, and fine-tuning infrastructure.

Module 15 - Cost Management for ML

Financial operations for ML systems - understanding costs, optimizing training and inference, cloud FinOps, build vs. buy analysis, and cost attribution.

Module 8 - Kubernetes for ML

A complete guide to running machine learning workloads on Kubernetes, from fundamentals to GPU scheduling, training jobs, model serving, Helm, and multi-tenant clusters.

Module 9 - Monitoring and Observability

Complete ML monitoring and observability - data drift detection, model performance monitoring, Prometheus/Grafana for ML, distributed tracing, alerting, and production monitoring tools like EvidentlyAI and NannyML.

Multi-Armed Bandits

Use multi-armed bandit algorithms to adaptively allocate traffic during experiments - learning faster than A/B tests while reducing regret.

Numerical and Categorical Features

Systematic feature engineering for tabular data - transformations, encoding, imputation, and selection that lifted AUC from 0.71 to 0.84.

Online Controlled Experiments

Design valid ML experiments by choosing the right randomization unit, handling network effects, detecting novelty, and managing holdout sets.

Optimizing ML Docker Images

Reduce ML Docker images from 8GB to under 1.5GB using multi-stage builds, slim bases, BuildKit cache mounts, and image scanning.

Pods, Deployments, and Services - Deep Dive

Master the three core Kubernetes workload primitives for ML engineers - stateless serving with Deployments, traffic routing with Services, and advanced pod patterns for ML.

Prefect

Building and deploying production ML workflows using Prefect 2.x/3.x - flows, tasks, deployments, work pools, and observability.

Prometheus and Grafana for ML

Building production ML observability infrastructure - Prometheus architecture, custom ML metrics, PromQL for ML, Grafana dashboard design for model serving, and scaling with Thanos for long-term storage.

Prompt Management

Treat prompts as production artifacts - versioning, registry design, testing frameworks, A/B testing prompts, automated optimization with DSPy, and prompt governance.

Pulumi for ML

Write ML infrastructure in real Python - Pulumi's code-first approach, component resources, Automation API, and testing with pytest for reproducible ML platforms.

RAG Pipeline Ops

Operate RAG pipelines in production - index refresh strategies, chunk strategy updates, embedding drift detection, vector database monitoring, and quality tracking.

Reproducibility in ML

Learn the four layers of ML reproducibility - environment, data, code, and model - and how to achieve each in practice with Docker, DVC, MLflow, and seed management.

Service Mesh for ML Serving

Use Istio service mesh to manage traffic routing across multiple ML model versions - canary deployments, A/B testing, circuit breakers, and telemetry.

Shadow Mode Testing

Run new ML models against live production traffic without affecting users - catching silent failures, latency regressions, and behavioral differences before go-live.

Statistical Foundations for A/B Testing

Learn the statistical machinery behind A/B testing - null hypotheses, p-values, power, sample size calculation, and the mistakes that invalidate ML experiments.

Technical Debt in ML Systems

The seven categories of hidden technical debt unique to machine learning systems - including entanglement, hidden feedback loops, pipeline jungles, and configuration debt - and how to detect and remediate them.

Terraform for ML Infrastructure

Build complete ML platforms with Terraform - GPU clusters, MLflow, EKS, feature stores, and model registries using production-grade HCL modules.

Terraform Fundamentals

Master Terraform core concepts - providers, resources, state management, modules, and the plan/apply lifecycle for building reproducible ML infrastructure.

Testing ML Code

Build a practical ML test suite from zero - covering the full pyramid from unit tests through model validation, without trying to test everything.

Text Features for ML

Turning text into ML features - from TF-IDF baselines to embedding-based representations that improved e-commerce search NDCG by 18%.

The ML Lifecycle

The complete end-to-end lifecycle of a machine learning model, from problem definition through deployment, monitoring, and eventual retirement - with feedback loops, governance, and retraining triggers.

The MLOps Lifecycle

Understand the end-to-end MLOps lifecycle, maturity levels 0–3, the nine components of production ML, and why ML deployment is categorically different from software deployment.

Time-Series Features

Feature engineering for temporal data - lag features, rolling statistics, Fourier seasonality, and preventing temporal leakage that destroys production forecasts.

Token Cost Monitoring

Monitor and control LLM API costs in production - cost-per-request dashboards, budget alerts, token efficiency optimization, cost attribution by feature and user, and anomaly detection.

Training Cost Optimization

Reducing ML training costs systematically - spot instances, mixed precision, gradient checkpointing, compute-optimal training (Chinchilla), and distributed training overhead.

Training Jobs on Kubernetes

Running ML training on Kubernetes - Jobs, CronJobs, PyTorchJob and TFJob with the Training Operator, fault tolerance, checkpoint-based recovery, spot node handling, and distributed training patterns.

Weights & Biases Deep Dive

W&B for production ML teams - run tracking, sweeps, artifact versioning, collaborative reports, alerts, and how it compares to MLflow.

Why Data Versioning

The case for treating datasets as first-class versioned artifacts - regulatory requirements, reproducibility, drift detection, and the approaches to versioning (full copy, delta, pointer).

Why Experiment Tracking

The business and technical case for tracking every ML experiment - what to track, why it matters, and what happens when you don't.

ZenML

Building portable, stack-agnostic MLOps pipelines with ZenML - stacks, steps, materializers, and seamless local-to-cloud migration with MLflow and Vertex AI.