Module 10: Cloud ML Platforms

Every ML team eventually faces the same question: where does the model actually run? On whose servers, using whose networking, paying whose bill? The cloud platforms module answers that question - not with marketing copy, but with architecture diagrams, real cost calculations, and hard-won patterns from teams that have moved petabytes of data and trained thousands of models across AWS, GCP, Azure, and Databricks.

What This Module Covers

Lessons in This Module

| # | Lesson | What You'll Learn |
|---|--------|-------------------|
| 01 | AWS SageMaker | End-to-end ML on AWS: training jobs, pipelines, model registry, real-time and batch inference |
| 02 | Google Vertex AI | Vertex AI platform: managed training, TPU access, Feature Store, Matching Engine |
| 03 | Azure ML | Azure ML workspace: compute clusters, pipelines, CLI v2, DevOps integration |
| 04 | Databricks for MLOps | Lakehouse architecture, Delta Lake, MLflow, Unity Catalog, Feature Store |
| 05 | Cloud Cost Optimization | Spot instances, reserved capacity, right-sizing, multi-cloud cost arbitrage |

Key Concepts You'll Master

  • Managed vs self-managed: When to use a managed ML platform vs rolling your own on Kubernetes
  • Training infrastructure: Distributed training, spot/preemptible instances, checkpointing strategies
  • Serving infrastructure: Real-time endpoints, batch transform, serverless inference, autoscaling
  • Feature stores: Online vs offline stores, the training-serving skew problem they solve
  • Model registries: Versioning, lineage, approval workflows, deployment targets
  • Cost levers: The 5 decisions that determine 80% of your cloud ML bill
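To make the checkpointing idea concrete, here is a minimal sketch of a training loop that survives preemption on a spot instance. It is a toy under stated assumptions, not any platform's API: the function names (`save_checkpoint`, `load_checkpoint`, `train`), the JSON state format, and the `stop_at` parameter that simulates the instance being reclaimed are all hypothetical. A real job would write framework checkpoints to durable storage such as an object store rather than local disk.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a preemption mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0.0}


def train(path, total_steps, checkpoint_every, stop_at=None):
    """Toy training loop; stop_at (hypothetical) simulates a spot reclaim."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if stop_at is not None and state["step"] == stop_at:
            return state  # simulated preemption: work since the last save is lost
        state["total"] += state["step"]  # stand-in for one optimizer step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # always persist the final state
    return state
```

Calling `train` again with the same path after an interruption resumes from the last saved step instead of step 0. The checkpoint interval is the tradeoff to tune: frequent saves cost I/O time, infrequent saves mean more repeated work after each preemption.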

Why Cloud Platforms Matter for MLOps

Running ML in production is not the same as running ML on a laptop. The cloud platforms covered in this module provide managed infrastructure for every stage of the ML lifecycle, but they come with their own complexity, pricing models, and lock-in tradeoffs. Understanding these platforms in depth lets you make architectural decisions that will affect your team's productivity and your company's costs for years.

:::tip Platform Selection Reality
Most teams don't choose their ML platform - they inherit it from their company's existing cloud contract. Understanding all three major platforms (AWS, GCP, Azure) plus Databricks gives you the vocabulary to contribute to these decisions and optimize within whatever environment you land in.
:::

Prerequisites

  • Familiarity with Docker containers and container registries
  • Basic understanding of ML training and inference workflows
  • Completion of Module 8 (ML Pipelines) recommended
  • AWS/GCP/Azure free tier accounts for hands-on exercises
© 2026 EngineersOfAI. All rights reserved.