Advanced Spark Performance Tuning for ML Workloads
Systematic techniques to diagnose and eliminate Spark bottlenecks - data skew, shuffle overhead, memory pressure, and suboptimal joins - to cut job time and cost by up to 10x.
Orchestrate ML training pipelines with Airflow - data quality gates, KubernetesPodOperator training, champion/challenger evaluation, and conditional deployment.
Statistical anomaly detection for data drift, schema drift, and volume changes.
Deep dive into Apache Airflow - DAGs, Scheduler internals, Executors, Operators, XCom, and production patterns for reliable pipeline orchestration.
Apache Flink for stateful stream processing - DataStream API, windows, watermarks, state backends, checkpointing, and PyFlink for ML feature computation.
Hudi's copy-on-write vs merge-on-read and upsert patterns.
Iceberg table format, ACID transactions, schema evolution, and time travel.
A deep dive into Kafka's distributed commit log, partitions, replication, consumer groups, compacted topics, and the architectural decisions that make it the standard event transport for production ML systems.
How Spark's distributed execution model works - RDDs, DataFrames, DAG planning, Catalyst optimization, and Tungsten execution - explained for engineers building ML data pipelines.
S3, Glue, Athena, EMR, and the AWS data engineering ecosystem.
How to orchestrate complex batch ML pipelines with Airflow and modern alternatives, eliminating cron's silent failures, missing dependencies, and lack of visibility.
What Airflow, Prefect, Dagster, and Temporal each do for AI systems, when your ML pipeline complexity and team maturity dictate which orchestrator fits best, and how to apply a structured decision framework to select the right tool for production AI data pipelines.
How to reason about the latency-throughput-cost triangle, diagnose expensive Spark jobs, optimize cloud data costs with partitioning and caching, and fix data skew that silently kills pipeline performance.
Building custom monitoring with Great Expectations and statistical tests.
Asset-based orchestration, software-defined assets, and Dagster's lineage model.
Apache Atlas, DataHub, Amundsen - cataloguing data for ML teams.
The data infrastructure foundation for AI/ML systems - batch and stream processing, feature stores, data lakehouse, pipeline orchestration, and real-time feature engineering.
The complete Python toolkit for data engineering - pandas memory optimization, PyArrow columnar processing, DuckDB analytical SQL, Polars lazy evaluation, and pipeline testing with pandera.
What column-level security, data lineage, and cataloguing do for AI systems, when regulated AI training data requires auditability and access controls across the lakehouse, and how to implement governance with Apache Atlas and Unity Catalog in production AI data pipelines.
Runbooks, on-call rotations, and root cause analysis for data incidents.
What each storage architecture does for AI systems, when ML teams need both raw unstructured data and structured query access on the same platform, and how to choose and implement the right architecture in production AI data pipelines.
Column-level lineage, impact analysis, and tools like OpenLineage and DataHub.
How to design data models for machine learning - point-in-time correctness, entity-centric tables, SCD Type 2, label leakage prevention, and the training-serving skew problem.
ETL vs ELT, Lambda vs Kappa architecture, idempotency, exactly-once semantics, backfill strategies, watermarking for late data, and how to design pipelines that reliably serve both model training and real-time inference.
What query optimisation, storage tiering, and cloud cost controls do for AI systems, when large-scale model training and feature computation drive unpredictable cloud spend, and how to implement cost reduction strategies in production AI data pipelines.
A deep engineering dive into the five dimensions of data quality - completeness, accuracy, consistency, timeliness, and uniqueness - and how a failure in each one silently corrupts AI systems in production.
How poor data quality degrades ML model performance - detection and remediation.
Why serialization format is an architectural decision - JSON vs Protocol Buffers vs Avro, schema evolution strategies, and how Confluent Schema Registry prevents breaking production pipelines.
Defining data SLAs, monitoring, alerting, and runbooks for data incidents.
Databricks Lakehouse, Unity Catalog, MLflow integration, and AutoML.
Advanced dbt techniques for large-scale ML pipelines - snapshots for SCD2, point-in-time correct features, slim CI, dbt-utils macros, and production deployment patterns.
How dbt brings lineage, testing, documentation, and version control to SQL-based ML data pipelines, replacing fragile cron-driven script chains.
Schema tests, custom tests, and data quality gates in dbt pipelines.
Delta Lake on Databricks, merge operations, and Change Data Capture.
Storing and serving dense embeddings at scale for real-time recommendation and search.
Ensuring identical features between training (offline) and serving (online).
Detecting feature drift, staleness, and coverage gaps in production.
Online store, offline store, feature registry, and the dual-write pattern.
What freshness, distribution, volume, schema, and lineage tracking do for AI systems, when data drift and pipeline failures silently corrupt model inputs and degrade predictions, and how to instrument these five pillars in production AI data pipelines.
BigQuery architecture, built-in ML functions, and BigQuery ML.
Writing data expectations, validations, and building a data quality suite.
Using Apache Kafka as the backbone of production ML systems - schema registry, CDC, exactly-once semantics, and dead letter queues.
A comprehensive comparison of Kafka Streams, Faust, and Apache Flink for building real-time ML feature pipelines, with a production decision framework and working code examples.
Storing training datasets, experiment artifacts, and model outputs in a lakehouse.
Trino, DuckDB, Spark SQL - querying open table formats at scale.
Redis, Cassandra, and in-memory stores for sub-millisecond feature retrieval.
Eight lessons covering Apache Kafka, Apache Flink, stream processing patterns, real-time feature computation, and production reliability for ML systems that cannot tolerate batch latency.
Monte Carlo, Bigeye, and Soda - managed data observability.
What multi-cloud data architectures do for AI systems, when vendor lock-in and data gravity risks threaten the portability of ML training and serving infrastructure, and how to design resilient multi-cloud strategies for production AI data pipelines.
The fundamental split between pre-computed offline and real-time online features.
What dynamic DAGs, sensors, and fan-out/fan-in patterns do for AI systems, when ML workflows require data-aware scheduling and conditional branching across training and serving stages, and how to apply these patterns in production AI data pipelines.
Overview of cloud data platforms for AI and ML workloads.
Module overview for Pipeline Orchestration - turning ad-hoc scripts into reliable, observable, recoverable production data pipelines.
Overview of real-time feature engineering for low-latency ML systems.
Time-travel queries, point-in-time joins, and preventing data leakage.
Prefect orchestration deep dive - flows, tasks, deployments, work pools, automations, and a direct comparison with Apache Airflow.
Case studies in real-time feature engineering from Uber, Twitter, and LinkedIn.
Windowed aggregations, sessionisation, and user behaviour features in real time.
How to build streaming feature pipelines that compute fresh ML features at production scale, including dual-store architecture, training-serving skew prevention, and hot key mitigation.
POS data streams, customer data platform architecture, real-time feature computation with Flink, medallion data lake architecture for retail, privacy compliance, and event streaming pipelines for retail ML.
Snowflake architecture, Snowpark, and ML feature serving from Snowflake.
Building production ML feature pipelines with PySpark - window functions, Pandas UDFs, MLlib Pipelines, point-in-time joins, and Delta Lake integration.
Writing production SQL for 10-billion-row datasets - partition pruning, window functions, approximate aggregates, BigQuery optimization, DuckDB, and Spark SQL patterns for ML feature preparation.
Why Parquet, Avro, ORC, and Delta Lake exist, how columnar storage enables fast ML pipelines, and how to tune storage formats for maximum throughput and minimum cost.
Seven production design patterns for streaming ML pipelines - including stream enrichment, stream-stream joins, CDC to feature store, streaming inference, feedback loops, and exactly-once end-to-end.
Computing features from event streams with Kafka and Flink.
The fundamental theory of stream processing - event time, processing time, watermarks, windowing, delivery semantics, and backpressure - through the lens of ML systems that cannot afford batch latency.
How to build streaming ML pipelines that survive failures, handle schema changes, implement dead letter queues, replay events, and monitor themselves - so your fraud model never runs on 3-hour-old features again.
Unit testing DAGs, SLA monitoring, and alerting on pipeline failures.
How to test batch ML data pipelines with unit tests, integration tests, and data quality checks - catching label leakage, schema drift, and idempotency bugs before they corrupt your models.
What data engineers actually do in AI organizations, how data flows from raw sources to model serving, and when the data layer becomes the bottleneck for machine learning.
Training-serving skew, feature reuse, and the operational challenges feature stores solve.