8 docs tagged with "data-infrastructure"

Batch Processing with Spark for ML Pipelines

How Apache Spark processes terabyte-scale training data - architecture, DataFrames, partitioning, joins, and integration with Delta Lake for ML feature engineering.

Data Lake and Data Warehouse for ML

The evolution from database to data lake to lakehouse - when to use each storage architecture for ML training data, feature engineering, and model serving.

Data Quality and Validation for ML

Why data quality is the number-one cause of ML failures in production - Great Expectations, data contracts, Population Stability Index (PSI) distribution monitoring, and pipeline quality gates.

Data Versioning with Delta Lake

ACID transactions, time travel, schema evolution, and training data versioning with Delta Lake - building reproducible ML pipelines on object storage.

Feature Store Architecture

How feature stores solve training-serving skew with a dual-store architecture - offline store for training, online store for serving, and point-in-time correct retrieval.

Lakehouse Architecture for ML

Lakehouse architecture for ML systems - Delta Lake, Apache Iceberg, Apache Hudi, medallion architecture, query engines, and ML pipelines on the lakehouse.

Module 2 - Data Infrastructure

A complete map of the Data Infrastructure module covering data lakes, Spark, Kafka, feature stores, data quality, Delta Lake, and lakehouse architecture for ML.

Stream Processing with Kafka for Real-Time ML

How Apache Kafka and Apache Flink enable real-time ML features - topics, consumer groups, exactly-once semantics, streaming feature computation, and architecture patterns.