72 docs tagged with "data-engineering"

Airflow for ML Pipelines

Orchestrate ML training pipelines with Airflow - data quality gates, KubernetesPodOperator training, champion/challenger evaluation, and conditional deployment.

Apache Airflow Architecture

Deep dive into Apache Airflow - DAGs, Scheduler internals, Executors, Operators, XCom, and production patterns for reliable pipeline orchestration.

Apache Flink Fundamentals

Apache Flink for stateful stream processing - DataStream API, windows, watermarks, state backends, checkpointing, and PyFlink for ML feature computation.

Apache Hudi

Hudi's copy-on-write vs merge-on-read and upsert patterns.

Apache Iceberg

Iceberg table format, ACID transactions, schema evolution, and time travel.

Apache Spark Architecture

How Spark's distributed execution model works - RDDs, DataFrames, DAG planning, Catalyst optimization, and Tungsten execution - explained for engineers building ML data pipelines.

Choosing an Orchestrator for Your AI Data Stack

What Airflow, Prefect, Dagster, and Temporal each do for AI systems, when your ML pipeline complexity and team maturity dictate which orchestrator fits best, and how to apply a structured decision framework to select the right tool for production AI data pipelines.

Data Engineering for AI

The data infrastructure foundation for AI/ML systems - batch and stream processing, feature stores, data lakehouse, pipeline orchestration, and real-time feature engineering.

Data Engineering with Python

The complete Python toolkit for data engineering - pandas memory optimization, PyArrow columnar processing, DuckDB analytical SQL, Polars lazy evaluation, and pipeline testing with pandera.

Data Governance for AI Training Datasets

What column-level security, data lineage, and cataloguing do for AI systems, when regulated AI training data requires auditability and access controls across the lakehouse, and how to implement governance with Apache Atlas and Unity Catalog in production AI data pipelines.

Data Lake vs Warehouse vs Lakehouse for AI Workloads

What each storage architecture does for AI systems, when ML teams need both raw unstructured data and structured query access on the same platform, and how to choose and implement the right architecture in production AI data pipelines.

Data Lineage

Column-level lineage, impact analysis, and tools like OpenLineage and DataHub.

Data Modelling for ML

How to design data models for machine learning - point-in-time correctness, entity-centric tables, SCD Type 2, label leakage prevention, and the training-serving skew problem.

Data Pipeline Patterns for AI/ML Workflows

ETL vs ELT, Lambda vs Kappa architecture, idempotency, exactly-once semantics, backfill strategies, watermarking for late data, and how to design pipelines that reliably serve both model training and real-time inference.

Data Platform Cost Optimisation for AI Teams

What query optimisation, storage tiering, and cloud cost controls do for AI systems, when large-scale model training and feature computation drive unpredictable cloud spend, and how to implement cost reduction strategies in production AI data pipelines.

Data Quality for ML

How poor data quality degrades ML model performance, and how to detect and remediate it.

Data Serialization and Schemas

Why serialization format is an architectural decision - JSON vs Protocol Buffers vs Avro, schema evolution strategies, and how Confluent Schema Registry prevents breaking production pipelines.

Databricks

Databricks Lakehouse, Unity Catalog, MLflow integration, and AutoML.

dbt Advanced Patterns for ML Teams

Advanced dbt techniques for large-scale ML pipelines - snapshots for SCD2, point-in-time correct features, slim CI, dbt-utils macros, and production deployment patterns.

dbt for ML Feature Preparation

How dbt brings lineage, testing, documentation, and version control to SQL-based ML data pipelines, replacing fragile cron-driven script chains.

Delta Lake

Delta Lake on Databricks, merge operations, and Change Data Capture.

Embedding Stores

Storing and serving dense embeddings at scale for real-time recommendation and search.

Feature Consistency

Ensuring identical features between training (offline) and serving (online).

Five Pillars of Data Observability for ML Systems

What freshness, distribution, volume, schema, and lineage tracking do for AI systems, when data drift and pipeline failures silently corrupt model inputs and degrade predictions, and how to instrument these five pillars in production AI data pipelines.

Google BigQuery

BigQuery architecture, built-in ML functions, and BigQuery ML.

Great Expectations

Writing data expectations, validations, and building a data quality suite.

Kafka for ML Systems

Using Apache Kafka as the backbone of production ML systems - schema registry, CDC, exactly-once semantics, and dead letter queues.

Module 3: Stream Processing for Real-Time AI

Eight lessons covering Apache Kafka, Apache Flink, stream processing patterns, real-time feature computation, and production reliability for ML systems that cannot tolerate batch latency.

Multi-Cloud Data Strategies for AI Workloads

What multi-cloud data architectures do for AI systems, when vendor lock-in and data gravity risks threaten the portability of ML training and serving infrastructure, and how to design resilient multi-cloud strategies for production AI data pipelines.

Orchestration Patterns for End-to-End ML Pipelines

What dynamic DAGs, sensors, and fan-out/fan-in patterns do for AI systems, when ML workflows require data-aware scheduling and conditional branching across training and serving stages, and how to apply these patterns in production AI data pipelines.

Overview

Overview of cloud data platforms for AI and ML workloads.

Overview

Module overview for Pipeline Orchestration - turning ad-hoc scripts into reliable, observable, recoverable production data pipelines.

Overview

Overview of real-time feature engineering for low-latency ML systems.

Prefect and Modern Orchestration

Prefect orchestration deep dive - flows, tasks, deployments, work pools, automations, and a direct comparison with Apache Airflow.

Production Patterns

Case studies in real-time feature engineering from Uber, Twitter, and LinkedIn.

Real-Time Feature Computation for ML Inference

How to build streaming feature pipelines that compute fresh ML features at production scale, including dual-store architecture, training-serving skew prevention, and hot key mitigation.

Retail Data Engineering

POS data streams, customer data platform architecture, real-time feature computation with Flink, medallion data lake architecture for retail, privacy compliance, and event streaming pipelines for retail ML.

Snowflake for ML

Snowflake architecture, Snowpark, and ML feature serving from Snowflake.

Spark for ML Pipelines

Building production ML feature pipelines with PySpark - window functions, Pandas UDFs, MLlib Pipelines, point-in-time joins, and Delta Lake integration.

SQL at Scale for ML Feature Engineering

Writing production SQL for 10-billion-row datasets - partition pruning, window functions, approximate aggregates, BigQuery optimization, DuckDB, and Spark SQL patterns for ML feature preparation.

Storage Formats for ML Training Data

Why Parquet, Avro, ORC, and Delta Lake exist, how columnar storage enables fast ML pipelines, and how to tune storage formats for maximum throughput and minimum cost.

Stream Processing Patterns for ML Pipelines

Seven production design patterns for streaming ML pipelines - stream enrichment, stream-stream joins, CDC to feature store, streaming inference, feedback loops, and exactly-once end-to-end.

Streaming Pipeline Reliability for ML Systems

How to build streaming ML pipelines that survive failures, handle schema changes, implement dead letter queues, replay events, and monitor themselves - so your fraud model never runs on 3-hour-old features again.

Testing Data Pipelines for ML Correctness

How to test batch ML data pipelines with unit tests, integration tests, and data quality checks - catching label leakage, schema drift, and idempotency bugs before they corrupt your models.