72 docs tagged with "data-engineering"

Advanced Spark Performance Tuning for ML Workloads

Systematic techniques to diagnose and eliminate Spark bottlenecks - data skew, shuffle overhead, memory pressure, and suboptimal joins - reducing job time and cost by up to 10x.

Airflow for ML Pipelines

Orchestrate ML training pipelines with Airflow - data quality gates, KubernetesPodOperator training, champion/challenger evaluation, and conditional deployment.

Anomaly Detection in Pipelines

Statistical anomaly detection for data drift, schema drift, and volume changes.

Apache Airflow Architecture

Deep dive into Apache Airflow - DAGs, Scheduler internals, Executors, Operators, XCom, and production patterns for reliable pipeline orchestration.

Apache Flink Fundamentals

Apache Flink for stateful stream processing - DataStream API, windows, watermarks, state backends, checkpointing, and PyFlink for ML feature computation.

Apache Hudi

Hudi's copy-on-write vs merge-on-read table types, and upsert patterns.

Apache Iceberg

Iceberg table format, ACID transactions, schema evolution, and time travel.

Apache Kafka Architecture - The Nervous System of Real-Time ML

A deep dive into Kafka's distributed commit log, partitions, replication, consumer groups, compacted topics, and the architectural decisions that make it the standard event transport for production ML systems.

Apache Spark Architecture

How Spark's distributed execution model works - RDDs, DataFrames, DAG planning, Catalyst optimization, and Tungsten execution - explained for engineers building ML data pipelines.

AWS Data Services

S3, Glue, Athena, EMR, and the AWS data engineering ecosystem.

Batch Orchestration Patterns for ML Pipelines

How to orchestrate complex batch ML pipelines with Airflow and modern alternatives, eliminating cron's silent failures, missing dependencies, and lack of visibility.

Choosing an Orchestrator for Your AI Data Stack

What Airflow, Prefect, Dagster, and Temporal each do for AI systems, when your ML pipeline complexity and team maturity dictate which orchestrator fits best, and how to apply a structured decision framework to select the right tool for production AI data pipelines.

Cost and Performance Trade-offs in Data Infrastructure

How to reason about the latency-throughput-cost triangle, diagnose expensive Spark jobs, optimize cloud data costs with partitioning and caching, and fix data skew that silently kills pipeline performance.

Custom Data Monitoring

Building custom monitoring with Great Expectations and statistical tests.

Dagster for Data Assets

Asset-based orchestration, software-defined assets, and Dagster's lineage model.

Data Catalog and Discovery

Apache Atlas, DataHub, and Amundsen - cataloguing data for ML teams.

Data Engineering for AI

The data infrastructure foundation for AI/ML systems - batch and stream processing, feature stores, data lakehouse, pipeline orchestration, and real-time feature engineering.

Data Engineering with Python

The complete Python toolkit for data engineering - pandas memory optimization, PyArrow columnar processing, DuckDB analytical SQL, Polars lazy evaluation, and pipeline testing with pandera.

Data Governance for AI Training Datasets

What column-level security, data lineage, and cataloguing do for AI systems, when regulated AI training data requires auditability and access controls across the lakehouse, and how to implement governance with Apache Atlas and Unity Catalog in production AI data pipelines.

Data Incident Management

Runbooks, on-call rotations, and root cause analysis for data incidents.

Data Lake vs Warehouse vs Lakehouse for AI Workloads

What each storage architecture does for AI systems, when ML teams need both raw unstructured data and structured query access on the same platform, and how to choose and implement the right architecture in production AI data pipelines.

Data Lineage

Column-level lineage, impact analysis, and tools like OpenLineage and DataHub.

Data Modelling for ML

How to design data models for machine learning - point-in-time correctness, entity-centric tables, SCD Type 2, label leakage prevention, and the training-serving skew problem.

Data Pipeline Patterns for AI/ML Workflows

ETL vs ELT, Lambda vs Kappa architecture, idempotency, exactly-once semantics, backfill strategies, watermarking for late data, and how to design pipelines that reliably serve both model training and real-time inference.

Data Platform Cost Optimisation for AI Teams

What query optimisation, storage tiering, and cloud cost controls do for AI systems, when large-scale model training and feature computation drive unpredictable cloud spend, and how to implement cost reduction strategies in production AI data pipelines.

Data Quality Dimensions That Determine Model Quality

A deep engineering dive into the five dimensions of data quality - completeness, accuracy, consistency, timeliness, and uniqueness - and how each one silently corrupts AI systems in production.

Data Quality for ML

How poor data quality degrades ML model performance - detection and remediation.

Data Serialization and Schemas

Why serialization format is an architectural decision - JSON vs Protocol Buffers vs Avro, schema evolution strategies, and how Confluent Schema Registry prevents breaking production pipelines.

Data SLAs and Incident Response

Defining data SLAs, monitoring, alerting, and runbooks for data incidents.

Databricks

Databricks Lakehouse, Unity Catalog, MLflow integration, and AutoML.

dbt Advanced Patterns for ML Teams

Advanced dbt techniques for large-scale ML pipelines - snapshots for SCD2, point-in-time correct features, slim CI, dbt-utils macros, and production deployment patterns.

dbt for ML Feature Preparation

How dbt brings lineage, testing, documentation, and version control to SQL-based ML data pipelines, replacing fragile cron-driven script chains.

dbt Tests for Quality

Schema tests, custom tests, and data quality gates in dbt pipelines.

Delta Lake

Delta Lake on Databricks, merge operations, and Change Data Capture.

Embedding Stores

Storing and serving dense embeddings at scale for real-time recommendation and search.

Feature Consistency

Ensuring identical features between training (offline) and serving (online).

Feature Monitoring

Detecting feature drift, staleness, and coverage gaps in production.

Feature Store Architecture

Online store, offline store, feature registry, and the dual-write pattern.

Five Pillars of Data Observability for ML Systems

What freshness, distribution, volume, schema, and lineage tracking do for AI systems, when data drift and pipeline failures silently corrupt model inputs and degrade predictions, and how to instrument these five pillars in production AI data pipelines.

Google BigQuery

BigQuery architecture, built-in ML functions, and BigQuery ML.

Great Expectations

Writing data expectations, validations, and building a data quality suite.

Kafka for ML Systems

Using Apache Kafka as the backbone of production ML systems - schema registry, CDC, exactly-once semantics, and dead letter queues.

Kafka Streams vs Apache Flink - The ML Pipeline Decision Guide

A comprehensive comparison of Kafka Streams, Faust, and Apache Flink for building real-time ML feature pipelines, with a production decision framework and working code examples.

Lakehouse for ML Workflows

Storing training datasets, experiment artifacts, and model outputs in a lakehouse.

Lakehouse Query Engines

Trino, DuckDB, and Spark SQL - querying open table formats at scale.

Low-Latency Feature Serving

Redis, Cassandra, and in-memory stores for sub-millisecond feature retrieval.

Module 3: Stream Processing for Real-Time AI

Eight lessons covering Apache Kafka, Apache Flink, stream processing patterns, real-time feature computation, and production reliability for ML systems that cannot tolerate batch latency.

Monte Carlo and Observability Platforms

Monte Carlo, Bigeye, and Soda - managed data observability.

Multi-Cloud Data Strategies for AI Workloads

What multi-cloud data architectures do for AI systems, when vendor lock-in and data gravity risks threaten the portability of ML training and serving infrastructure, and how to design resilient multi-cloud strategies for production AI data pipelines.

Online vs Offline Features

The fundamental split between pre-computed offline and real-time online features.

Orchestration Patterns for End-to-End ML Pipelines

What dynamic DAGs, sensors, and fan-out/fan-in patterns do for AI systems, when ML workflows require data-aware scheduling and conditional branching across training and serving stages, and how to apply these patterns in production AI data pipelines.

Overview

Overview of cloud data platforms for AI and ML workloads.

Overview

Module overview for Pipeline Orchestration - turning ad-hoc scripts into reliable, observable, recoverable production data pipelines.

Overview

Overview of real-time feature engineering for low-latency ML systems.

Point-in-Time Correctness

Time-travel queries, point-in-time joins, and preventing data leakage.

Prefect and Modern Orchestration

Prefect orchestration deep dive - flows, tasks, deployments, work pools, automations, and a direct comparison with Apache Airflow.

Production Patterns

Case studies in real-time feature engineering from Uber, Twitter, and LinkedIn.

Real-Time Aggregations

Windowed aggregations, sessionisation, and user behaviour features in real time.

Real-Time Feature Computation for ML Inference

How to build streaming feature pipelines that compute fresh ML features at production scale, including dual-store architecture, training-serving skew prevention, and hot key mitigation.

Retail Data Engineering

POS data streams, customer data platform architecture, real-time feature computation with Flink, medallion data lake architecture for retail, privacy compliance, and event streaming pipelines for retail ML.

Snowflake for ML

Snowflake architecture, Snowpark, and ML feature serving from Snowflake.

Spark for ML Pipelines

Building production ML feature pipelines with PySpark - window functions, Pandas UDFs, MLlib Pipelines, point-in-time joins, and Delta Lake integration.

SQL at Scale for ML Feature Engineering

Writing production SQL for 10-billion-row datasets - partition pruning, window functions, approximate aggregates, BigQuery optimization, DuckDB, and Spark SQL patterns for ML feature preparation.

Storage Formats for ML Training Data

Why Parquet, Avro, ORC, and Delta Lake exist, how columnar storage enables fast ML pipelines, and how to tune storage formats for maximum throughput and minimum cost.

Stream Processing Patterns for ML Pipelines

Seven production design patterns for streaming ML pipelines, including stream enrichment, stream-stream joins, CDC to feature store, streaming inference, feedback loops, and exactly-once end-to-end.

Stream-to-Feature Pipelines

Computing features from event streams with Kafka and Flink.

Streaming Concepts - Why Batch Fails for Real-Time ML

The fundamental theory of stream processing - event time, processing time, watermarks, windowing, delivery semantics, and backpressure - through the lens of ML systems that cannot afford batch latency.

Streaming Pipeline Reliability for ML Systems

How to build streaming ML pipelines that survive failures, handle schema changes, implement dead letter queues, replay events, and monitor themselves - so your fraud model never runs on 3-hour-old features again.

Testing and Monitoring Pipelines

Unit testing DAGs, SLA monitoring, and alerting on pipeline failures.

Testing Data Pipelines for ML Correctness

How to test batch ML data pipelines with unit tests, integration tests, and data quality checks - catching label leakage, schema drift, and idempotency bugs before they corrupt your models.

The Data Engineering Landscape for AI Teams

What data engineers actually do in AI organizations, how data flows from raw sources to model serving, and when the data layer becomes the bottleneck for machine learning.

Why Feature Stores Exist

Training-serving skew, feature reuse, and the operational challenges feature stores solve.