7 docs tagged with "de-foundations"

Data Engineering with Python

The complete Python toolkit for data engineering: pandas memory optimization, PyArrow columnar processing, DuckDB analytical SQL, Polars lazy evaluation, and pipeline testing with pandera.

Data Modelling for ML

How to design data models for machine learning: point-in-time correctness, entity-centric tables, SCD Type 2, label leakage prevention, and the training-serving skew problem.

Data Pipeline Patterns for AI/ML Workflows

ETL vs ELT, Lambda vs Kappa architecture, idempotency, exactly-once semantics, backfill strategies, watermarking for late data, and how to design pipelines that reliably serve both model training and real-time inference.

Data Serialization and Schemas

Why serialization format is an architectural decision: JSON vs Protocol Buffers vs Avro, schema evolution strategies, and how Confluent Schema Registry prevents breaking production pipelines.

Storage Formats for ML Training Data

Why Parquet, Avro, ORC, and Delta Lake exist, how columnar storage enables fast ML pipelines, and how to tune storage formats for maximum throughput and minimum cost.