Cost and Performance Trade-offs in Data Infrastructure
How to reason about the latency-throughput-cost triangle, diagnose expensive Spark jobs, optimize cloud data costs with partitioning and caching, and fix data skew that silently kills pipeline performance.
Data Engineering with Python
The complete Python toolkit for data engineering - pandas memory optimization, PyArrow columnar processing, DuckDB analytical SQL, Polars lazy evaluation, and pipeline testing with pandera.
Data Modelling for ML
How to design data models for machine learning - point-in-time correctness, entity-centric tables, SCD Type 2, label leakage prevention, and the training-serving skew problem.
Data Pipeline Patterns for AI/ML Workflows
ETL vs ELT, Lambda vs Kappa architecture, idempotency, exactly-once semantics, backfill strategies, watermarking for late data, and how to design pipelines that reliably serve both model training and real-time inference.
Data Serialization and Schemas
Why the choice of serialization format is an architectural decision - JSON vs Protocol Buffers vs Avro, schema evolution strategies, and how Confluent Schema Registry prevents breaking production pipelines.
Storage Formats for ML Training Data
Why Parquet, Avro, ORC, and Delta Lake exist, how columnar storage enables fast ML pipelines, and how to tune storage formats for maximum throughput and minimum cost.
The Data Engineering Landscape for AI Teams
What data engineers actually do in AI organizations, how data flows from raw sources to model serving, and when the data layer becomes the bottleneck for machine learning.