9 docs tagged with "batch-processing"

Apache Spark Architecture

How Spark's distributed execution model works - RDDs, DataFrames, DAG planning, Catalyst optimization, and Tungsten execution - explained for engineers building ML data pipelines.

Batch Processing with LLMs

Efficiently processing large document sets with LLM batch APIs - Anthropic Batch API, cost optimization, monitoring, checkpointing, and production patterns for overnight and large-scale LLM workloads.

dbt Advanced Patterns for ML Teams

Advanced dbt techniques for large-scale ML pipelines - snapshots for SCD2, point-in-time correct features, slim CI, dbt-utils macros, and production deployment patterns.

dbt for ML Feature Preparation

How dbt brings lineage, testing, documentation, and version control to SQL-based ML data pipelines, replacing fragile cron-driven script chains.

Spark for ML Pipelines

Building production ML feature pipelines with PySpark - window functions, Pandas UDFs, MLlib Pipelines, point-in-time joins, and Delta Lake integration.

SQL at Scale for ML Feature Engineering

Writing production SQL for 10-billion-row datasets - partition pruning, window functions, approximate aggregates, BigQuery optimization, DuckDB, and Spark SQL patterns for ML feature preparation.

Testing Data Pipelines for ML Correctness

How to test batch ML data pipelines with unit tests, integration tests, and data quality checks - catching label leakage, schema drift, and idempotency bugs before they corrupt your models.