Feature Engineering at Scale

The Pipeline That Worked - Until It Didn't

It is 11 PM on a Tuesday. Your quarterly model refresh is scheduled to start in one hour. The feature engineering pipeline has run cleanly every week for seven months - a comfortable pandas script that reads from S3, applies transformations, and writes feature vectors back. It handles your training dataset of roughly 10 GB without complaint.

Tonight, the dataset is 500 GB. Three quarters of transaction history, newly backfilled from a data warehouse migration. The job starts, memory climbs, and forty-five minutes in, the worker process gets OOM-killed by the kernel. You restart with a bigger instance. Same thing. You add chunking logic in a panic. The results come out wrong - some user cohorts processed twice, some missing entirely.

The model refresh does not happen. The downstream fraud detection system runs on stale features for six days while you rebuild the pipeline from scratch. In the post-mortem, the phrase "we assumed single-machine would be enough" appears three times.

This scenario is not unusual. It is the standard way that data science teams discover that feature engineering is an engineering problem. The pandas prototype worked because the dataset was small. But the mental model behind it - single machine, in-memory, sequential - never scaled.

Feature engineering at scale means rethinking every assumption: where computation happens, how dependencies between features are tracked, whether features are computed in batch or on the fly, and how much each feature costs to produce. The pipeline that fails at 500 GB is not a bad pipeline that needs tuning. It is a pipeline designed for a different class of problem. You need to build a different thing.

This lesson covers the engineering decisions that separate fragile feature scripts from production-grade feature computation systems.


Why This Exists: The Mismatch Between Research and Production

Feature engineering originated as a manual, expert-driven process. In the early days of machine learning, before deep learning automated representation learning for images and text, practitioners spent most of their time crafting features by hand. A credit risk model might take six months to build, with four of those months spent on feature engineering. The practitioner would compute features on a spreadsheet, verify them by hand, and pass them directly to a model fitting routine.

This worked because datasets were small, models were retrained infrequently, and the same person who built the features also trained the model and understood their behavior. There was no pipeline. There was a script, or sometimes a sequence of SQL queries. The "engineering" was in the analyst's head.

The problems emerged when three things changed simultaneously: datasets grew by orders of magnitude, model retraining became continuous rather than quarterly, and the teams who engineered features became separate from the teams who consumed them. A feature computed in a notebook by a data scientist needs to be re-implemented in production by a platform engineer. Or re-implemented in a different language. Or re-computed every hour instead of every month. These transitions are where value leaks out of the system.

Scale-oriented feature engineering exists to solve the production gap: the difference between the feature computation that works in a development environment and the feature computation that works reliably in production, at volume, every day, across many models simultaneously.


Historical Context

The explicit framing of feature engineering as a scalable pipeline problem emerged alongside the growth of large-scale ML systems at technology companies in the 2010s.

Uber's Michelangelo platform (described in a 2017 blog post) was one of the first public descriptions of a feature store architecture designed specifically to bridge training and serving feature computation. The key insight was that the same feature - say, "driver acceptance rate over the last 30 minutes" - needed to be computed identically in the offline batch training job and the online serving path responding to real-time requests.

LinkedIn's Feathr project (open-sourced in 2022) formalized the concept of a feature dependency graph: the idea that features have dependencies on other features, and that those dependencies need to be tracked explicitly so that recomputation can be scheduled correctly when upstream data changes.

Apache Spark became the de facto standard for distributed feature computation around 2016–2018, replacing MapReduce-era tools that were too low-level and Hive jobs that were too slow for iterative development. Spark's DataFrame API provided a sufficiently high-level abstraction that data scientists could write distributed feature transformations without needing to think in terms of map and reduce operations.

The shift from batch-only to batch-plus-streaming feature computation became pressing around 2019–2021, as real-time ML use cases (fraud detection, recommendation, ride pricing) required features computed from events that were seconds or minutes old, not hours or days old.


Core Concepts

Batch vs. Streaming vs. On-Demand Features

Not all features have the same freshness requirements. Understanding this distinction drives the architecture of your entire feature system.

Batch features are computed on a schedule - hourly, daily, weekly. They read from a historical data store (data warehouse, data lake), apply transformations, and write to a feature store's offline layer. The classic example: "total spend by this customer in the last 30 days." Computing this requires scanning 30 days of transaction history. You cannot do it in real-time for every inference request. You compute it nightly, store the result, and serve the pre-computed value.

Streaming features are computed from event streams in near-real-time. The classic example: "number of failed login attempts in the last 5 minutes." Waiting until the next batch job would make this feature stale and useless for fraud detection. Instead, you read from a Kafka topic, apply windowed aggregations, and write to a low-latency feature store (Redis, DynamoDB, Cassandra) that can serve the value in milliseconds.

On-demand features are computed at inference time from the request payload itself. The classic example: "character count of the search query." There is no meaningful way to pre-compute this - it depends entirely on what the user typed. These features are computed in the serving path, not in a pipeline.

The decision of which category a feature belongs to is one of the most important architectural decisions in feature engineering. Getting it wrong costs money (computing batch features in real-time is expensive) or accuracy (using stale batch features where you needed streaming).
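
The streaming example above ("failed login attempts in the last 5 minutes") can be sketched without any streaming infrastructure. The following is a minimal, in-memory illustration of a windowed count, not a stand-in for Kafka or a stream processor. The class name `SlidingWindowCounter`, the key `user_42`, and the timestamps are all hypothetical, and events are assumed to arrive in timestamp order per key.

```python
from collections import deque
from typing import Deque, Dict


class SlidingWindowCounter:
    """Counts events per key that occurred within the last `window_seconds`."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = {}

    def add(self, key: str, timestamp: float) -> None:
        """Record one event (assumes in-order timestamps per key)."""
        self._events.setdefault(key, deque()).append(timestamp)

    def count(self, key: str, now: float) -> int:
        """Number of events in the half-open window (now - window, now]."""
        events = self._events.get(key)
        if not events:
            return 0
        cutoff = now - self.window_seconds
        # Evict events that have fallen out of the window.
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events)


# Hypothetical failed-login timestamps (seconds) for one user.
counter = SlidingWindowCounter(window_seconds=300)
for ts in (0.0, 100.0, 250.0, 400.0):
    counter.add("user_42", ts)

# At t=420 the window is (120, 420], so only the events at 250 and 400 count.
recent_failures = counter.count("user_42", now=420.0)
```

A production streaming job would also have to deal with late and out-of-order events and with state that outlives a single process, which this sketch deliberately ignores.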

Feature Dependency Graphs

Features are not always independent. A feature like "7-day rolling average spend" depends on "daily spend," which depends on "raw transaction amounts." When upstream data changes, you need to know which downstream features to recompute.

A feature dependency graph makes these relationships explicit and machine-readable. It is a directed acyclic graph (DAG) where nodes are features (or intermediate computations) and edges represent data dependencies.

Without a dependency graph, recomputation is manual and error-prone. Engineers grep through code to find what depends on what, miss something, and serve a partially stale feature set. With a dependency graph, a scheduler can automatically determine which features need recomputation when a specific upstream dataset is updated.
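
As a rough sketch of how a scheduler might use such a graph, the example below stores dependencies as a plain dictionary and uses the standard library's `graphlib.TopologicalSorter` to order the recomputation. The feature names and the helpers `downstream_of` and `recompute_order` are hypothetical. A real feature platform would store this graph in its own metadata layer.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical graph: each node maps to the inputs it depends on.
DEPENDENCIES: Dict[str, Set[str]] = {
    "daily_spend": {"raw_transactions"},
    "daily_txn_count": {"raw_transactions"},
    "spend_7d_avg": {"daily_spend"},
    "spend_30d_avg": {"daily_spend"},
    "spend_velocity": {"spend_7d_avg", "spend_30d_avg"},
    "login_failures_5m": {"raw_login_events"},
}


def downstream_of(changed: str, dependencies: Dict[str, Set[str]]) -> Set[str]:
    """Return every node that transitively depends on `changed`."""
    affected: Set[str] = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for node, inputs in dependencies.items():
            if current in inputs and node not in affected:
                affected.add(node)
                frontier.append(node)
    return affected


def recompute_order(changed: str, dependencies: Dict[str, Set[str]]) -> List[str]:
    """Affected features, ordered so that inputs come before their consumers."""
    affected = downstream_of(changed, dependencies)
    # static_order() yields predecessors first and raises CycleError
    # if the graph is not a DAG.
    full_order = TopologicalSorter(dependencies).static_order()
    return [node for node in full_order if node in affected]


plan = recompute_order("raw_transactions", DEPENDENCIES)
```

Here an update to `raw_transactions` schedules the five spend features and leaves `login_failures_5m` alone, because it depends only on login events.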

Distributed Computation with Spark

The transition from pandas to Spark is the most common scaling intervention for feature pipelines. It is not always the right intervention - for datasets under ~50 GB, optimized pandas or DuckDB may be more cost-effective - but it is the standard solution for datasets in the hundreds of gigabytes to terabytes range.

The key mental shift: Spark operates on partitioned data. Narrow transformations (select, filter, withColumn) are applied independently to each partition; wide transformations (groupBy, join, window functions) need all rows for a key on the same partition. This means:

  1. Partitioning strategy matters. If the data is already partitioned by user_id, aggregations within a user are local. If you aggregate or join on a different key, you trigger a shuffle. Shuffles are expensive - they move data across the network.

  2. Spark is lazy. Transformations build a logical plan; computation happens only at an action (.write.parquet(...), .collect(), .count()). This allows the optimizer to reorder and fuse operations.

  3. Feature pipelines must be idempotent. A distributed job may retry failed tasks and recompute their partitions. If your feature computation has side effects (like incrementing an external counter), retries will produce wrong answers.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder \
    .appName("feature-engineering") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .getOrCreate()

# Read raw transactions from S3
transactions = spark.read.parquet("s3://data-lake/transactions/")

SECONDS_PER_DAY = 86400

# Per-user trailing windows over the daily rows. rangeBetween is inclusive at
# both ends, so "the current day plus the 6 days before it" is a 7-day window.
# Assumes one timestamp per day at a fixed offset (e.g. a UTC session time zone).
user_window_7d = Window.partitionBy("user_id") \
    .orderBy(F.unix_timestamp("txn_date")) \
    .rangeBetween(-6 * SECONDS_PER_DAY, 0)

user_window_30d = Window.partitionBy("user_id") \
    .orderBy(F.unix_timestamp("txn_date")) \
    .rangeBetween(-29 * SECONDS_PER_DAY, 0)

# Build features as a single Spark plan - no intermediate .collect()
features = transactions.groupBy("user_id", "txn_date").agg(
    F.sum("amount").alias("daily_spend"),
    F.count("*").alias("daily_txn_count"),
    F.mean("amount").alias("daily_avg_amount")
).withColumn(
    "spend_7d_avg",
    F.avg("daily_spend").over(user_window_7d)
).withColumn(
    "spend_30d_avg",
    F.avg("daily_spend").over(user_window_30d)
).withColumn(
    "txn_count_7d",
    F.sum("daily_txn_count").over(user_window_7d)
).withColumn(
    "spend_velocity",  # ratio of recent (7d) to longer-term (30d) average spend
    # when() without otherwise() yields null, so a zero 30d average gives null
    F.when(
        F.col("spend_30d_avg") != 0,
        F.col("spend_7d_avg") / F.col("spend_30d_avg")
    )
)

# Write partitioned by date so downstream jobs can read incrementally.
# Note: mode("overwrite") replaces the whole output path unless
# spark.sql.sources.partitionOverwriteMode is set to "dynamic".
features.write \
    .mode("overwrite") \
    .partitionBy("txn_date") \
    .parquet("s3://feature-store/user-spend-features/")
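
Window boundaries are easy to get subtly wrong, so it can help to check the intended semantics on a handful of rows before running at scale. The sketch below is plain Python, not Spark. It computes a trailing average over calendar days the way a range-based window over existing rows would: a window of N days including the current day covers day offsets 0 to N-1, and days with no row are skipped rather than treated as zero. The function name `trailing_avg` and the sample data are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict


def trailing_avg(daily_spend: Dict[date, float], day: date, window_days: int) -> float:
    """Average of the days that have data in the trailing window ending on `day`.

    The window covers `window_days` calendar days including `day` itself,
    i.e. offsets 0 .. window_days - 1. Days without a row are skipped.
    """
    if day not in daily_spend:
        raise KeyError(f"no row for {day}")
    values = [
        daily_spend[day - timedelta(days=offset)]
        for offset in range(window_days)
        if day - timedelta(days=offset) in daily_spend
    ]
    return sum(values) / len(values)


# Hypothetical daily spend for one user; note the gap between Jan 2 and Jan 8.
sample = {
    date(2024, 1, 1): 10.0,
    date(2024, 1, 2): 20.0,
    date(2024, 1, 8): 30.0,
}

avg_7d = trailing_avg(sample, date(2024, 1, 8), window_days=7)  # Jan 2 .. Jan 8
avg_8d = trailing_avg(sample, date(2024, 1, 8), window_days=8)  # Jan 1 .. Jan 8
```

The 7-day window ending on January 8 starts on January 2, so it averages 20 and 30. Widening the window by a single day pulls in January 1 and changes the result, which is the kind of off-by-one that a small fixture like this makes visible.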

Feature Compute Cost Management

Feature computation is not free. Every feature you add to the pipeline has a cost: compute time, memory, I/O, and storage. In large-scale systems, this cost compounds quickly.

The key insight is that not all features justify their compute cost. A feature that requires scanning 365 days of raw events and costs $2,000/month to compute, but contributes 0.001 AUC to the model, should be retired. Feature engineers rarely think about this; MLOps engineers must.

Practical cost management strategies:

Incremental computation: Instead of recomputing all features from scratch daily, compute only the delta. If yesterday's rolling-7-day features are already computed, today's job only needs to process new transactions and slide the window.

Feature tiering: Separate high-freshness features (computed every 5 minutes, stored in Redis) from low-freshness features (computed nightly, stored in S3). Do not compute all features on all schedules.

Pre-computation vs. just-in-time: For features that are expensive to compute but used by many models, pre-compute and cache. For features that are cheap to compute but rarely needed, compute on demand.

Feature deprecation: Track which features are actually read by live models. Features that have not been consumed in 90 days should be reviewed for retirement. Unused feature pipelines cost money and create maintenance burden.
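
The incremental computation strategy above can be illustrated with a rolling sum that is updated from yesterday's state instead of rescanning the whole window. This is a minimal single-entity sketch, assuming exactly one value per consecutive day (with 0.0 for days that have no activity). The class name `IncrementalRollingSum` is hypothetical.

```python
from collections import deque
from typing import Deque


class IncrementalRollingSum:
    """Maintains a rolling sum over the last `window_days` daily values."""

    def __init__(self, window_days: int) -> None:
        self.window_days = window_days
        self._days: Deque[float] = deque()
        self.total = 0.0

    def advance(self, todays_value: float) -> float:
        """Add today's value, drop the value that left the window, return the sum."""
        self._days.append(todays_value)
        self.total += todays_value
        if len(self._days) > self.window_days:
            self.total -= self._days.popleft()
        return self.total


daily_values = [10.0, 20.0, 30.0, 40.0]

rolling = IncrementalRollingSum(window_days=3)
incremental = [rolling.advance(v) for v in daily_values]

# Full recompute for comparison: rescan the last 3 values on every day.
full_recompute = [
    sum(daily_values[max(0, i - 2): i + 1]) for i in range(len(daily_values))
]
```

Both approaches give the same sums, but the incremental version touches one new value and one expired value per day, while the full recompute rereads the whole window. With floating-point data over long horizons, the running total can drift slightly, which is one reason periodic full recomputes remain useful as a check.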

# Cost tracking: measure Spark job resource usage per feature group
from dataclasses import dataclass
from typing import Dict


@dataclass
class FeaturePipelineCost:
    feature_group: str
    spark_dbu_hours: float  # DBU-hours (Databricks Units, the Databricks billing unit)
    data_read_gb: float
    data_written_gb: float
    runtime_minutes: float
    monthly_compute_usd: float  # estimated


def estimate_monthly_cost(
    runtime_minutes: float,
    cluster_cost_per_hour: float,
    runs_per_month: int
) -> float:
    """Estimate monthly compute cost for a feature pipeline."""
    hours_per_run = runtime_minutes / 60.0
    cost_per_run = hours_per_run * cluster_cost_per_hour
    return cost_per_run * runs_per_month


# Example: tracking feature group costs
feature_costs: Dict[str, FeaturePipelineCost] = {
    "user_spend_features": FeaturePipelineCost(
        feature_group="user_spend_features",
        spark_dbu_hours=2.4,
        data_read_gb=180.0,
        data_written_gb=12.0,
        runtime_minutes=47.0,
        monthly_compute_usd=estimate_monthly_cost(
            runtime_minutes=47.0,
            cluster_cost_per_hour=3.20,  # Databricks + EC2
            runs_per_month=30
        )
    ),
}

for name, cost in feature_costs.items():
    print(f"{name}: ${cost.monthly_compute_usd:.0f}/month, "
          f"{cost.runtime_minutes:.0f} min/run")
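
To connect cost back to model value, one option is to compare each feature group's monthly cost with the AUC it contributes. The sketch below assumes the AUC contribution has already been measured offline (for example by ablation), and the threshold is an arbitrary illustration rather than a recommended budget. The names `FeatureValue` and `retirement_candidates` and the feature group names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FeatureValue:
    name: str
    monthly_cost_usd: float
    auc_gain: float  # AUC lost when the group is removed (assumed measured offline)


def retirement_candidates(
    features: List[FeatureValue],
    max_usd_per_auc_point: float,
) -> List[str]:
    """Names of feature groups costing more than the budget per 0.01 AUC."""
    candidates = []
    for feature in features:
        if feature.auc_gain <= 0:
            # No measurable contribution: always worth reviewing.
            candidates.append(feature.name)
            continue
        usd_per_point = feature.monthly_cost_usd / (feature.auc_gain / 0.01)
        if usd_per_point > max_usd_per_auc_point:
            candidates.append(feature.name)
    return candidates


# Illustrative numbers only: the first row mirrors the $2,000 / 0.001 AUC example.
portfolio = [
    FeatureValue("events_365d_features", monthly_cost_usd=2000.0, auc_gain=0.001),
    FeatureValue("user_spend_features", monthly_cost_usd=75.0, auc_gain=0.02),
    FeatureValue("legacy_flags", monthly_cost_usd=40.0, auc_gain=0.0),
]

to_review = retirement_candidates(portfolio, max_usd_per_auc_point=1000.0)
```

The output is a review list, not an automatic deletion list: a feature with a small AUC contribution may still matter for a specific segment or for a metric other than AUC.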

Feature Engineering for Different Data Types

Different data modalities require fundamentally different engineering approaches. A pipeline built for tabular numerical data will fail when applied to text or time series.

Tabular / structured data: Scaling, encoding, imputation, binning. Covered in depth in Lesson 03.

Time series: Lag features, rolling statistics, Fourier transforms for seasonality. Covered in Lesson 04.

Text: TF-IDF, embeddings, text cleaning, dimensionality reduction. Covered in Lesson 05.

Images: Pre-computed embeddings (ResNet, CLIP), spatial statistics, metadata features. The image itself is rarely fed raw into a feature store; a 512-dimensional embedding vector is.

Graph / relational: Network features (degree, centrality, community membership), derived from relationships between entities. Expensive to compute at scale; requires graph processing frameworks (GraphX, NetworkX on sampled data).


Production Engineering Notes

Handling Pipeline Failures at Scale

Distributed feature pipelines fail differently than single-machine scripts. Common failure modes:

  • Partial writes: Job completes 60% before OOM kill. Downstream reads mix new and old data.
  • Skewed partitions: 80% of data is in one Spark partition. That task runs 10× longer than others. The job appears stuck.
  • Silent data loss: A filter bug excludes 5% of users. No error is thrown. Model silently degrades.

Mitigations:

  • Write to a staging path, validate, then atomically swap to production path
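  • Enable Spark's AQE (Adaptive Query Execution) skew-join handling to split skewed join partitions automatically; when the skew is in an aggregation rather than a join, salt or repartition the key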
  • Add record count assertions before and after every major transformation step
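
The first and third mitigations can be combined in a small publish step: write to a versioned staging location, assert the record count, and only then flip a pointer that readers follow. The sketch below uses a local POSIX filesystem, where `os.replace` swaps the pointer file atomically. Object stores such as S3 have no atomic rename, so there the same idea is usually implemented through table metadata or a manifest instead. The directory layout, the `CURRENT` pointer file, and the `publish` helper are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, List


def publish(rows: List[Dict], root: Path, version: str, expected_count: int) -> None:
    """Write to a versioned staging directory, validate, then flip the pointer."""
    staging = root / f"v={version}"
    staging.mkdir(parents=True, exist_ok=True)
    data_file = staging / "features.jsonl"
    with data_file.open("w") as fh:  # "w" truncates, so a retry rewrites cleanly
        for row in rows:
            fh.write(json.dumps(row) + "\n")

    # Validate before anything downstream can see the new version.
    with data_file.open() as fh:
        written = sum(1 for _ in fh)
    if written != expected_count:
        raise ValueError(f"expected {expected_count} rows, wrote {written}")

    # Readers resolve the live version through CURRENT; replace it in one step.
    tmp_pointer = root / "CURRENT.tmp"
    tmp_pointer.write_text(version)
    os.replace(tmp_pointer, root / "CURRENT")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    publish([{"user_id": "a"}, {"user_id": "b"}], root, "2024-01-01", expected_count=2)
    try:
        # Simulated partial write: one row arrives where two were expected.
        publish([{"user_id": "a"}], root, "2024-01-02", expected_count=2)
    except ValueError:
        pass
    live_version = (root / "CURRENT").read_text()
```

After the failed second run, the pointer still names the first version, so downstream readers never see the partial output.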

Schema Evolution

Raw data schemas change. A column gets renamed, a new field is added, a data type changes from integer to string. Feature pipelines that hardcode column names break silently or loudly when schemas evolve.

Best practice: define feature schemas explicitly and validate incoming data against them before any transformation. Libraries like Great Expectations, Pandera, or dbt's schema tests make this programmatic.
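
The idea of validating against a declared schema can be shown with the standard library alone. The sketch below is deliberately not the API of Great Expectations, Pandera, or dbt. It only illustrates the check those tools automate, using a hypothetical `EXPECTED_SCHEMA` and a hypothetical `validate_rows` helper over dictionaries.

```python
from typing import Dict, List

# Hypothetical declared schema: column name -> expected Python type.
EXPECTED_SCHEMA: Dict[str, type] = {
    "user_id": str,
    "txn_date": str,
    "amount": float,
}


def validate_rows(rows: List[dict], schema: Dict[str, type]) -> List[str]:
    """Return a list of human-readable schema violations (empty if clean)."""
    errors = []
    for i, row in enumerate(rows):
        missing = schema.keys() - row.keys()
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
        for column, expected_type in schema.items():
            if column in row and not isinstance(row[column], expected_type):
                errors.append(
                    f"row {i}: {column} is {type(row[column]).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return errors


good_rows = [{"user_id": "u1", "txn_date": "2024-01-01", "amount": 12.5}]
# Upstream change: amount became a string and txn_date was dropped.
bad_rows = [{"user_id": "u1", "amount": "12.5"}]

good_errors = validate_rows(good_rows, EXPECTED_SCHEMA)
bad_errors = validate_rows(bad_rows, EXPECTED_SCHEMA)
```

Running a check like this before any transformation turns a silent type change into an explicit failure with a message that names the offending column.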

Backfill Strategy

When you add a new feature, you need to backfill it historically to populate the training dataset. Backfilling a feature that requires 2 years of historical data is a different job than the incremental daily run. Plan for this explicitly - backfill jobs often need different cluster configurations (more memory, more parallelism) and must not interfere with production pipeline schedules.
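
One common way to keep a backfill manageable is to split the historical range into bounded chunks that can be scheduled and retried independently. The sketch below only generates the date windows. The helper name `backfill_windows` and the chunk size are hypothetical, and the job that processes each window is left out.

```python
from datetime import date, timedelta
from typing import List, Tuple


def backfill_windows(start: date, end: date, chunk_days: int) -> List[Tuple[date, date]]:
    """Split [start, end] (inclusive) into consecutive chunks of at most chunk_days."""
    if chunk_days < 1:
        raise ValueError("chunk_days must be at least 1")
    windows = []
    cursor = start
    while cursor <= end:
        chunk_end = min(cursor + timedelta(days=chunk_days - 1), end)
        windows.append((cursor, chunk_end))
        cursor = chunk_end + timedelta(days=1)
    return windows


windows = backfill_windows(date(2024, 1, 1), date(2024, 1, 10), chunk_days=4)
```

For rolling-window features, each chunk's job also needs to read the lookback period before the chunk's start date (for example 29 extra days for a 30-day average), even though it only writes output for the dates inside the chunk.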


Common Mistakes

:::danger Recomputing all features when only one changed
If your feature pipeline is a single monolithic job that computes all features together, changing one feature forces a full recompute. This wastes compute and extends time-to-deploy for feature changes. Structure your pipeline as independent, composable feature groups that can be recomputed independently.
:::

:::danger No idempotency guarantees
A feature pipeline that produces different results when run twice on the same input will corrupt your feature store on any retry. Every transformation must be deterministic and side-effect-free. The output of running the pipeline twice should be identical to running it once.
:::

:::warning Using .toPandas() in a Spark pipeline
Calling df.toPandas() on a large Spark DataFrame collects all data to the driver node. This negates the benefit of distributed computation and will OOM-kill the driver for large datasets. Never call .toPandas() in a production pipeline on a large dataset. Use native Spark operations throughout.
:::

:::warning Not tracking feature lineage
If you cannot answer "which version of this feature's computation logic was used to train model v2.3?" you cannot debug regressions, reproduce results, or satisfy audit requirements. Feature lineage - the mapping from model to feature version to computation code - must be tracked explicitly.
:::

:::tip Incremental over full recompute
Design feature pipelines for incremental computation from day one. Full recomputes should be the exception (backfills, bug fixes), not the rule. For pipelines dominated by rolling-window features, incremental computation can cut cost by 70-90%.
:::


Interview Q&A

Q: What is the difference between a batch feature and a streaming feature, and how do you decide which one a new feature should be?

A: A batch feature is computed on a schedule from historical data - it has freshness measured in hours or days. A streaming feature is computed continuously from event streams - freshness measured in seconds or minutes. The decision is driven by two questions: how fresh does this feature need to be to be useful, and how expensive is it to compute in real-time? Features like "user's total spend this year" don't need to be fresh - a daily recompute is fine, and computing them in real-time would be prohibitively expensive. Features like "failed login attempts in the last 5 minutes" are useless unless they're near-real-time. Most features fall naturally into one category based on their definition.

Q: What is a feature dependency graph and why is it important for MLOps?

A: A feature dependency graph is a DAG where nodes are features or intermediate data artifacts, and edges represent data dependencies. It matters for three reasons. First, it enables correct incremental recomputation: when upstream data changes, the scheduler knows exactly which downstream features need refreshing. Second, it enables impact analysis: if you change the computation logic for a base feature, you can identify all downstream features and models that will be affected. Third, it supports lineage tracking: you can trace any model prediction back through its features to the raw data it was computed from. Without it, teams manually trace dependencies through code, miss things, and cause silent feature corruption.

Q: How do you handle schema evolution in a large-scale feature pipeline without breaking downstream consumers?

A: The key is treating schema changes like API changes - with versioning and backward compatibility. Additive changes (new columns) should be non-breaking by default; downstream consumers ignore columns they don't know about. Breaking changes (renaming columns, changing types) require a versioned migration: introduce the new schema in a new version of the feature group, migrate consumers, then deprecate the old version. In practice, this means defining feature schemas in a schema registry (or at minimum in code), validating incoming data against the declared schema, and running schema compatibility checks in CI before deploying pipeline changes.
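
A compatibility check of the kind that could run in CI can be sketched as a comparison of two schema versions. The example assumes schemas are represented as simple column-to-type-name dictionaries. The function `classify_schema_change` and the sample schemas are hypothetical, not the interface of any particular schema registry.

```python
from typing import Dict


def classify_schema_change(old: Dict[str, str], new: Dict[str, str]) -> str:
    """Classify a schema change as 'breaking', 'additive', or 'unchanged'."""
    removed = old.keys() - new.keys()
    retyped = {col for col in old.keys() & new.keys() if old[col] != new[col]}
    if removed or retyped:
        # A rename shows up as one removed plus one added column: breaking.
        return "breaking"
    if new.keys() - old.keys():
        return "additive"
    return "unchanged"


v1 = {"user_id": "string", "amount": "double"}
v2_added = {"user_id": "string", "amount": "double", "currency": "string"}
v2_renamed = {"user_id": "string", "txn_amount": "double"}
v2_retyped = {"user_id": "string", "amount": "string"}

added_result = classify_schema_change(v1, v2_added)
renamed_result = classify_schema_change(v1, v2_renamed)
retyped_result = classify_schema_change(v1, v2_retyped)
```

A CI job could fail the build on "breaking" unless the change ships as a new feature group version, which matches the versioned-migration approach described in the answer.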

Q: A Spark feature pipeline that takes 2 hours suddenly takes 8 hours. How do you diagnose the slowdown?

A: Start with the Spark UI. Look for stages with very high max task duration relative to median - this indicates partition skew. Check GC pressure in the executor metrics - high GC time usually means you're running with too little memory per executor. Look for shuffle read/write volume changes - a new join or groupBy may be triggering a much larger shuffle than before. Check for data volume changes - is the input dataset 4× larger? After locating the slow stage, look at the plan: is there an accidental cross join (Cartesian product), or a join on a heavily skewed key? Common fixes: repartition by a more evenly distributed key, enable AQE skew-join handling, filter data earlier in the plan, or increase the number of shuffle partitions.
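
The "max task duration relative to median" check is simple enough to script once task durations for a stage have been copied out of the Spark UI or its event logs. The sketch below assumes the durations are already available as a list of seconds. The helper names and the threshold of 5 are illustrative assumptions, not a Spark default.

```python
from statistics import median
from typing import List


def skew_ratio(task_durations_s: List[float]) -> float:
    """Ratio of the slowest task to the median task in one stage."""
    mid = median(task_durations_s)
    if mid == 0:
        raise ValueError("median task duration is zero")
    return max(task_durations_s) / mid


def looks_skewed(task_durations_s: List[float], threshold: float = 5.0) -> bool:
    """Flag a stage whose slowest task is far above the median (threshold is arbitrary)."""
    return skew_ratio(task_durations_s) >= threshold


# Hypothetical stage: four tasks finish in about 30 s, one takes over 5 minutes.
stage_durations = [30.0, 32.0, 31.0, 29.0, 310.0]

ratio = skew_ratio(stage_durations)
skewed = looks_skewed(stage_durations)
```

A ratio near 1 suggests the slowdown is spread evenly (more data, less memory, a bigger shuffle), while a large ratio points at one or a few oversized partitions.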

Q: How do you validate that a new feature does not leak future information into the training dataset?

A: Leakage validation requires thinking about the timeline of each data point. For each feature, identify when it becomes known. "User's 7-day spend" is known at the end of the 7-day window - if you use this feature to predict an event that happened on day 4 of the window, you've leaked future data. In practice: define the feature's "knowledge cutoff" explicitly, use point-in-time correct joins when building training datasets (joining feature values as of the label timestamp, not as of today), and run a temporal validation: train on the first 80% of time, evaluate on the last 20%. If performance drops sharply between the validation split and a held-out future test set, suspect leakage. A model with near-perfect offline validation performance but poor production performance is the classic leakage signature.
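
A point-in-time correct lookup can be sketched with a sorted history and a binary search: for each label timestamp, take the most recent feature value that was already available at that moment. The example assumes each feature value carries the time it became available, and it treats a value available exactly at the label timestamp as usable, which is a design choice that may need to be stricter in practice. The function name and the sample history are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def point_in_time_value(
    feature_history: List[Tuple[int, float]],
    as_of: int,
) -> Optional[float]:
    """Latest feature value with available_at <= as_of, or None if none exists.

    `feature_history` is a list of (available_at, value) sorted by available_at.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, as_of)
    if idx == 0:
        return None  # the feature did not exist yet at label time
    return feature_history[idx - 1][1]


# Hypothetical history of one user's 7-day spend: (available_at, value).
history = [(100, 5.0), (200, 7.5), (300, 9.0)]

value_at_250 = point_in_time_value(history, as_of=250)  # uses the value from t=200
value_at_50 = point_in_time_value(history, as_of=50)    # nothing known yet
```

Joining on "latest value as of today" would hand the t=300 value to a label observed at t=250, which is exactly the leak described above.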

Q: What are the main cost drivers in a feature engineering pipeline, and how do you manage them?

A: The main drivers are compute (cluster size × runtime), I/O (data read/written from cloud storage), and storage (feature store retention). Compute is managed by right-sizing clusters, using spot instances, and switching from full recompute to incremental computation. I/O is managed by efficient file formats (Parquet over CSV), predicate pushdown (filter early, before reading unnecessary partitions), and columnar reads (read only the columns you need). Storage is managed by tiering (keep last 90 days in hot storage, archive older data) and feature deprecation (retire features no model uses). The highest-leverage intervention is almost always incremental computation - most pipelines can reduce runtime by 80% by computing only what changed since the last run.

© 2026 EngineersOfAI. All rights reserved.