Databricks
A company evaluated three options for running the same pipeline: Spark on EMR, Glue, and Databricks. The pipeline was a daily feature engineering job over 2 TB of user events, producing a training dataset for a churn model that fed into a production prediction service.
EMR took three days to get right. The cluster configuration alone - instance types, Spark memory settings, Iceberg configurations, IAM roles - required a specialist. The first run failed with a memory overflow. The second run failed with a serialization error. By day three it worked, but there was no monitoring, no data quality checks, and no model tracking.
Glue was simpler but had a 10-minute startup delay on every job. For a daily job that took 45 minutes, that overhead was acceptable - but Glue's Spark version lagged EMR by 6 months, the DPU pricing was hard to predict, and there was no built-in MLflow, no feature store, and no SQL analytics layer.
Databricks had the pipeline running in 2 hours. Delta Lake handled ACID writes. Delta Live Tables provided declarative data quality checks. MLflow tracked every training run. The Databricks Feature Store handled training-serving skew. SQL Analytics let the data science team query features from a BI tool. The productivity difference was not marginal.
This lesson explains what Databricks is, how its components fit together, and how to use it to build a production ML data platform.
Why This Exists - The Fragmented Data + ML Stack
Before Databricks, a production ML platform required assembling and maintaining separate systems: Spark for transformations, a custom feature store for serving, MLflow (self-hosted) for experiment tracking, a warehouse for SQL analytics, and some bespoke job scheduler to tie it all together. Each system had its own authentication, its own monitoring, its own upgrade cycle.
Databricks was founded in 2013 by the creators of Apache Spark. Their insight was that data engineering and ML are not separate disciplines that need separate tools - they are parts of the same workflow, and fragmenting the stack creates enormous friction. The Databricks Lakehouse Platform is an attempt to unify storage (Delta Lake), compute (Databricks Runtime), ML lifecycle (MLflow, Feature Store), SQL analytics (Databricks SQL), and governance (Unity Catalog) under a single managed service.
What Databricks Is
Databricks is a managed Apache Spark platform - but calling it "managed Spark" is like calling AWS "managed Linux." It is a full analytics and ML platform built on Spark as its computation foundation.
Databricks Runtime - Optimized Spark
The Databricks Runtime (DBR) is a modified build of Apache Spark that combines proprietary optimizations with tuned defaults for open-source features:
Delta Cache (now called the disk cache): DBR transparently caches hot data files on worker node SSDs. The first query scans Parquet files from S3 (slow, I/O-bound). Subsequent queries read from local SSD (fast). It is enabled by default on SSD-backed instance types and needs no configuration there.
Adaptive Query Execution (AQE): the runtime adjusts query execution plans based on runtime statistics - for example, switching a sort-merge join to a broadcast join when a table turns out to be much smaller than the optimizer estimated. AQE is also part of open-source Spark 3.x; DBR enables it by default.
Auto-Optimize: DBR can apply optimized writes and auto compaction to Delta tables, merging small files after writes so query performance holds up without manual OPTIMIZE runs. Auto compaction does not Z-order data; ZORDER BY still has to be run explicitly as part of an OPTIMIZE command.
# Inspect Delta and cache settings in a notebook (`spark` is the notebook's SparkSession)
spark.conf.get("spark.databricks.delta.optimize.maxFileSize") # target file size for OPTIMIZE; default: 1GB
# Verify Delta Cache is enabled
spark.conf.get("spark.databricks.io.cache.enabled") # true on SSD-backed instance types
# Manually warm the SSD cache with a hot table's data
spark.sql("CACHE SELECT * FROM features.user_churn_features")
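To make the AQE join-switching behavior concrete, here is a minimal plain-Python sketch of the decision. It is not Spark's planner code. It assumes a 10 MB broadcast threshold (the open-source Spark default for spark.sql.autoBroadcastJoinThreshold); the function name choose_join_strategy and the byte counts are hypothetical.

```python
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # assumed: Spark's default 10 MB threshold


def choose_join_strategy(table_bytes: int,
                         threshold: int = BROADCAST_THRESHOLD_BYTES) -> str:
    """Pick a join strategy from the size of the smaller join side."""
    return "broadcast_hash_join" if table_bytes <= threshold else "sort_merge_join"


# Planning time: the optimizer only has an estimate of the filtered table.
estimated_bytes = 2 * 1024 ** 3   # hypothetical estimate: 2 GB
planned = choose_join_strategy(estimated_bytes)

# Run time: once the upstream stage finishes, the real size is known.
observed_bytes = 4 * 1024 ** 2    # hypothetical measurement: 4 MB
replanned = choose_join_strategy(observed_bytes)

print(f"planned={planned}, replanned={replanned}")
```

The point of the sketch is only that the same rule gives a different answer once real sizes replace estimates, which is what AQE exploits between stages.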
Delta Live Tables - Declarative ETL with Built-In Quality
Delta Live Tables (DLT) is a declarative ETL framework. Instead of writing imperative Spark code that reads, transforms, and writes data, you declare what each table should contain. DLT handles execution, dependency resolution, error handling, and retries.
# DLT pipeline definition (save as a notebook, reference in a DLT pipeline)
import dlt
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# Layer 1: Raw ingestion with schema enforcement
@dlt.table(
    name="raw_user_events",
    comment="Raw user events from Kinesis Firehose",
    table_properties={"quality": "bronze"}
)
def raw_user_events():
    return (
        spark.readStream
        .format("cloudFiles")  # Auto Loader - incremental S3 ingestion
        .option("cloudFiles.format", "json")
        .option("cloudFiles.inferColumnTypes", "true")
        .load("s3://my-data-lake/raw/user-events/")
    )
# Layer 2: Validated and cleaned events with EXPECT constraints
@dlt.table(
    name="cleaned_user_events",
    comment="Validated user events - bronze → silver",
    table_properties={"quality": "silver"}
)
@dlt.expect_or_drop("valid_user_id", "user_id IS NOT NULL AND user_id > 0")
@dlt.expect_or_drop("valid_event_type", "event_type IN ('view', 'click', 'purchase', 'session_end')")
# DLT has no built-in quarantine decorator: record violations here and keep the rows.
# To quarantine, define a second table that keeps only the rows failing this rule.
@dlt.expect("valid_revenue", "revenue IS NULL OR revenue >= 0")
def cleaned_user_events():
    return (
        dlt.read_stream("raw_user_events")
        .withColumn("event_date", F.to_date("event_timestamp"))
        .withColumn("revenue", F.coalesce(F.col("revenue"), F.lit(0.0)))
        .drop("_metadata")
    )
# Layer 3: Daily aggregated features (gold)
@dlt.table(
    name="daily_user_features",
    comment="Daily user-level aggregated features for ML",
    table_properties={"quality": "gold"},
    partition_cols=["event_date"]
)
@dlt.expect("positive_event_count", "daily_events > 0")
def daily_user_features():
    return (
        dlt.read("cleaned_user_events")
        .groupBy("user_id", "event_date")
        .agg(
            F.count("*").alias("daily_events"),
            F.sum(F.when(F.col("event_type") == "purchase", 1).otherwise(0)).alias("daily_purchases"),
            F.sum("revenue").alias("daily_revenue"),
            F.countDistinct("session_id").alias("unique_sessions")
        )
    )
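DLT works out the execution order of these three tables from the dlt.read() and dlt.read_stream() references between them. The idea can be sketched in plain Python as a topological sort. This illustrates the concept only and is not how DLT is implemented; the reads mapping is a hand-written stand-in for what DLT discovers by inspecting the pipeline code.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# table -> set of tables it reads from (hand-written for this illustration)
reads = {
    "raw_user_events": set(),
    "cleaned_user_events": {"raw_user_events"},
    "daily_user_features": {"cleaned_user_events"},
}

# static_order() yields every table after all of the tables it depends on
execution_order = list(TopologicalSorter(reads).static_order())
print(execution_order)
```

Because the order is derived from the references, adding a fourth table that reads daily_user_features needs no scheduling change: it lands after its input.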
EXPECT Constraint Modes
| Mode | What Happens on Violation |
|---|---|
| @dlt.expect | Record the violation as a metric, keep the row |
| @dlt.expect_or_drop | Drop rows that violate the constraint |
| @dlt.expect_or_fail | Fail the pipeline update if any row violates |
| Quarantine (a pattern, not a built-in decorator) | Route violating rows to a separate table that you define with the inverted constraint |
DLT tracks constraint metrics automatically - you can see what percentage of rows violated each constraint in the DLT pipeline UI.
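The three decorator modes differ only in what happens to a violating row. The sketch below models those semantics in plain Python on a list of dicts; apply_expectation and the mode names "warn", "drop", and "fail" are hypothetical stand-ins, not part of the dlt module.

```python
from typing import Callable

Row = dict
Predicate = Callable[[Row], bool]


def apply_expectation(rows: list[Row], predicate: Predicate,
                      mode: str) -> tuple[list[Row], int]:
    """Return (rows kept, violation count) for one expectation.

    mode: "warn" mirrors @dlt.expect, "drop" mirrors @dlt.expect_or_drop,
    "fail" mirrors @dlt.expect_or_fail.
    """
    violations = [r for r in rows if not predicate(r)]
    if mode == "warn":
        return rows, len(violations)
    if mode == "drop":
        return [r for r in rows if predicate(r)], len(violations)
    if mode == "fail":
        if violations:
            raise ValueError(f"{len(violations)} rows violated the expectation")
        return rows, 0
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical sample rows
events = [
    {"user_id": 1, "revenue": 9.99},
    {"user_id": None, "revenue": 5.00},
    {"user_id": 3, "revenue": -1.00},
]


def has_user_id(row: Row) -> bool:
    return row["user_id"] is not None


kept_warn, n_warn = apply_expectation(events, has_user_id, "warn")
kept_drop, n_drop = apply_expectation(events, has_user_id, "drop")
print(len(kept_warn), n_warn)
print(len(kept_drop), n_drop)
```

In both non-failing modes the violation count is still recorded, which is what feeds the per-constraint metrics in the pipeline UI.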
Databricks Feature Store
The Databricks Feature Store solves training-serving skew by ensuring the same feature computation logic is used for both training data and online prediction requests.
from databricks.feature_store import FeatureStoreClient, FeatureLookup
from pyspark.sql import functions as F
from pyspark.sql.window import Window
import mlflow
fs = FeatureStoreClient()
# --- Step 1: Create a feature table ---
# Compute rolling window features
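# Calendar-day windows: order by a day number so rangeBetween counts days, not rows.
# A rowsBetween frame would reach back more than 7 days for users with inactive days.
day_number = F.datediff(F.col("event_date"), F.lit("1970-01-01").cast("date"))
window_7d = Window.partitionBy("user_id").orderBy(day_number).rangeBetween(-6, 0)
window_30d = Window.partitionBy("user_id").orderBy(day_number).rangeBetween(-29, 0)
window_90d = Window.partitionBy("user_id").orderBy(day_number).rangeBetween(-90, 0)
daily_events = spark.table("features.daily_user_features")
features_df = (
    daily_events
    .withColumn("events_7d", F.sum("daily_events").over(window_7d))
    .withColumn("events_30d", F.sum("daily_events").over(window_30d))
    .withColumn("revenue_30d", F.sum("daily_revenue").over(window_30d))
    .withColumn("days_since_purchase",
        F.datediff(
            F.col("event_date"),
            # most recent purchase date in the window; NULL if there was none
            F.last(F.when(F.col("daily_purchases") > 0, F.col("event_date")), ignorenulls=True)
            .over(window_90d)
        )
    )
    # keep yesterday's snapshot: one row per user who was active yesterday
    .filter(F.col("event_date") == F.date_sub(F.current_date(), 1))
    .select("user_id", "events_7d", "events_30d", "revenue_30d", "days_since_purchase")
)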
# Register the feature table in the Feature Store
fs.create_table(
    name="features.user_churn_features",
    primary_keys=["user_id"],
    df=features_df,
    description="Rolling window features for churn prediction model"
)
# Write features to the feature table
fs.write_table(
    name="features.user_churn_features",
    df=features_df,
    mode="merge"  # upsert by primary key
)
# --- Step 2: Create training dataset using Feature Store lookup ---
# Start with label data (which users churned)
labels_df = spark.sql("""
SELECT user_id, churned FROM labels.churn_labels
WHERE label_date = CURRENT_DATE() - INTERVAL 1 DAY
""")
# Define feature lookups - Feature Store handles the join
feature_lookups = [
    FeatureLookup(
        table_name="features.user_churn_features",
        feature_names=["events_7d", "events_30d", "revenue_30d", "days_since_purchase"],
        lookup_key="user_id"
    )
]
# Create training set - logs the exact feature versions used
with mlflow.start_run():
    training_set = fs.create_training_set(
        df=labels_df,
        feature_lookups=feature_lookups,
        label="churned",
        exclude_columns=[]
    )
    training_df = training_set.load_df()
    # Train model
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score
    import pandas as pd
    pdf = training_df.toPandas()
    feature_cols = ["events_7d", "events_30d", "revenue_30d", "days_since_purchase"]
    X_train, X_test, y_train, y_test = train_test_split(
        pdf[feature_cols], pdf["churned"], test_size=0.2, random_state=42
    )
    model = GradientBoostingClassifier(n_estimators=200, max_depth=5, random_state=42)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_metric("roc_auc", auc)
    # Log model WITH Feature Store metadata - enables automatic feature lookup at serving time
    fs.log_model(
        model=model,
        artifact_path="churn_model",
        flavor=mlflow.sklearn,
        training_set=training_set,
        registered_model_name="churn_classifier"
    )
    print(f"Model AUC: {auc:.4f}")
# --- Step 3: Batch scoring with automatic feature join ---
# The Feature Store automatically retrieves features from the feature table
# using only the user_ids - no manual join required
users_to_score = spark.sql("""
SELECT user_id FROM dim.active_users WHERE is_active = TRUE
""")
# Scoring: Feature Store auto-joins features from the feature table
predictions = fs.score_batch(
    model_uri="models:/churn_classifier/Production",
    df=users_to_score,
    result_type="double"
)
predictions.write.mode("overwrite").saveAsTable("predictions.churn_scores")
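Rolling features such as events_7d depend on whether the window frame counts rows or calendar days. With one row per active day, a frame of "the last 7 rows" reaches further back than 7 days for any user with inactive days. The plain-Python sketch below compares both definitions on hypothetical data; rolling_sum_rows and rolling_sum_days are illustrative helpers, not Spark APIs.

```python
from datetime import date, timedelta

# Hypothetical daily event counts for one user; inactive days have no row.
daily = [
    (date(2026, 3, 1), 5),
    (date(2026, 3, 2), 3),
    (date(2026, 3, 10), 4),
    (date(2026, 3, 12), 2),
]


def rolling_sum_rows(rows, n_rows):
    """Sum over the current row and the n_rows - 1 rows before it."""
    return [sum(v for _, v in rows[max(0, i - n_rows + 1): i + 1])
            for i in range(len(rows))]


def rolling_sum_days(rows, n_days):
    """Sum over rows whose date falls within the last n_days calendar days."""
    sums = []
    for day, _ in rows:
        start = day - timedelta(days=n_days - 1)
        sums.append(sum(v for d, v in rows if start <= d <= day))
    return sums


by_rows = rolling_sum_rows(daily, 7)
by_days = rolling_sum_days(daily, 7)
print(by_rows)  # [5, 8, 12, 14]
print(by_days)  # [5, 8, 4, 6]
```

In Spark, the calendar-day version corresponds to a rangeBetween frame ordered by a numeric day value, while rowsBetween gives the row-count version.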
MLflow on Databricks - Managed Experiment Tracking
Databricks includes a managed MLflow instance with the workspace. Every experiment run, parameter, metric, and artifact is automatically persisted to cloud storage and the MLflow tracking server.
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np
mlflow.set_experiment("/experiments/churn-model-v2")
mlflow.sklearn.autolog(log_input_examples=True, log_model_signatures=True)
# Experiment: compare GBM vs RandomForest
for model_class, params in [
    (GradientBoostingClassifier, {"n_estimators": 200, "max_depth": 5, "learning_rate": 0.05}),
    (GradientBoostingClassifier, {"n_estimators": 300, "max_depth": 6, "learning_rate": 0.03}),
    (RandomForestClassifier, {"n_estimators": 300, "max_depth": 10, "min_samples_leaf": 5}),
]:
    with mlflow.start_run(run_name=f"{model_class.__name__}"):
        model = model_class(**params)
        # Cross-validation
        cv_aucs = cross_val_score(model, X_train, y_train, cv=5,
                                  scoring="roc_auc", n_jobs=-1)
        mlflow.log_metric("cv_auc_mean", cv_aucs.mean())
        mlflow.log_metric("cv_auc_std", cv_aucs.std())
        # Final fit and test evaluation
        model.fit(X_train, y_train)
        test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        test_ap = average_precision_score(y_test, model.predict_proba(X_test)[:, 1])
        mlflow.log_metric("test_roc_auc", test_auc)
        mlflow.log_metric("test_avg_precision", test_ap)
        mlflow.log_params(params)
        # Feature importance
        if hasattr(model, "feature_importances_"):
            importance_df = pd.DataFrame({
                "feature": feature_cols,
                "importance": model.feature_importances_
            }).sort_values("importance", ascending=False)
            mlflow.log_table(importance_df, "feature_importance.json")
        print(f"{model_class.__name__}: AUC={test_auc:.4f}, AP={test_ap:.4f}")
MLflow Model Registry and Promotion
import mlflow
from mlflow.tracking import MlflowClient
client = MlflowClient()
# Register the best model version
best_run_id = "abc123" # from experiment comparison
model_uri = f"runs:/{best_run_id}/churn_model"
mv = mlflow.register_model(model_uri, "churn_classifier")
print(f"Registered version: {mv.version}")
# Add validation metrics as model version tags
client.set_model_version_tag("churn_classifier", mv.version, "test_auc", "0.891")
client.set_model_version_tag("churn_classifier", mv.version, "training_date", "2026-03-12")
# Transition to Staging
# Note: stage transitions are deprecated in recent MLflow releases in favor of
# model version aliases, and are not available for models registered in Unity Catalog.
client.transition_model_version_stage(
    name="churn_classifier",
    version=mv.version,
    stage="Staging",
    archive_existing_versions=False
)
# After validation, promote to Production
client.transition_model_version_stage(
    name="churn_classifier",
    version=mv.version,
    stage="Production",
    archive_existing_versions=True  # archive the previous Production version
)
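The "after validation" step before promoting to Production is usually an explicit gate in code. The sketch below is one hypothetical way to express it: should_promote and the 0.005 minimum improvement are assumptions for illustration, not part of MLflow. In practice the two AUC values would come from the tags or metrics logged for each model version.

```python
from typing import Optional


def should_promote(candidate_auc: float,
                   production_auc: Optional[float],
                   min_improvement: float = 0.005) -> bool:
    """Promote if nothing is in production yet, or the candidate is clearly better."""
    if production_auc is None:
        return True
    return candidate_auc - production_auc >= min_improvement


print(should_promote(0.891, 0.870))  # candidate is ahead by about 0.021
print(should_promote(0.891, 0.890))  # gain of about 0.001 is below the threshold
print(should_promote(0.891, None))   # no production model yet
```

Keeping the rule in a small function makes the promotion criterion reviewable and testable, separate from the registry calls that act on it.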
Databricks SQL - Serverless SQL Analytics
Databricks SQL (DBSQL) provides SQL warehouses, available as serverless, pro, or classic, powered by the Photon engine - a C++ vectorized query execution engine that runs significantly faster than standard Spark for SQL workloads.
-- Create a Databricks SQL dashboard query for churn monitoring
-- This runs on a SQL Warehouse (Photon), not a Spark cluster
SELECT
event_date,
subscription_tier,
COUNT(DISTINCT user_id) AS active_users,
AVG(events_30d) AS avg_events_30d,
AVG(churn_probability) AS avg_churn_risk,
SUM(CASE WHEN churn_probability > 0.7 THEN 1 ELSE 0 END) AS high_risk_users,
SUM(revenue_30d) AS total_revenue_30d
FROM features.user_churn_features f
LEFT JOIN predictions.churn_scores p USING (user_id)
WHERE f.event_date >= CURRENT_DATE() - INTERVAL 30 DAYS
GROUP BY 1, 2
ORDER BY 1 DESC, 4 DESC;
Photon is not a replacement for Spark - it is a specialized engine for vectorized SQL execution. Aggregations, joins, and window functions on large tables run 2-5x faster on Photon than on standard Spark SQL. Python UDFs do not benefit from Photon (they run in Python, not the vectorized C++ layer).
Unity Catalog - Three-Level Namespace and Governance
Unity Catalog is Databricks' governance layer. It introduces a three-level namespace: catalog.schema.table, unifying governance across all Databricks workspaces in an account.
-- Unity Catalog three-level namespace
-- catalog = top-level organizational boundary
-- schema = namespace within a catalog
-- table = the actual table
-- Create a catalog for the ML platform
CREATE CATALOG IF NOT EXISTS ml_platform;
-- Create schemas within the catalog
CREATE SCHEMA IF NOT EXISTS ml_platform.features;
CREATE SCHEMA IF NOT EXISTS ml_platform.training_data;
CREATE SCHEMA IF NOT EXISTS ml_platform.predictions;
-- Grant access at catalog, schema, or table level
GRANT USE CATALOG ON CATALOG ml_platform TO `data-science-group`;
GRANT USE SCHEMA ON SCHEMA ml_platform.features TO `data-science-group`;
GRANT SELECT ON TABLE ml_platform.features.user_churn_features TO `ml-engineer-group`;
-- Column masking: non-privileged users see a redacted email instead of the real value
CREATE OR REPLACE FUNCTION ml_platform.masks.mask_email(email STRING)
RETURNS STRING
RETURN CASE
    WHEN is_account_group_member('pii-access-group') THEN email
    ELSE CONCAT(LEFT(email, 2), '****@****.com')
END;
ALTER TABLE ml_platform.raw.users
ALTER COLUMN email SET MASK ml_platform.masks.mask_email;
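The masking function is a plain conditional, which makes it easy to check outside the warehouse. Below is a Python mirror of the same logic, as a sketch for reasoning about the output format; the is_privileged flag stands in for the is_account_group_member('pii-access-group') check and is an assumption of this example.

```python
def mask_email(email: str, is_privileged: bool) -> str:
    """Return the real email for privileged callers, a redacted form otherwise."""
    if is_privileged:
        return email
    # Same shape as the SQL: first two characters, then a fixed redaction
    return email[:2] + "****@****.com"


print(mask_email("alice@example.com", is_privileged=True))   # alice@example.com
print(mask_email("alice@example.com", is_privileged=False))  # al****@****.com
```

Note that the mask still leaks the first two characters of the address; whether that is acceptable depends on the PII policy in force.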
Unity Catalog Data Lineage
Unity Catalog automatically captures column-level lineage - you can trace how a feature in the feature table was derived from raw event columns:
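# Query table lineage programmatically via the Unity Catalog lineage REST API
import requests
# `token` is assumed to hold a Databricks personal access token.
# An Azure-style host is shown; AWS workspaces use a *.cloud.databricks.com host.
# Get upstream lineage for the churn feature table
response = requests.get(
    "https://<workspace>.azuredatabricks.net/api/2.0/lineage-tracking/table-lineage",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "table_name": "ml_platform.features.user_churn_features",
        "include_entity_lineage": True
    }
)
response.raise_for_status()
lineage = response.json()
# Expected chain: raw_user_events → cleaned_user_events → daily_user_features → user_churn_features
for upstream in lineage.get("upstreams", []):
    info = upstream.get("tableInfo")  # entries may describe notebooks or jobs instead of tables
    if info:
        print(f"Upstream: {info.get('catalog_name')}.{info.get('schema_name')}.{info.get('name')}")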
Delta Sharing
Unity Catalog's Delta Sharing protocol lets you share live Delta Lake data across Databricks workspaces, Snowflake, BigQuery, and any Delta Sharing-compatible client - without copying data:
-- Create a share in Unity Catalog
CREATE SHARE feature_store_share
COMMENT 'Churn prediction features for external data science team';
-- Add tables to the share
ALTER SHARE feature_store_share
ADD TABLE ml_platform.features.user_churn_features
PARTITION (event_date >= '2026-01-01');
-- Create a recipient in another Databricks workspace (Databricks-to-Databricks sharing)
-- USING ID takes the recipient metastore's sharing identifier, not an activation URL
CREATE RECIPIENT data_science_team
USING ID '<sharing-identifier>';
-- Grant access
GRANT SELECT ON SHARE feature_store_share TO RECIPIENT data_science_team;
Cluster Types - Cost Implications
Databricks has three cluster types with very different cost profiles:
All-Purpose Clusters (Interactive Development)
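All-purpose clusters are persistent - they keep running until you terminate them or an auto-termination timeout fires. They are designed for interactive work in notebooks. They are also the most expensive option: all-purpose compute is billed at a higher DBU rate than jobs compute, and idle time between commands is billed too.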
# Configure an all-purpose cluster via the Databricks API
import requests
cluster_config = {
    "cluster_name": "ml-dev-cluster",
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 4,
    "autotermination_minutes": 30,  # terminate after 30 min of inactivity
    "spark_conf": {
        "spark.databricks.delta.preview.enabled": "true",
        "spark.sql.adaptive.enabled": "true"
    },
    "custom_tags": {"team": "ml", "env": "dev"}
}
:::warning All-Purpose Clusters Left Running Overnight
The most common source of unexpected Databricks costs is an all-purpose cluster left running by a developer who went home. A 4-node i3.2xlarge cluster running 12 hours overnight costs approximately $60 in DBU charges plus EC2 charges. Set autotermination_minutes to 30-60 minutes for all development clusters and enforce it via a cluster policy.
:::
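The overnight figure in the warning is simple arithmetic, and it is worth being able to redo it for your own cluster. The sketch below shows the calculation; every rate in it is an assumption for illustration (check your contract and the current price list), and overnight_dbu_cost is a hypothetical helper.

```python
def overnight_dbu_cost(nodes: int, dbu_per_node_hour: float,
                       hours: float, usd_per_dbu: float) -> float:
    """DBU charge only; cloud instance (EC2) charges are billed separately."""
    return nodes * dbu_per_node_hour * hours * usd_per_dbu


# All rates below are assumptions, not quoted prices.
cost = overnight_dbu_cost(
    nodes=4,                # workers only; add 1 if the driver uses the same type
    dbu_per_node_hour=2.0,  # assumed DBU consumption of an i3.2xlarge
    hours=12,
    usd_per_dbu=0.55,       # assumed all-purpose compute rate
)
print(f"Estimated DBU charge: ${cost:.2f}")
```

With these assumed rates the estimate comes to about $53 for the workers alone, or about $66 once a driver node of the same type is counted, which brackets the approximate figure in the warning.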
Job Clusters (Automated Pipelines)
Job clusters are ephemeral - created when a job starts, terminated when it ends. Jobs compute is billed at a lower DBU rate than all-purpose compute, and job clusters pair well with spot instances because a failed batch task can simply be retried.
# Databricks Jobs API: schedule a feature engineering pipeline with a job cluster
job_config = {
    "name": "daily-feature-engineering",
    "tasks": [
        {
            "task_key": "feature_engineering",
            "notebook_task": {
                "notebook_path": "/pipelines/feature_engineering",
                # {{ds}} is an Airflow macro; inside a Databricks job use a dynamic value reference
                "base_parameters": {"feature_date": "{{job.start_time.iso_date}}"}
            },
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "i3.2xlarge",
                "num_workers": 8,
                "aws_attributes": {
                    "availability": "SPOT_WITH_FALLBACK",  # use spot, fall back to on-demand
                    "spot_bid_price_percent": 100
                }
            }
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # 2 AM daily
        "timezone_id": "UTC"
    }
}
SQL Warehouses (Analytics)
SQL warehouses are compute resources optimized for SQL queries, powered by Photon, and available in serverless, pro, and classic types. They autoscale based on query load and auto-stop when idle:
# Create a SQL warehouse via the API
warehouse_config = {
    "name": "ml-analytics-warehouse",
    "cluster_size": "Medium",  # 2X-Small, X-Small, Small, Medium, Large, X-Large, 2X-Large, 3X-Large, 4X-Large
    "min_num_clusters": 1,
    "max_num_clusters": 4,  # autoscale for concurrent users
    "auto_stop_mins": 10,
    "enable_photon": True,
    "warehouse_type": "PRO"  # PRO or CLASSIC; serverless is a PRO warehouse with enable_serverless_compute set
}
Cost rule of thumb: for a given compute-hour budget, prioritize job clusters (spot instances for batch ML), SQL warehouses (Photon for analytics), and minimize all-purpose cluster time (most expensive, least efficient per query).
Databricks + Airflow Integration
Many teams use Airflow for orchestration but want to run their heavy compute on Databricks. The Databricks Airflow provider enables this:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.databricks.operators.databricks import (
    DatabricksRunNowOperator,
    DatabricksSubmitRunOperator
)
with DAG(
    dag_id="ml_feature_pipeline",
    start_date=datetime(2026, 1, 1),  # any fixed past date; days_ago() is deprecated
    schedule_interval="0 2 * * *",  # Airflow 2.4+ also accepts this as `schedule`
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)}
) as dag:
    # Trigger an existing Databricks Job by job_id (job must already exist in Databricks)
    run_feature_engineering = DatabricksRunNowOperator(
        task_id="run_feature_engineering",
        databricks_conn_id="databricks_default",
        job_id=12345,  # Databricks Job ID
        notebook_params={"feature_date": "{{ ds }}"}
    )
    # Submit an ad hoc notebook run (no pre-configured job needed)
    run_model_training = DatabricksSubmitRunOperator(
        task_id="run_model_training",
        databricks_conn_id="databricks_default",
        new_cluster={
            "spark_version": "15.4.x-gpu-ml-scala2.12",  # GPU nodes need a GPU-enabled ML runtime
            "node_type_id": "g4dn.xlarge",  # GPU instance for model training
            "num_workers": 2
        },
        notebook_task={
            "notebook_path": "/ml/train_churn_model",
            "base_parameters": {"model_version": "v{{ ds_nodash }}"}
        }
    )
    run_feature_engineering >> run_model_training
Cost Optimization
Cluster Policies
Cluster policies enforce maximum cost guardrails for all clusters created in the workspace:
{
    "autotermination_minutes": {
        "type": "fixed",
        "value": 30
    },
    "aws_attributes.availability": {
        "type": "fixed",
        "value": "SPOT_WITH_FALLBACK"
    },
    "num_workers": {
        "type": "range",
        "minValue": 1,
        "maxValue": 10,
        "defaultValue": 4
    }
}
Apply a cluster policy to all interactive cluster creation - this prevents engineers from accidentally launching a 20-node on-demand cluster for debugging.
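To see what a policy like this rules out, here is a simplified plain-Python model of policy evaluation for the two rule types used above ("fixed" and "range"). It is a sketch, not Databricks' implementation: the real service also applies defaults and supports more rule types. The helper names get_path and policy_violations are hypothetical.

```python
from typing import Any

policy = {
    "autotermination_minutes": {"type": "fixed", "value": 30},
    "aws_attributes.availability": {"type": "fixed", "value": "SPOT_WITH_FALLBACK"},
    "num_workers": {"type": "range", "minValue": 1, "maxValue": 10},
}


def get_path(config: dict, dotted_path: str) -> Any:
    """Look up a dotted path such as 'aws_attributes.availability'."""
    current: Any = config
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def policy_violations(cluster: dict, rules: dict) -> list[str]:
    """Return human-readable violations; an empty list means the config passes."""
    problems = []
    for path, rule in rules.items():
        value = get_path(cluster, path)
        if rule["type"] == "fixed" and value != rule["value"]:
            problems.append(f"{path} must be {rule['value']!r}, got {value!r}")
        elif rule["type"] == "range":
            if value is None or not rule["minValue"] <= value <= rule["maxValue"]:
                problems.append(
                    f"{path} must be between {rule['minValue']} and "
                    f"{rule['maxValue']}, got {value!r}"
                )
    return problems


# Hypothetical request: a 20-node on-demand debugging cluster
oversized = {
    "autotermination_minutes": 30,
    "aws_attributes": {"availability": "ON_DEMAND"},
    "num_workers": 20,
}
problems = policy_violations(oversized, policy)
for problem in problems:
    print(problem)
```

The oversized request fails on two counts, availability and worker count, which is the guardrail the policy is meant to provide.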
Instance Pools
Instance pools keep a set of idle, pre-provisioned instances that clusters can claim quickly - cutting most of the 3-5 minute cluster startup time without paying for always-on clusters:
# Create an instance pool of pre-allocated spot instances
pool_config = {
    "instance_pool_name": "ml-spot-pool",
    "node_type_id": "i3.xlarge",
    "aws_attributes": {
        "availability": "SPOT",
        "spot_bid_price_percent": 100
    },
    "min_idle_instances": 5,
    "max_capacity": 50,
    "idle_instance_autotermination_minutes": 30
}
Clusters attached to an instance pool typically start much faster, often in well under a minute when idle instances are available. Idle pool instances still incur cloud provider (EC2) charges but no DBU charges, so they cost far less than a running cluster - and the startup time improvement is significant for interactive ML development.
:::danger Unity Catalog Is Not Backward-Compatible With Hive Metastore
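When you enable Unity Catalog in a Databricks workspace, tables created in Unity Catalog use a three-level namespace (catalog.schema.table). Existing Hive Metastore tables stay reachable under the hive_metastore catalog, but notebooks and jobs that use two-level names (schema.table) resolve them against the current default catalog, so they can break or read a different table once that default changes. Migrate incrementally: move new tables to Unity Catalog first, then upgrade existing tables (for example with the SYNC command or Databricks' migration tooling) and update references to three-level names.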
:::
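The migration risk comes down to how a two-level name is expanded. The sketch below is a simplified model of that resolution in plain Python; qualify is a hypothetical helper, and it ignores one-part names (which also use the current schema) and backtick-quoted identifiers.

```python
def qualify(table_name: str, current_catalog: str) -> str:
    """Expand a table reference to catalog.schema.table."""
    parts = table_name.split(".")
    if len(parts) == 3:
        return table_name  # already fully qualified
    if len(parts) == 2:
        return f"{current_catalog}.{table_name}"
    raise ValueError(
        f"expected schema.table or catalog.schema.table, got {table_name!r}"
    )


# The same two-level reference points at different tables
# depending on the session's current catalog.
legacy = qualify("features.user_churn_features", current_catalog="hive_metastore")
migrated = qualify("features.user_churn_features", current_catalog="ml_platform")
explicit = qualify("ml_platform.features.user_churn_features",
                   current_catalog="hive_metastore")
print(legacy)
print(migrated)
print(explicit)
```

Fully qualified names are immune to the default changing, which is why updating references to three-level names is part of the migration.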
:::warning Delta Live Tables Do Not Support Arbitrary Spark Code
DLT pipelines are declarative - each function defines a table, not a sequence of operations. You cannot run arbitrary Spark imperative code inside DLT. Side effects (writing to external systems, sending notifications) must happen in separate Databricks Jobs triggered by DLT pipeline completion events. Teams that try to embed operational logic inside DLT pipelines consistently hit this limitation.
:::
Interview Q&A
Q1: Explain the difference between a Databricks all-purpose cluster, a job cluster, and a SQL warehouse. When would you use each?
All-purpose clusters are persistent, interactive clusters designed for notebook-based development. They support both Python and SQL, can run arbitrary Spark code, and maintain state between commands (variables, cached DataFrames). They are the most expensive option because they run continuously until terminated. Use them for: data exploration, debugging pipelines, iterative model development where you need fast iteration without cluster startup delays.
Job clusters are ephemeral - they start when a job runs and terminate when the job completes. They are cheaper per compute-unit (30-40% less than all-purpose) and are the right choice for scheduled production pipelines (daily feature engineering, nightly model training). Combined with spot instances, they can run at 60-70% of on-demand costs.
SQL warehouses are compute resources optimized specifically for SQL queries (offered as serverless, pro, or classic), powered by the Photon C++ execution engine. They autoscale based on concurrent query load, auto-stop when idle, and provide much faster SQL performance than Spark SQL on an all-purpose cluster. Use them for: BI tool connections (Tableau, Power BI), analyst SQL queries, and any workload that is pure SQL and does not require Python Spark code.
Q2: What is Delta Live Tables and how is it different from writing a regular Spark ETL pipeline?
A regular Spark ETL pipeline is imperative: you write a sequence of read → transform → write operations. You are responsible for handling job failures (what to restart), dependency ordering (which jobs run before which), data quality checks (write your own assertions), and monitoring (set up your own logging).
Delta Live Tables is declarative: you define what each table should contain using annotated Python functions. DLT handles dependency resolution (it infers the execution order from dlt.read() references), retries failed pipelines from the last successful checkpoint, enforces data quality constraints (@dlt.expect), and provides built-in monitoring of constraint violations, pipeline health, and data freshness.
The practical difference: a DLT pipeline that produces 5 output tables requires approximately 5 function definitions. The equivalent imperative Spark pipeline requires job scheduling, checkpoint management, retry logic, quality check code, and alerting - easily 5-10x more code. The trade-off is that DLT restricts what you can do inside each function - side effects and imperative logic are not allowed.
Q3: How does the Databricks Feature Store prevent training-serving skew?
Training-serving skew occurs when features are computed differently at training time vs. serving time. The classic failure: training uses batch-computed 30-day rolling revenue stored in a feature table, but the serving endpoint computes it on-the-fly from a different database with a slightly different query - producing systematically different values.
The Databricks Feature Store prevents this by decoupling feature definition from feature consumption. You define features once in a feature table using a specific computation. When you train a model using fs.create_training_set(), the Feature Store logs the exact feature table versions used for training. When you serve predictions using fs.score_batch() or a real-time endpoint, the Feature Store automatically retrieves features from the same feature table using the same definition.
The model artifact registered via fs.log_model() includes metadata about which feature tables it depends on and which columns it reads. This makes the lineage explicit and auditable - you can always trace a model's predictions back to the exact features it consumed.
Q4: What is Unity Catalog's three-level namespace and why is it an improvement over Hive Metastore?
Hive Metastore uses a two-level namespace: database.table. In Databricks without Unity Catalog, each workspace has its own Hive Metastore - so ml_workspace.features.user_features and analytics_workspace.features.user_features are completely separate tables with separate governance. Cross-workspace access requires either duplicating data or complex sharing configurations.
Unity Catalog introduces a three-level namespace: catalog.schema.table. A catalog is an account-level resource shared across all workspaces. ml_platform.features.user_churn_features is accessible from any workspace in the Databricks account that has been granted access - no data movement, no duplication.
This enables centralized governance at the account level: a single audit log for all data access across all workspaces, unified column-level security policies, and data lineage that spans workspace boundaries. For ML platforms where data engineers (in one workspace) feed features to ML engineers (in another workspace), Unity Catalog eliminates the shared-Hive-Metastore hack or S3-path-based data passing that teams previously used.
Q5: A DLT pipeline keeps failing at the silver layer with quarantined records. How do you investigate and fix this?
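First, check the DLT pipeline UI's Data Quality tab. It shows, per constraint, the count and percentage of rows that violated each EXPECT rule and what action was taken (recorded, dropped, or failed). This identifies which constraint is triggering and at what rate.
Next, look at the rejected rows themselves. DLT has no built-in quarantine decorator, so this relies on the quarantine pattern: a second table in the pipeline that reads the same source and keeps only the rows failing the constraints. Assuming that table is named cleaned_user_events_quarantine (a hypothetical name) and is published to the pipeline's target schema, a breakdown by event type and missing user_id shows what is being rejected:
SELECT event_type, user_id IS NULL AS missing_user_id, COUNT(*) AS rejected_rows
FROM <target_schema>.cleaned_user_events_quarantine
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 20;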
Common root causes: an upstream schema change introduced a new event type not covered by the valid_event_type constraint, a data source is sending NULL user_ids for a new client SDK version, or a revenue field changed from INTEGER to DECIMAL causing type mismatch.
Fixes: update the constraint to cover the new event types, add a coalesce for NULLs upstream, or adjust the schema expectation. DLT pipelines can be updated and resumed without full reprocessing - only new data is processed after the fix.
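The quarantine pattern itself is a split on the constraint predicates: rows passing every rule go to the clean table, and the rest go to a side table along with the names of the rules they failed. Here is a minimal plain-Python sketch of that split; split_valid_and_quarantine, the failed_rules field, and the sample rows are hypothetical, and a real pipeline would express the same thing as two DLT tables.

```python
from collections import Counter


def split_valid_and_quarantine(rows, rules):
    """Split rows into (valid, quarantined); quarantined rows record failed rules."""
    valid, quarantined = [], []
    for row in rows:
        failed = [name for name, check in rules.items() if not check(row)]
        if failed:
            quarantined.append({**row, "failed_rules": failed})
        else:
            valid.append(row)
    return valid, quarantined


rules = {
    "valid_user_id": lambda r: r["user_id"] is not None and r["user_id"] > 0,
    "valid_event_type": lambda r: r["event_type"] in {"view", "click", "purchase", "session_end"},
}

events = [
    {"user_id": 1, "event_type": "click"},
    {"user_id": None, "event_type": "view"},       # e.g. a client SDK sending NULL ids
    {"user_id": 7, "event_type": "add_to_cart"},   # hypothetical new event type
]

valid, quarantined = split_valid_and_quarantine(events, rules)

# Count failures per rule to see which constraint is rejecting the most rows
failure_counts = Counter(rule for row in quarantined for rule in row["failed_rules"])
print(len(valid), len(quarantined))
print(dict(failure_counts))
```

Recording the failed rule names on each quarantined row is what makes the root-cause breakdown a one-line aggregation.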
Q6: When would you NOT choose Databricks for a data engineering project?
Databricks has significant overhead costs - the platform adds a DBU (Databricks Unit) charge on top of the underlying cloud compute. For very simple ETL jobs (a daily Python script that copies 1 GB of data), Databricks is overkill. A Lambda function, a simple Glue job, or even a scheduled EC2 script would cost a fraction of the Databricks DBU cost.
Databricks is also not the right choice when the team is primarily SQL-native and not comfortable with Spark or Python. Snowflake or BigQuery will provide a better developer experience for SQL-first teams.
For pure streaming applications with very low latency requirements (sub-100ms event processing), Spark Structured Streaming on Databricks is usually a poor fit: in its default micro-batch mode, end-to-end latency is typically in the hundreds of milliseconds or more, which is higher than specialized streaming systems such as Apache Flink or Google Dataflow.
Finally, for organizations with existing, functioning AWS Glue or Google Dataflow pipelines - the migration cost to Databricks needs to be justified by clear productivity gains. Databricks makes sense when you need the full lakehouse + ML lifecycle stack and are willing to pay the platform premium for unified tooling.
