Skip to main content

:::tip 🎮 Interactive Playground Visualize this concept: Try the Data Quality Checks demo on the EngineersOfAI Playground - no code required. :::

dbt Advanced Patterns for ML Teams

import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem';

Reading time: 28 min | Relevance: High for Analytics Engineers, ML Engineers, Data Engineers | Level: Advanced


The Hidden Leakage Problem

The senior analytics engineer at a health-tech company has 300 dbt models in production. She has been using dbt for two years, her tests all pass, and her data catalog is beautifully documented. Then the ML team calls her with a problem.

Their patient risk model was trained on features computed from the current state of patient records - demographics, diagnoses, medications - joined against historical outcomes (hospital readmission within 30 days). The model looked excellent in offline evaluation: AUC 0.83. But deployed in production, AUC dropped to 0.71. The offline/online gap is alarming, and the ML team suspects feature leakage.

She digs in. The feature table was built with a straightforward incremental dbt model: for each patient and each outcome event, it joins the patient profile table to get demographics. The patient profile table reflects the current state of the patient record - not the state it was in at the time of the outcome event. A patient who updated their address, added a new diagnosis, or changed medications after the outcome event is being joined on their future state. The model trained on information that would not have been available at prediction time. This is leakage, and it is invisible to standard dbt tests.

Fixing this requires understanding Slowly Changing Dimensions, dbt snapshots, and point-in-time correct joins - three concepts that separate a basic dbt user from an expert one. The same root cause affects recommendation models (using updated product metadata), churn models (using updated customer segments), and fraud models (using updated account status). Any ML feature derived from a table that changes over time and is joined without temporal awareness is potentially leaky.

This lesson covers the advanced patterns that eliminate these problems, along with techniques for scaling a 300-model project without a 4-hour run time.


Why These Patterns Exist

Standard SQL databases and data warehouses show you the current state of the world. If a customer updated their email address, the customers table shows the new email - the old one is gone. For reporting and business intelligence, this is usually fine. For ML, it is catastrophic: your training data must represent what the world looked like at the moment a prediction would have been made, not what it looks like today.

The technical term for this is point-in-time correctness (also called temporal correctness or "as-of" queries). Building a point-in-time correct training dataset requires that you have tracked historical states of mutable entities - every version of a customer record, every change to a product attribute. This is the problem that Slowly Changing Dimensions (SCD) and specifically dbt snapshots solve.


Historical Context

The concept of Slowly Changing Dimensions was formalized by Ralph Kimball in the 1990s as part of dimensional modeling. SCD Type 2 - keeping a full history of all changes with valid_from and valid_to dates - has been the standard for data warehousing for three decades. What changed with dbt is that the implementation of SCD2 became first-class and declarative, rather than something each team implemented inconsistently in stored procedures.

dbt Packages emerged around 2019 when the dbt community realized they were solving the same problems repeatedly. The dbt-utils package (contributed by dbt Labs) and dbt-expectations (a wrapper around Great Expectations, contributed by the community) became the two most-used packages. By 2023, the dbt Package Hub listed over 200 community packages.


Advanced Jinja: Custom Macros

The safe_divide Macro

Division by zero is one of the most common causes of NULL values in feature tables. Wrap it:

-- macros/safe_divide.sql

{% macro safe_divide(numerator, denominator) %}
CASE
WHEN {{ denominator }} = 0 OR {{ denominator }} IS NULL
THEN NULL
ELSE {{ numerator }}::float / {{ denominator }}
END
{% endmacro %}

Use in a model:

-- models/marts/ml/fct_user_engagement.sql

SELECT
user_id,
sessions_30d,
conversions_30d,
{{ safe_divide('conversions_30d', 'sessions_30d') }} AS conversion_rate_30d,
{{ safe_divide('revenue_30d', 'sessions_30d') }} AS revenue_per_session_30d
FROM {{ ref('int_user_session_stats') }}

Custom Schema Routing: generate_schema_name

In large projects with multiple teams, you want different model groups to land in different warehouse schemas without hardcoding. Override dbt's built-in macro:

-- macros/generate_schema_name.sql
-- Overrides dbt's default schema naming behavior

{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}

{%- if custom_schema_name is none -%}
{{ default_schema }}

{%- elif target.name == 'prod' -%}
{# In prod: use the custom schema name directly (no prefix) #}
{{ custom_schema_name | trim }}

{%- else -%}
{# In dev/staging: prefix with developer name to avoid collisions #}
{{ default_schema }}_{{ custom_schema_name | trim }}

{%- endif -%}
{%- endmacro %}

Then in dbt_project.yml:

models:
ecomm_dbt:
marts:
ml:
+schema: ml_features # prod: ml_features, dev: amara_ml_features
reporting:
+schema: reporting

dbt Packages

Installing Packages

# packages.yml
packages:
- package: dbt-labs/dbt_utils
version: 1.1.1
- package: calogica/dbt_expectations
version: 0.10.0
- package: dbt-labs/dbt_project_evaluator
version: 0.8.0

Run dbt deps to install. Packages download into dbt_packages/ - add to .gitignore.

dbt-utils: The Standard Library

The most important macros from dbt-utils:

surrogate_key - Generate a consistent surrogate key from multiple columns:

SELECT
{{ dbt_utils.generate_surrogate_key(['user_id', 'event_date']) }} AS pk,
user_id,
event_date,
sessions
FROM {{ ref('int_daily_sessions') }}

pivot - Turn rows into columns without dynamic SQL:

{{ dbt_utils.pivot(
column='product_category_code',
agg='SUM',
then_value='total_amount_usd',
values=['electronics', 'clothing', 'books', 'home'],
alias=True,
prefix='revenue_'
) }}

date_spine: Filling Time Series Gaps

This is critical for ML training data. If a user had no purchases on a given day, a naive aggregation will simply have no row for that user+day. ML models trained on sparse data without explicit zeros will learn that "no data" means "no activity" - but only implicitly. Making zeros explicit produces cleaner features.

date_spine generates a complete series of dates with no gaps:

-- models/intermediate/int_daily_user_features.sql
-- Produces one row per user per day, with zeros for days with no activity

{{
config(materialized='table')
}}

WITH date_spine AS (
{{ dbt_utils.date_spine(
datepart="day",
start_date="cast('2023-01-01' as date)",
end_date="cast(current_date as date)"
) }}
),

users AS (
SELECT DISTINCT customer_id
FROM {{ ref('stg_orders') }}
WHERE created_at_utc >= '2023-01-01'
),

-- Cross join: every user x every date = complete sparse frame
user_dates AS (
SELECT
u.customer_id,
d.date_day
FROM users u
CROSS JOIN date_spine d
),

daily_orders AS (
SELECT
customer_id,
DATE(created_at_utc) AS order_date,
COUNT(*) AS orders,
SUM(total_amount_usd) AS revenue
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'
GROUP BY 1, 2
)

SELECT
ud.customer_id,
ud.date_day,
COALESCE(do.orders, 0) AS orders, -- explicit zero, not NULL
COALESCE(do.revenue, 0) AS revenue -- explicit zero, not NULL
FROM user_dates ud
LEFT JOIN daily_orders do
ON ud.customer_id = do.customer_id
AND ud.date_day = do.order_date

:::tip Why zeros matter for ML Tree-based models (XGBoost, LightGBM) handle NULLs by routing them down a specific branch during training. If your training data has NULLs where there should be zeros, the model learns a "missing data" pattern rather than a "zero activity" pattern. When serving, actual zeros (which come through as real values, not NULLs) are routed differently, causing subtle prediction errors. Always fill gaps with explicit zeros. :::


dbt Snapshots: Slowly Changing Dimensions Type 2

A snapshot is dbt's mechanism for tracking historical changes to a mutable source table. Every time a snapshot runs, dbt checks for new or changed rows and appends them to a separate snapshot table with dbt_valid_from and dbt_valid_to timestamps.

Snapshot Configuration

-- snapshots/patient_profile_snapshot.sql

{% snapshot patient_profile_snapshot %}

{{
config(
target_schema='snapshots',
unique_key='patient_id',
strategy='timestamp',
updated_at='updated_at'
)
}}

SELECT
patient_id,
age_band,
gender,
primary_diagnosis_code,
medication_count,
insurance_plan_type,
zip_code,
updated_at
FROM {{ source('raw', 'patient_profiles') }}

{% endsnapshot %}

Run with dbt snapshot. The result is a table in snapshots.patient_profile_snapshot with these additional columns added by dbt:

ColumnMeaning
dbt_scd_idUnique identifier for this specific version of the record
dbt_updated_atWhen dbt detected this change
dbt_valid_fromWhen this version of the record became active
dbt_valid_toWhen this version was superseded (NULL = current active record)

Two Snapshot Strategies

strategy: timestamp (above): dbt compares the updated_at column. If it has changed, a new snapshot row is inserted. This is efficient but requires a reliable updated_at field on the source.

strategy: check: dbt hashes the specified columns and compares hashes. Use when the source has no reliable updated_at:

{% snapshot customer_profile_snapshot %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='check',
check_cols=['email', 'address_zip', 'plan_tier', 'is_premium']
)
}}
SELECT customer_id, email, address_zip, plan_tier, is_premium
FROM {{ source('raw', 'customers') }}
{% endsnapshot %}

Point-in-Time Correct Features: The Core Pattern

With a snapshot table in place, you can now build features that are correct as of any historical timestamp. The pattern uses an "as-of join":

-- models/marts/ml/fct_patient_risk_features.sql
-- Point-in-time correct: join patient profile as it existed AT the time of each outcome event

{{
config(materialized='table')
}}

WITH outcome_events AS (
-- The events we want to predict: hospital readmissions
SELECT
patient_id,
admission_date,
readmitted_within_30d -- label
FROM {{ ref('stg_hospital_admissions') }}
),

-- The patient profile as it existed at the time of each admission
-- Using the SCD2 snapshot table
patient_profile_at_time AS (
SELECT
oe.patient_id,
oe.admission_date,
oe.readmitted_within_30d,

-- Profile features AS OF the admission_date, not current state
pp.age_band,
pp.gender,
pp.primary_diagnosis_code,
pp.medication_count,
pp.insurance_plan_type,
pp.zip_code
FROM outcome_events oe
LEFT JOIN {{ ref('patient_profile_snapshot') }} pp
ON oe.patient_id = pp.patient_id
-- The admission must fall within the valid window of this profile version
AND oe.admission_date >= pp.dbt_valid_from
AND (
oe.admission_date < pp.dbt_valid_to
OR pp.dbt_valid_to IS NULL -- current active record
)
)

SELECT * FROM patient_profile_at_time

This join ensures that for each outcome event you use the patient profile version that was active at that time - not the current version. The OR pp.dbt_valid_to IS NULL condition captures the current active record for events happening now.

:::danger The most common leakage mistake Joining on patient_id alone - without the temporal condition - always uses the current record. This looks correct in development (when there is only one row per patient in the source), but introduces leakage in production (where the snapshot table has multiple historical versions per patient). Always join snapshot tables with both the entity key AND the temporal window condition. :::


Advanced dbt Execution Diagram


Hooks: Pre/Post Execution Logic

Hooks let you run arbitrary SQL before or after model execution - useful for grants, audit logging, and calling stored procedures:

# dbt_project.yml

on-run-start:
- "CREATE TABLE IF NOT EXISTS audit.dbt_run_log (run_id TEXT, started_at TIMESTAMP, target TEXT)"
- "INSERT INTO audit.dbt_run_log VALUES ('{{ invocation_id }}', CURRENT_TIMESTAMP, '{{ target.name }}')"

on-run-end:
- "UPDATE audit.dbt_run_log SET completed_at = CURRENT_TIMESTAMP WHERE run_id = '{{ invocation_id }}'"

models:
ecomm_dbt:
marts:
+post-hook:
- "GRANT SELECT ON {{ this }} TO ROLE ml_training_role"
- "GRANT SELECT ON {{ this }} TO ROLE reporting_role"

The post-hook on the marts/ directory automatically grants SELECT to the ML training role on every mart model after it is created. Without this, your ML training job fails with a permissions error after any --full-refresh that drops and recreates the table.


Exposures: Documenting Downstream Consumers

Exposures document what consumes your dbt models - BI dashboards, ML models, APIs. This makes the lineage graph complete end-to-end:

# models/exposures.yml
version: 2

exposures:
- name: patient_risk_model_v3
type: ml_model
maturity: production
url: https://mlflow.internal/models/patient-risk/3
description: >
XGBoost model predicting 30-day hospital readmission risk.
Trained weekly on fct_patient_risk_features. Served via SageMaker.
owner:
name: ML Engineering Team
depends_on:
- ref('fct_patient_risk_features')

- name: clinical_ops_dashboard
type: dashboard
maturity: production
url: https://looker.internal/dashboards/clinical-ops
owner:
name: Clinical Analytics
depends_on:
- ref('fct_patient_risk_features')
- ref('dim_patient_360')

When you view fct_patient_risk_features in dbt docs, you see both its upstream sources and its downstream consumers - the ML model and the dashboard. When a breaking change is proposed to this model, you know exactly what will be affected.


dbt-expectations: ML-Grade Data Tests

dbt-expectations wraps Great Expectations into dbt-style YAML tests, providing richer validation than built-in tests:

# models/marts/ml/schema.yml

models:
- name: fct_patient_risk_features
tests:
- dbt_expectations.expect_table_row_count_to_be_between:
min_value: 10000 # at least 10k training examples
max_value: 10000000
- dbt_expectations.expect_table_columns_to_match_ordered_list:
column_list:
- patient_id
- admission_date
- age_band
- gender
- primary_diagnosis_code
- medication_count
- insurance_plan_type
- readmitted_within_30d

columns:
- name: medication_count
tests:
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0
max_value: 100 # more than 100 is almost certainly a data error
- dbt_expectations.expect_column_mean_to_be_between:
min_value: 1.0
max_value: 20.0 # realistic population mean

- name: readmitted_within_30d
tests:
- accepted_values:
values: [0, 1] # must be binary label

These tests catch data quality issues that destroy ML model performance: features drifting outside expected ranges, wrong column order breaking downstream sklearn pipelines, label columns with unexpected values.


Slim CI: Running Only What Changed

At 300 models, dbt run takes 4 hours. Running the full project on every PR is not viable. Slim CI solves this by using the dbt state concept:

# In production: after a successful dbt build, upload the manifest
dbt build
aws s3 cp target/manifest.json s3://my-bucket/dbt-state/prod/manifest.json

In CI on every PR:

# Download the last production manifest
aws s3 cp s3://my-bucket/dbt-state/prod/manifest.json previous_manifest/manifest.json

# Run only models that have changed (and their downstream dependents)
dbt build \
--select state:modified+ \
--state previous_manifest/

state:modified+ means: run any model whose definition has changed, plus all models downstream of it. If you change stg_orders.sql, slim CI runs stg_orders, int_user_order_stats, fct_user_features, and any other models in that subgraph - but skips the 250 models that are completely unaffected.

GitLab CI Pipeline for dbt Slim CI

# .gitlab-ci.yml

stages:
- validate
- deploy

variables:
DBT_PROFILES_DIR: $CI_PROJECT_DIR
DBT_TARGET: prod

dbt-slim-ci:
stage: validate
image: python:3.11-slim
rules:
- if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
before_script:
- pip install dbt-bigquery==1.7.0 dbt-utils dbt-expectations
- dbt deps
# Download the last production manifest for state comparison
- aws s3 cp s3://${STATE_BUCKET}/dbt-state/prod/manifest.json previous_manifest/manifest.json
script:
# Parse the project - catches syntax errors fast
- dbt parse
# Run only changed models and their downstream dependencies
- >
dbt build
--select state:modified+
--state previous_manifest/
--target ci
artifacts:
paths:
- target/manifest.json
expire_in: 1 hour

dbt-deploy:
stage: deploy
image: python:3.11-slim
rules:
- if: '$CI_COMMIT_BRANCH == "main"'
before_script:
- pip install dbt-bigquery==1.7.0 dbt-utils dbt-expectations
- dbt deps
script:
# Full production build on merge to main
- dbt source freshness
- dbt build --target prod
# Upload manifest for next slim CI comparison
- aws s3 cp target/manifest.json s3://${STATE_BUCKET}/dbt-state/prod/manifest.json

Advanced Testing: Building a Data Quality Gate

For ML pipelines, data quality is not optional - a model trained on corrupted features produces silently wrong predictions. The goal is to build a quality gate that blocks ML training if data quality fails.

Multi-Layer Testing Strategy

# models/marts/ml/schema.yml - comprehensive test suite

version: 2

models:
- name: fct_user_features
description: >
One row per customer. Updated nightly at 02:00 UTC.
Primary input to churn prediction and recommendation models.
meta:
ml_consumers: ["churn-model-v2", "reco-model-v5"]
sla: "available by 03:00 UTC"

tests:
# Table-level tests: shape and freshness
- dbt_expectations.expect_table_row_count_to_be_between:
min_value: 100000
max_value: 50000000
- dbt_expectations.expect_table_columns_to_match_ordered_list:
column_list:
- customer_id
- orders_30d
- revenue_30d
- avg_order_value_30d
- last_order_at
- discount_rate_30d
- unique_categories_30d
- feature_computed_at

columns:
- name: customer_id
tests:
- unique
- not_null

- name: orders_30d
tests:
- not_null
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0
max_value: 10000
- dbt_expectations.expect_column_mean_to_be_between:
min_value: 0.5 # average customer has at least 0.5 orders in 30d
max_value: 50.0 # if average is above 50, something is double-counting

- name: revenue_30d
tests:
- not_null
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0
max_value: 100000 # $100k in 30 days would be extremely unusual

- name: discount_rate_30d
tests:
- not_null
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0.0
max_value: 1.0 # must be a proportion

- name: feature_computed_at
tests:
- not_null
- dbt_expectations.expect_column_values_to_be_between:
min_value: "dateadd(hour, -26, current_timestamp)"
max_value: "current_timestamp"
# Feature must have been computed within the last 26 hours
# (24h schedule + 2h buffer) - catches stale feature tables

Custom Statistical Tests

For detecting feature drift - when feature distributions shift significantly from historical norms - write custom SQL tests:

-- tests/assert_orders_30d_not_drifted.sql
-- Fails if today's average orders_30d is more than 2 standard deviations
-- from the 7-day rolling average

WITH current_stats AS (
SELECT AVG(orders_30d) AS today_mean
FROM {{ ref('fct_user_features') }}
),

historical_stats AS (
-- Assume you log daily feature stats to an audit table
SELECT
AVG(daily_mean) AS hist_mean,
STDDEV(daily_mean) AS hist_stddev
FROM {{ ref('fct_feature_stats_log') }}
WHERE stat_date >= CURRENT_DATE - INTERVAL '7 days'
AND feature_name = 'orders_30d'
),

drift_check AS (
SELECT
current_stats.today_mean,
historical_stats.hist_mean,
historical_stats.hist_stddev,
ABS(current_stats.today_mean - historical_stats.hist_mean) /
NULLIF(historical_stats.hist_stddev, 0) AS z_score
FROM current_stats, historical_stats
)

-- Return rows (test failure) when z-score exceeds 2 standard deviations
SELECT *
FROM drift_check
WHERE z_score > 2.0

This test implements a simple statistical process control check: if today's feature mean is more than 2 standard deviations from the recent historical mean, the pipeline halts before training begins.


Performance Tuning a 300-Model Project

Threads and Parallelism

# profiles.yml
ecomm_dbt:
target: prod
outputs:
prod:
type: bigquery
project: my-project
dataset: dbt_prod
threads: 16 # run 16 models in parallel
timeout_seconds: 900

Increasing threads reduces wall-clock time proportionally up to your warehouse's concurrency limit. On BigQuery, 16 threads is usually safe. On Snowflake, it depends on warehouse size.

Partial Parsing

# dbt_project.yml
flags:
partial_parse: true # only re-parse changed files - speeds up startup 10-30 seconds

Model-Level Tags for Selective Runs

-- models/marts/ml/fct_user_features.sql
{{
config(
materialized='incremental',
tags=['ml', 'nightly', 'critical']
)
}}
# Run only models tagged 'critical' during incidents
dbt build --select tag:critical

# Run only nightly models in the scheduled job
dbt build --select tag:nightly

Cross-Database Macros: Writing Adapter-Agnostic SQL

A major operational challenge for large dbt projects is supporting multiple warehouses - BigQuery in production, DuckDB or Postgres in development, Snowflake in a client environment. SQL dialects differ in small but painful ways.

dbt provides adapter-dispatched macros as the solution: write the logic once using a dispatch macro, with implementations for each adapter:

-- macros/date_trunc_week.sql
-- Returns the Monday of the week containing the given date
-- Adapter-specific because BigQuery and Snowflake use different conventions

{% macro date_trunc_week(column) %}
{{ return(adapter.dispatch('date_trunc_week', 'my_macros')(column)) }}
{% endmacro %}

{% macro bigquery__date_trunc_week(column) %}
DATE_TRUNC({{ column }}, WEEK(MONDAY))
{% endmacro %}

{% macro snowflake__date_trunc_week(column) %}
DATE_TRUNC('week', {{ column }})
{% endmacro %}

{% macro postgres__date_trunc_week(column) %}
DATE_TRUNC('week', {{ column }}) + INTERVAL '1 day'
-- Postgres WEEK starts on Sunday; add 1 day to get Monday
{% endmacro %}

{% macro duckdb__date_trunc_week(column) %}
DATE_TRUNC('week', {{ column }}) + INTERVAL '1 day'
{% endmacro %}

Use in a model - dbt automatically dispatches to the right implementation:

SELECT
user_id,
{{ date_trunc_week('event_date') }} AS week_start,
COUNT(*) AS weekly_sessions
FROM {{ ref('stg_events') }}
GROUP BY 1, 2

This pattern lets the same dbt project run against BigQuery in production, Postgres in local development (via dbt-postgres), and DuckDB for fast unit tests - without any model-level SQL changes.


The dbt Semantic Layer

The semantic layer (MetricFlow) allows you to define business metrics centrally and query them from any consumer - BI tools, ML pipelines, notebooks:

# models/metrics.yml
version: 2

metrics:
- name: monthly_active_users
label: Monthly Active Users
model: ref('fct_user_activity')
description: "Users with at least one session in the trailing 30 days"
calculation_method: count_distinct
expression: user_id
timestamp: activity_date
time_grains: [day, week, month, quarter]
dimensions:
- plan_tier
- country_code

- name: revenue_per_user
label: Revenue Per User
model: ref('fct_user_features')
calculation_method: average
expression: revenue_30d
timestamp: feature_computed_at
time_grains: [day, week, month]

These metrics can be queried consistently by Looker, Metabase, the dbt Cloud IDE, and your ML training code - all using the same definition. This eliminates the "which SQL is the correct one?" problem that plagues organizations with multiple BI tools.


Production Engineering Notes

:::tip Snapshot frequency matters for ML Snapshots only capture the state of the source table when the snapshot job runs. If a patient record changes 5 times between two snapshot runs, only the final state before the next run is captured. For entities that change frequently (order status, account balance), run snapshots multiple times per day. For slowly-changing entities (demographic profile, plan tier), nightly is usually sufficient. :::

:::warning dbt Cloud vs dbt Core dbt Core is free and open source - it is the CLI you use for local development and can be run in any CI/CD system. dbt Cloud is a managed SaaS product from dbt Labs that adds: a web IDE, a job scheduler (replacing cron), a semantic layer server, and the dbt Explorer (enhanced lineage visualization). For most teams, the GitLab CI + dbt Core approach in this lesson is equivalent in functionality and significantly cheaper. dbt Cloud becomes worth the cost (starting ~$100/developer/month) when you need the web IDE for non-engineers or the managed scheduler with alerting. :::

:::danger Never run snapshots with --full-refresh Running dbt snapshot --full-refresh drops and recreates the snapshot table, losing all historical records. This is almost never what you want. Snapshots are append-only by design - the history is the value. If you need to fix a bug in a snapshot, write a corrective migration against the snapshot table directly, or re-run the snapshot from the beginning by rebuilding from scratch in a separate development environment first. :::


Common Mistakes

Mistake 1: Running snapshots too infrequently. A snapshot that runs once a day captures at most one state change per entity per day. For patient records that can change multiple times (medication added, diagnosis updated, status changed), a once-daily snapshot will miss intermediate states. For high-velocity entities, run snapshots every 4-6 hours.

Mistake 2: Not understanding the check strategy's cost. The check strategy hashes every row in the source table on every snapshot run. For a 50M-row customer table, this is expensive. Use timestamp strategy whenever the source has a reliable updated_at, and reserve check for sources that genuinely have no update timestamp.

Mistake 3: Using state:modified+ without a valid state artifact. If the production manifest.json is stale or missing, slim CI either falls back to running all models or fails. Make the manifest upload a required step in your production deployment pipeline, not an optional one. Treat a missing manifest as a pipeline failure that blocks the next release.

Mistake 4: Writing macros before understanding the problem. Jinja macros are powerful but make SQL harder to debug. A 200-line macro that generates 40 feature columns will produce a 2,000-line compiled SQL file that is nearly impossible to debug when something is wrong. Write macros for well-understood, repeating patterns - not for one-off complex logic. The threshold question: would you copy this SQL block more than 3 times? If yes, macro it. If no, leave it inline.

Mistake 5: Ignoring the dbt project evaluator output. Teams accumulate technical debt in dbt projects just like in application code. Models without tests are feature tables without validation. Models without documentation are a future on-call nightmare. Run dbt-project-evaluator monthly and treat its findings like lint warnings - fix them before they accumulate. A 300-model project with 100 untested models is a liability, not an asset.


Interview Q&A

Q: What is a dbt snapshot and when would you use it for ML?

A dbt snapshot is dbt's implementation of Slowly Changing Dimensions Type 2. When you run dbt snapshot, dbt compares the current state of a source table against its previous snapshot and appends new rows for any records that have changed, stamping each row with dbt_valid_from and dbt_valid_to timestamps. For ML, snapshots are essential for any feature derived from a mutable entity - customer profiles, product attributes, account status, clinical records. Without snapshots, your feature join always uses the current state of the entity, introducing temporal leakage: training examples see future information about what the entity looks like after the label event, not what it looked like at prediction time. Use snapshots whenever your training data needs to correctly reflect what the world looked like at a specific historical moment.

Q: How do you implement point-in-time correct features in dbt?

You need two ingredients: a snapshot table (SCD2) for the mutable entity, and a clear "as-of timestamp" for each training example. The join pattern is: for each training example row, join the snapshot table on both the entity key AND the temporal window: entity_id = snapshot.entity_id AND example_timestamp BETWEEN snapshot.dbt_valid_from AND snapshot.dbt_valid_to. The dbt_valid_to IS NULL condition captures the current active record. This pattern guarantees that each training example sees only the attribute values that existed at the time of the event. For complex scenarios with multiple slowly-changing entities, this join may be repeated for each entity type. The point-in-time join pattern is the most important technique for eliminating training/serving skew from temporal leakage.

Q: What is slim CI in dbt and why does it matter at scale?

Slim CI is the practice of comparing the current project state against a known-good state artifact (a production manifest.json) and running only models that have changed plus their downstream dependencies, using dbt build --select state:modified+ --state previous_state/. At 10 or 20 models, running everything in CI takes seconds and slim CI offers no benefit. At 300 models, a full dbt build might take 4 hours - completely unacceptable for a PR feedback loop. Slim CI reduces CI time to minutes by only exercising the affected subgraph. The critical implementation detail is that the production manifest must be uploaded to a reliable location (S3, GCS) after every successful production build, so CI always has a valid artifact to compare against.

Q: How do dbt packages extend functionality?

dbt packages are collections of macros, models, and tests published to the dbt Package Hub and installed via packages.yml and dbt deps. The most important are: dbt-utils (the standard library - surrogate_key, date_spine, pivot, accepted_range), dbt-expectations (Great Expectations wrapped as dbt YAML tests, for statistical data validation), and dbt-project-evaluator (structural linting for dbt projects). For ML teams, dbt-utils's date_spine is the most critical: it generates complete date sequences with no gaps, which is essential for creating properly-shaped training datasets where days with no activity have explicit zero rows rather than simply being absent.

Q: What is the dbt semantic layer?

The dbt semantic layer (built on MetricFlow, acquired by dbt Labs in 2023) is a centralized metric definition layer that lives in dbt and can be queried by any consumer - BI tools, ML training code, notebooks. Instead of each team writing their own SQL definition of "monthly active users" or "revenue per user," the definition lives once in dbt as a YAML metric declaration. Any tool that integrates with the semantic layer can query this metric consistently, with the same filtering, time graining, and dimension breakdowns. For ML teams, this matters because it eliminates metric definition drift between the feature engineering code and the reporting code - your model trains on the same definition of "user engagement" that the business reports on.

Q: How would you structure a dbt project for 300+ models?

Structure and governance are what separate a maintainable 300-model project from an unmaintainable one. Key principles: enforce the staging → intermediate → marts layering strictly - no model in marts should reference raw sources directly. Use tags (ml, reporting, critical, nightly) to enable selective execution. Use custom schema routing to isolate different model groups into separate warehouse schemas, which helps with access control and query cost attribution. Run dbt-project-evaluator regularly and enforce a zero-tolerance policy on models without tests or descriptions. Use slim CI on all PRs so changes are validated quickly. Assign model ownership via a meta field in the model config and use it to route alert notifications to the right team. Consider splitting into multiple dbt projects (a platform project for shared staging models and a domain project for domain-specific marts) once the team exceeds ~5 engineers - this prevents merge conflicts on shared files and scales governance independently.

© 2026 EngineersOfAI. All rights reserved.