:::tip 🎮 Interactive Playground Visualize this concept: Try the Dataset Lineage demo on the EngineersOfAI Playground - no code required. :::
Data Catalog and Discovery
The New Engineer Problem
She had been hired as a data scientist on the growth team. Her first task: build a model to predict which users were likely to upgrade from free to paid. She needed user purchase history.
The company had been operating for five years. The data warehouse had 3,000 tables across four databases and two data lakes. There was no central documentation. There was a Confluence wiki with a "Data Dictionary" page that had not been updated in two years.
She spent her first two weeks doing the following: searching Confluence, searching Slack, asking four different engineers which table contained purchase history. The first engineer pointed her to events.raw_purchases - deprecated, last updated 18 months ago. The second pointed to warehouse.user_transactions - the right lineage, but only contained transactions through 2022 due to a migration that had not been backfilled. The third pointed to mart.purchase_events_v2 - the most recent table, but contained only B2B purchases, not consumer. The fourth, who had been at the company the longest, pointed her to mart.purchase_events_v3_prod, which was correct - but had no documentation, no owner listed, and no indication of its update schedule.
Two weeks of onboarding cost. Two weeks of a senior engineer's time, spread across four people, to answer a question that should have taken 30 seconds.
A data catalog would have answered her question in 30 seconds. She would have searched "purchase history", found four results ranked by freshness and usage frequency, seen immediately which ones were deprecated, which ones were production, who owned them, and how often they were updated. She would have started building her model on day three instead of day fifteen.
Why Data Catalogs Exist
Every data asset in a large organization has metadata: what table it is, what columns it has, who owns it, when it was last updated, what quality checks pass or fail, what tables it depends on. The problem is that this metadata is scattered across dozens of systems - in database information schemas, in dbt documentation, in Airflow DAG definitions, in Confluence pages, in engineers' heads.
A data catalog is the centralized store for all of this metadata. It ingests from every source (databases, dbt, Airflow, BI tools, ML platforms), normalizes the metadata into a unified schema, and makes it searchable through a single interface.
The discovery problem is real at any organization with more than a few hundred tables. Without a catalog, the institutional knowledge of what data exists and what it means is tribal - held by individual engineers and analysts who have been around long enough to have absorbed it. When those people leave, the knowledge leaves with them.
The catalog replaces tribal knowledge with documented, searchable, automatically-maintained metadata. It is the difference between asking "does anyone know which table has purchase history?" and searching a system that answers in under a second with ranked, up-to-date results.
Historical Context
The enterprise data governance world has had catalog-like tools since the 2000s. IBM InfoSphere Information Catalog, Informatica Axon, Collibra - these were heavy, expensive, consultant-deployed tools built primarily for compliance and data governance in regulated industries. They required manual curation of metadata and were often more burden than benefit.
The modern data catalog was reinvented by the tech industry in the 2015–2020 period in response to the scale of the modern data stack. Airbnb built Dataportal (2017), LinkedIn built DataHub (2019, open-sourced 2020), Lyft built Amundsen (2019, open-sourced 2019), and WeWork built Marquez (focused on lineage). All of these shared a core insight: metadata should be automatically collected from the systems that produce it, not manually maintained by catalog administrators.
This shift - from manual curation to automated ingestion - is what makes modern catalogs practical at scale. When metadata is automatically collected from the warehouse, dbt docs, Airflow, and BI tools, it stays current without any ongoing maintenance effort.
DataHub: Architecture and Ingestion
DataHub is the most widely adopted open-source data catalog. Built by LinkedIn and open-sourced in 2020, it supports ingestion from 50+ data sources, provides a lineage graph, business glossary, ownership management, and quality integration.
DataHub's architecture has three layers:
- Ingestion framework: Python-based connectors that pull metadata from source systems and emit it as metadata change events (MCEs)
- Metadata store: a graph database (Neo4j or Elasticsearch-backed) that stores entities (datasets, users, jobs), their properties, and the relationships between them
- APIs and UI: GraphQL API, REST API, and a React web UI for search, browse, and entity pages
Configuring DataHub Ingestion
DataHub ingestion is configured via YAML "recipes." Each recipe specifies a source connector, its configuration, and the sink (usually the DataHub GMS server).
# datahub-recipes/snowflake-ingestion.yml
source:
type: snowflake
config:
account_id: "mycompany.us-east-1"
username: "${SNOWFLAKE_USER}"
password: "${SNOWFLAKE_PASSWORD}"
role: "DATAHUB_ROLE"
warehouse: "COMPUTE_WH"
database_pattern:
allow:
- "PROD_DB"
- "STAGING_DB"
# Capture column-level statistics via profiling
profiling:
enabled: true
profile_table_level_only: false
max_workers: 4
sink:
type: datahub-rest
config:
server: "http://datahub-gms:8080"
token: "${DATAHUB_TOKEN}"
# Run this recipe on a schedule in Airflow or cron
# datahub ingest -c snowflake-ingestion.yml
# datahub-recipes/dbt-ingestion.yml
source:
type: dbt
config:
manifest_path: "/dbt/target/manifest.json"
catalog_path: "/dbt/target/catalog.json"
run_results_path: "/dbt/target/run_results.json"
target_platform: "snowflake"
# Pull dbt test results as quality indicators
include_test_results: true
# Include column-level lineage from dbt SQL
include_column_lineage: true
# Map dbt owners to DataHub users
owner_extraction_pattern: "(?:^|\\s)(@[\\w-]+)"
sink:
type: datahub-rest
config:
server: "http://datahub-gms:8080"
token: "${DATAHUB_TOKEN}"
Emitting Custom Metadata via Python SDK
For custom metadata that does not come from a supported source (e.g., ML feature importance scores, custom quality metrics), use the Python SDK to emit metadata change events directly.
import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
DatasetPropertiesClass,
OwnershipClass,
OwnerClass,
OwnershipTypeClass,
GlobalTagsClass,
TagAssociationClass,
DatasetStatusClass,
)
emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080")
def register_dataset(
platform: str,
dataset_name: str,
owner_email: str,
description: str,
tags: list,
freshness_sla_hours: float,
is_deprecated: bool = False,
) -> None:
"""
Register a dataset in DataHub with full metadata.
Call this from your pipeline's deployment script to keep the catalog current.
"""
dataset_urn = builder.make_dataset_urn(
platform=platform,
name=dataset_name,
env="PROD",
)
# Dataset properties: description, custom metadata
properties_mce = builder.make_lineage_mce(
upstream_urns=[],
downstream_urn=dataset_urn,
)
# Ownership: assign a technical owner
owner_urn = builder.make_user_urn(owner_email.split("@")[0])
ownership_mcp = builder.make_ownership_mcp(
entity_urn=dataset_urn,
owners=[OwnerClass(owner=owner_urn, type=OwnershipTypeClass.TECHNICAL_OWNER)],
)
# Tags: apply classification tags
tag_associations = [
TagAssociationClass(tag=builder.make_tag_urn(tag))
for tag in tags
]
tags_mcp = builder.make_global_tags_mcp(
entity_urn=dataset_urn,
tags=tag_associations,
)
# Deprecation status
if is_deprecated:
status_mcp = builder.make_status_mcp(
entity_urn=dataset_urn,
removed=True,
)
emitter.emit_mcp(status_mcp)
emitter.emit_mcp(ownership_mcp)
emitter.emit_mcp(tags_mcp)
print(f"Registered {dataset_name} in DataHub catalog")
# Register a new table when it is created
register_dataset(
platform="snowflake",
dataset_name="PROD_DB.MART.USER_PURCHASE_HISTORY",
description="Unified purchase history for all users across B2B and consumer channels. Updated hourly. Source: billing_db.invoices + crm_db.accounts.",
tags=["purchase", "user-data", "pii", "production"],
freshness_sla_hours=2.0,
)
Apache Atlas: Hadoop Ecosystem Catalog
Apache Atlas is the standard metadata and governance framework for the Hadoop ecosystem. While DataHub targets the modern data stack (Snowflake, dbt, Kafka), Atlas is deeply integrated with HBase, Hive, HBase, Ranger, and Ambari.
Atlas provides two features that differentiate it in enterprise settings:
Tag propagation: you can tag a Hive table as "contains PII" and Atlas automatically propagates that tag to all downstream tables and views that derive from it. This is powerful for compliance - when new data assets are created from PII sources, they are automatically classified.
Ranger integration: Atlas classifications integrate directly with Apache Ranger for access control. Tag a column as encrypted_pii and Ranger automatically applies column masking policies to all queries that access it.
For organizations not running the Hadoop stack, DataHub or Amundsen are better choices. For Cloudera Data Platform or Hortonworks HDP environments, Atlas is the native solution.
Business Glossary: Connecting Technical to Business
A data catalog without a business glossary is a technical inventory. A catalog with a business glossary is a shared language between engineering and the business.
The business glossary maps technical terms (table names, column names) to business definitions that non-engineers can understand. It also establishes canonical definitions - when different teams use the same term to mean different things, the glossary is where the authoritative definition lives.
# Example: programmatically managing a business glossary in DataHub
from datahub.emitter.rest_emitter import DatahubRestEmitter
import datahub.emitter.mce_builder as builder
GLOSSARY_TERMS = [
{
"term": "lifetime_value_90d",
"definition": "The predicted total revenue from a customer in the 90 days following their acquisition date. Computed by the ml-ltv-model pipeline. Updated daily. Owner: Growth Analytics team.",
"related_columns": [
"mart.user_features.ltv_90d",
"mart.growth_metrics.predicted_ltv_90d",
],
},
{
"term": "active_user",
"definition": "A user who has performed at least one qualifying engagement event in the trailing 30 days. 'Qualifying events' are defined in events.event_taxonomy. Note: this definition changed in Q2 2023 - sessions under 10 seconds were excluded. Historical data before 2023-04-01 uses the previous definition.",
"related_columns": [
"mart.user_status.is_active_30d",
"mart.growth_metrics.mau",
],
},
{
"term": "arr",
"definition": "Annual Recurring Revenue. Sum of all active subscription invoices annualized. B2B contracts only - does not include consumer subscriptions. Source: billing_db.invoices WHERE contract_type = 'subscription'. Finance team definition aligned with GAAP recognition rules as of 2022.",
"related_columns": [
"mart.revenue_metrics.arr_usd",
"reporting.executive_kpis.arr",
],
},
]
def register_glossary_terms(emitter: DatahubRestEmitter, terms: list) -> None:
for term_data in terms:
term_urn = f"urn:li:glossaryTerm:{term_data['term']}"
# Emit the glossary term with definition
glossary_mcp = builder.make_glossary_term_mcp(
term_urn=term_urn,
definition=term_data["definition"],
)
emitter.emit_mcp(glossary_mcp)
# Associate the term with each related column
for col_ref in term_data.get("related_columns", []):
parts = col_ref.rsplit(".", 1)
if len(parts) == 2:
dataset_urn = builder.make_dataset_urn("snowflake", parts[0].upper(), "PROD")
# Attach glossary term to the column on this dataset
# (actual API call depends on DataHub SDK version)
print(f" Linked {term_data['term']} → {col_ref}")
print(f"Registered glossary term: {term_data['term']}")
Automated Metadata Collection: Table Profiling
A catalog populated only with schema information (column names and types) has limited value. The high-value metadata that drives data discovery is statistical: what are the common values in this column? What is the typical range of numeric fields? What fraction of rows are null?
Table profiling runs statistical queries against each table to collect this information automatically.
import psycopg2
import json
from typing import Dict, Any, List
from datetime import datetime
def profile_table(conn, table: str, sample_pct: float = 10.0) -> Dict[str, Any]:
"""
Run a statistical profile of all columns in a table.
Profiles at a sample percentage to limit warehouse compute cost.
Stores results in the catalog.
"""
schema_name, table_name = table.split(".")
# Get column list and types
with conn.cursor() as cur:
cur.execute("""
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = %s AND table_name = %s
ORDER BY ordinal_position
""", (schema_name, table_name))
columns = cur.fetchall()
profile = {
"table": table,
"profiled_at": datetime.utcnow().isoformat(),
"sample_pct": sample_pct,
"columns": {},
}
for col_name, col_type in columns:
col_profile = {"type": col_type}
# Null rate - always compute
with conn.cursor() as cur:
cur.execute(f"""
SELECT
COUNT(*) AS total,
SUM(CASE WHEN {col_name} IS NULL THEN 1 ELSE 0 END) AS nulls,
COUNT(DISTINCT {col_name}) AS distinct_count
FROM {table}
TABLESAMPLE BERNOULLI({sample_pct})
""")
row = cur.fetchone()
if row and row[0] > 0:
col_profile["null_rate"] = round(row[1] / row[0], 4)
col_profile["distinct_count"] = row[2]
col_profile["total_sampled"] = row[0]
# Numeric stats for numeric columns
if any(t in col_type for t in ["int", "float", "numeric", "decimal", "double"]):
with conn.cursor() as cur:
cur.execute(f"""
SELECT
MIN({col_name}::float),
MAX({col_name}::float),
AVG({col_name}::float),
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY {col_name}::float),
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY {col_name}::float)
FROM {table}
TABLESAMPLE BERNOULLI({sample_pct})
WHERE {col_name} IS NOT NULL
""")
stats = cur.fetchone()
if stats and stats[0] is not None:
col_profile["min"] = stats[0]
col_profile["max"] = stats[1]
col_profile["mean"] = round(stats[2], 4)
col_profile["median"] = stats[3]
col_profile["p95"] = stats[4]
# Top values for low-cardinality columns
if col_profile.get("distinct_count", 9999) < 50:
with conn.cursor() as cur:
cur.execute(f"""
SELECT {col_name}::text, COUNT(*) AS freq
FROM {table}
TABLESAMPLE BERNOULLI({sample_pct})
WHERE {col_name} IS NOT NULL
GROUP BY {col_name}
ORDER BY freq DESC
LIMIT 10
""")
col_profile["top_values"] = [
{"value": r[0], "count": r[1]}
for r in cur.fetchall()
]
profile["columns"][col_name] = col_profile
return profile
Data Ownership and Stewardship
A catalog entry without an owner is nearly worthless in a production incident. "Who do I call when this table is wrong?" is the first question during a data incident triage. If the answer is "nobody knows," the incident takes 3x longer to resolve.
Every table in a production catalog should have:
- Technical owner: the team or engineer responsible for the pipeline that produces this table. They are on-call for pipeline failures.
- Business owner: the team that consumes and depends on this table. They define the freshness SLA and quality requirements.
- Steward (optional): an analyst or data scientist who can answer semantic questions about the data - "what does this column mean? when was the definition last changed?"
Ownership should be machine-readable so that alerting systems can route incidents to the right person. Integrating the catalog with PagerDuty means that a freshness alert can automatically notify the technical owner without anyone having to look up who that is.
Active Metadata: Turning Catalog Data Into Actions
The first generation of data catalogs was purely informational - they stored metadata and displayed it. Active metadata takes this further: the catalog does not just store information about data, it acts on it.
Common active metadata patterns:
| Trigger | Action |
|---|---|
| Freshness SLA violated | Create Jira ticket, assign to technical owner |
| Schema change detected | Notify all downstream consumers via Slack |
| PII tag applied to column | Automatically restrict access via Ranger/IAM |
| New table created without owner | Send onboarding checklist to pipeline author |
| Quality check fails for N consecutive days | Escalate from Slack to PagerDuty |
# DataHub Actions Framework - active metadata example
# Trigger a Slack notification when any PII column is added to a production table
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
def on_schema_change_event(event: dict, slack_webhook: str) -> None:
"""
Called by DataHub Actions framework when a schema change event fires.
Sends Slack alert if any new column has PII classification.
"""
import requests
dataset_urn = event.get("entityUrn", "")
changed_columns = event.get("aspectValue", {}).get("fields", [])
pii_columns = []
for col in changed_columns:
tags = col.get("tags", {}).get("tags", [])
for tag in tags:
if "pii" in tag.get("tag", "").lower():
pii_columns.append(col.get("fieldPath", "unknown"))
if pii_columns:
message = {
"text": (
f"*PII Schema Change Alert*\n"
f"Dataset: `{dataset_urn}`\n"
f"New PII columns detected: `{', '.join(pii_columns)}`\n"
f"Action required: review access controls and data retention policies."
)
}
requests.post(slack_webhook, json=message)
Search and Discovery at Scale
The value of a catalog is only realized if engineers can actually find what they are looking for. Search quality matters: a data scientist searching "user purchase" should find mart.user_purchase_history as the first result, not a list of 30 tables with "user" or "purchase" somewhere in their name.
DataHub uses Elasticsearch for full-text search across table names, column names, descriptions, business glossary terms, and sample values. It supports:
- Fuzzy matching: "purchas" finds "purchase_history"
- Column-level search: searching "invoice_ts" finds every table that has a column named
invoice_ts - Faceted filtering: filter by platform (Snowflake), environment (PROD), owner, tags
- Popularity ranking: tables accessed more frequently by more users rank higher
:::danger A catalog with stale metadata is worse than no catalog A catalog that shows deprecated tables as current, shows wrong ownership, or shows descriptions from two years ago creates active misinformation. Engineers learn quickly that the catalog cannot be trusted and stop using it. The consequence is that you spent engineering time building a catalog that nobody uses. Automated ingestion is not optional - it is the mechanism that keeps the catalog trustworthy. Every source system should have scheduled ingestion configured so that the catalog is never more than 24 hours out of date. :::
:::warning Catalog adoption requires cultural change, not just tooling Building a catalog is technically straightforward. Getting engineers to use it, keep descriptions up to date, and set ownership is a cultural challenge. The most effective approach is to make the catalog the default entry point for data access: link every Slack alert to the catalog entry for the affected table, link dbt model documentation to the catalog, link BI tool datasets to catalog entries. When the catalog is the most useful place to go for information about a data asset, adoption follows naturally. :::
Interview Q&A
Q: What is a data catalog, and why does it become necessary at scale?
A: A data catalog is a centralized inventory of all data assets in an organization - what tables exist, what their columns mean, who owns them, when they were last updated, what quality checks they pass, and how they relate to each other through lineage. It becomes necessary at scale for two reasons. First, the discovery problem: with hundreds or thousands of tables, engineers cannot rely on tribal knowledge to find the right data asset. Without a catalog, onboarding a new data scientist can take weeks of asking around instead of days. Second, the trust problem: without ownership and quality metadata, engineers cannot tell whether a table is production-grade or an abandoned experiment, whether it is current or stale, whether its definition has changed since they last used it.
Q: Compare DataHub and Apache Atlas. When would you choose each?
A: DataHub targets the modern data stack - it has connectors for Snowflake, BigQuery, Redshift, dbt, Airflow, Looker, Tableau, and Kafka. It is built for cloud-native architectures and is easier to operate. Apache Atlas is built for the Hadoop ecosystem - it has deep integration with Hive, HBase, Ranger, and Cloudera. Atlas's key advantage is tag propagation (PII tags automatically propagate to derived tables) and Ranger integration for column-level access control. I would choose DataHub for any organization using the modern cloud data stack, and Atlas only if the organization is running Cloudera Data Platform or heavily invested in the Hadoop ecosystem.
Q: What is a business glossary, and why is it important for a data platform?
A: A business glossary maps technical column names to business definitions, and resolves definitional conflicts between teams. For example, "active user" may mean different things to the growth team, the product team, and the finance team. The business glossary establishes the canonical definition and links it to the specific columns that implement it. This is important because ambiguous definitions cause incorrect analysis - if a data scientist uses the growth team's definition of "active user" for a report that finance expects to use the GAAP-aligned definition, the numbers will not reconcile. The glossary makes these definitional choices explicit, documented, and searchable.
Q: What is active metadata, and how does it differ from passive metadata?
A: Passive metadata is stored information about data assets - descriptions, schemas, owners, lineage. It is informational: you query it to learn about a data asset. Active metadata uses the catalog as a trigger for automated actions. When a freshness SLA is violated, active metadata can automatically create a Jira ticket and assign it to the technical owner - without anyone having to manually check the catalog and respond. When a schema change adds a PII column, active metadata can automatically trigger an access control review. The shift from passive to active metadata turns the catalog from a documentation system into an operational system that drives data reliability actions automatically.
Q: How would you set up DataHub ingestion for a dbt + Snowflake stack?
A: I would configure two ingestion recipes. The first ingests from Snowflake directly - it captures table schemas, row counts, column statistics, and query usage patterns. This runs on a schedule (daily or hourly for busy warehouses). The second ingests from dbt using the dbt manifest, catalog, and run results JSON files. The dbt ingestion captures model documentation, column descriptions, owner tags, test results, and column-level lineage derived from dbt's compiled SQL. I would run the dbt ingestion after every dbt production run so that quality check results are always current in the catalog. I would also configure the ownership extraction pattern so that dbt model owners (defined via config(meta={"owner": "team-name"})) are automatically linked to DataHub user entities.
Q: How do you handle catalog data for deprecated tables?
A: Deprecated tables need to be explicitly marked in the catalog with a deprecation notice that includes: (1) the deprecation date, (2) the replacement table (with a direct link), and (3) the reason for deprecation. This prevents new consumers from using the old table while giving existing consumers a clear migration path. In DataHub, you set the DatasetStatusClass.removed = True flag and add a deprecation note to the dataset properties. The UI renders deprecated tables with a visual indicator so they cannot be accidentally mistaken for production tables. I also recommend keeping deprecated tables in the catalog (not deleting them) for a transition period, because historical lineage graphs still reference them and removing them creates broken lineage edges.
