What is cloud cost optimization?

What query optimisation, storage tiering, and cloud cost controls do for AI systems, when large-scale model training and feature computation drive unpredictable cloud spend, and how to implement cost reduction strategies in production AI data pipelines.

How does query optimization work in practice?

Data Platform Cost Optimisation for AI Teams covers cloud cost optimization, query optimization, storage tiering from first principles with code examples. Free lesson at https://engineersofai.com/docs/data-engineering/cloud-data-platforms/data-platform-cost-optimisation

What is the difference between cloud cost optimization and storage tiering?

See the full breakdown at https://engineersofai.com/docs/data-engineering/cloud-data-platforms/data-platform-cost-optimisation

:::tip 🎮 Interactive Playground Visualize this concept: Try the Build vs Buy demo on the EngineersOfAI Playground - no code required. :::

Data Platform Cost Optimisation for AI Teams

A company's cloud data bill grew from $40,000/month to$ 180,000/month in 18 months. Nobody had deliberately changed the architecture. They had added 6 new data scientists, doubled their raw data volume, enabled 3 more BI dashboards, and started running weekly model retraining jobs. The spending grew organically, invisibly, across dozens of services and hundreds of jobs.

A two-week cost optimization sprint brought the bill back to $65,000/month - a 64% reduction - without touching a single feature or breaking any pipeline. The savings were in five places:$ 45,000/month in idle warehouse capacity that was auto-resuming for infrequent queries and never auto-suspending fast enough; $35,000/month in full-table scans running on Snowflake against 2-TB tables because engineers had not added partition filters;$ 20,000/month in hot storage tier for cold data that had not been accessed in 6 months; $15,000/month in duplicate feature pipelines that three different teams had independently built for the same features; and$ 10,000/month in always-on compute clusters used only for overnight batch jobs.

None of these problems required clever engineering to fix. They required visibility - knowing where the money was going - and discipline - enforcing good habits before bad habits compounded.

This lesson covers the five categories of cloud data waste, concrete techniques to eliminate each one, the FinOps practices that keep costs under control as teams scale, and the cost optimization checklist that you can run against any data platform today.

Why This Exists - The Invisible Tax of Cloud Infrastructure

On-premises infrastructure had a different cost model: you bought servers, they depreciated over 5 years, and the marginal cost of running an extra query was zero. Engineers did not think about the cost of a SELECT statement because the server was paid for regardless.

Cloud infrastructure flips this model: you pay for every compute-second, every byte stored, every byte transferred, and every API call. The marginal cost of a bad query is real money. A missing partition filter on BigQuery does not just slow down a dashboard - it charges $50 for a query that should cost$ 0.05.

The challenge is that cloud cost is diffuse. Each service has its own billing model. Costs accumulate from hundreds of small decisions made by dozens of engineers. Without explicit cost governance, the bill grows faster than the data.

The Five Categories of Cloud Data Waste

Category 1 - Idle Compute

Idle compute is the simplest waste to understand and the easiest to fix. A warehouse or cluster that is running but not processing queries is burning money.

Snowflake: Aggressive Auto-Suspend

The default Snowflake auto-suspend is 10 minutes. For ML feature engineering warehouses that run once per day, 10 minutes of idle time is 10 minutes of XL warehouse credits:

-- Check current auto-suspend settings for all warehouses
SELECT name, size, auto_suspend, auto_resume, state
FROM information_schema.warehouses
ORDER BY name;

-- Fix: set aggressive auto-suspend on all non-critical warehouses
ALTER WAREHOUSE ml_feature_wh     SET AUTO_SUSPEND = 60;   -- 1 minute
ALTER WAREHOUSE analytics_wh      SET AUTO_SUSPEND = 60;
ALTER WAREHOUSE dev_wh            SET AUTO_SUSPEND = 60;

-- Dashboard/BI warehouses can have slightly higher suspend (restart latency matters more)
ALTER WAREHOUSE dashboard_wh      SET AUTO_SUSPEND = 300;  -- 5 minutes

-- Check the cost of idle warehouses in the last 30 days
SELECT
    warehouse_name,
    SUM(credits_used_compute)  AS compute_credits,
    SUM(credits_used_cloud_services) AS service_credits,
    ROUND(SUM(credits_used_compute) * 3.0, 2) AS estimated_cost_usd
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(DAY, -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY compute_credits DESC;

BigQuery: Committed vs. On-Demand Cost Modeling

For BigQuery, "idle compute" manifests differently - flat-rate reservations that are underutilized:

from google.cloud import bigquery_reservation_v1

client = bigquery_reservation_v1.ReservationServiceClient()

# List all reservations and check slot utilization
parent = "projects/my-project/locations/us"
reservations = list(client.list_reservations(parent=parent))
for res in reservations:
    print(f"Reservation: {res.name}")
    print(f"  Slot capacity: {res.slot_capacity}")
    # If avg_slot_utilization << slot_capacity, you're paying for idle slots

-- BigQuery: check slot utilization vs. reserved capacity
-- (from INFORMATION_SCHEMA in the region)
SELECT
    DATE(creation_time)        AS query_date,
    SUM(total_slot_ms) / 1000 / 3600  AS total_slot_hours,
    COUNT(*)                          AS query_count,
    SUM(total_bytes_processed) / 1e12 AS tb_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY 1
ORDER BY 1 DESC;

Databricks: Cluster Auto-Termination Enforcement

# Use Databricks cluster policies to enforce auto-termination
import requests

policy_config = {
    "name": "cost-control-policy",
    "definition": {
        "autotermination_minutes": {
            "type": "fixed",
            "value": 30,
            "hidden": False
        },
        "aws_attributes.availability": {
            "type": "allowlist",
            "values": ["SPOT_WITH_FALLBACK", "SPOT"],
            "defaultValue": "SPOT_WITH_FALLBACK"
        }
    }
}

Category 2 - Inefficient Queries

Inefficient queries are the most expensive and most common form of waste. A single engineer running SELECT * FROM events without a date filter on a 2 TB table generates $10 in BigQuery charges in 4 seconds. Multiply that by 10 engineers running 5 queries per day and you have$ 500/day in unnecessary query costs.

Partition Filter Enforcement

-- EXPENSIVE: full table scan on BigQuery (2 TB = $10)
SELECT user_id, COUNT(*) AS events
FROM events.user_events
WHERE event_type = 'purchase';  -- no date filter - scans entire table

-- CHEAP: partition-pruned query (5 GB = $0.025)
SELECT user_id, COUNT(*) AS events
FROM events.user_events
WHERE event_date = '2026-03-11'   -- partition filter: only scans 1 partition
  AND event_type = 'purchase';

Enforce partition filters: most cloud warehouses can require partition filter predicates on partitioned tables:

-- Snowflake: require partition filters (via clustering key enforcement at design time)
-- BigQuery: require partition filter at table level
ALTER TABLE events.user_events
  SET OPTIONS (require_partition_filter = TRUE);

-- Now this query fails with a helpful error instead of silently scanning everything:
SELECT * FROM events.user_events WHERE event_type = 'purchase';
-- Error: "Cannot query over table without a filter that can be used for partition elimination"

Column Selection

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()

# EXPENSIVE: SELECT * reads all 50 columns (2 TB scan)
bad_query = """
    SELECT *
    FROM events.user_events
    WHERE event_date = '2026-03-11'
"""

# CHEAP: SELECT only the 3 columns you need (~200 GB scan)
good_query = """
    SELECT user_id, event_type, revenue
    FROM events.user_events
    WHERE event_date = '2026-03-11'
"""

# Check cost before running (dry run)
def estimate_cost(query: str) -> float:
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    dry_run_job = client.query(query, job_config=job_config)
    return dry_run_job.total_bytes_processed / 1e12 * 5  # $5 per TB

print(f"SELECT * cost: ${estimate_cost(bad_query):.2f}")
print(f"SELECT columns cost: ${estimate_cost(good_query):.2f}")

Materialized Views for Expensive Repeated Queries

Expensive queries that run on a schedule should be replaced with materialized views. The view is computed once and refreshed periodically; all queries that match the view pattern are served from the pre-computed result:

-- Expensive: daily churn metrics dashboard query (runs 200x/day for different users)
-- Cost: 200 runs * $10/run = $2,000/day

-- Solution: materialized view (computed once per hour, queries served from cache)
CREATE MATERIALIZED VIEW features.daily_churn_metrics
PARTITION BY metric_date
OPTIONS (enable_refresh = true, refresh_interval_minutes = 60)
AS
SELECT
    DATE(event_date)         AS metric_date,
    subscription_tier,
    COUNT(DISTINCT user_id)  AS active_users,
    AVG(events_30d)          AS avg_events_30d,
    AVG(revenue_30d)         AS avg_revenue_30d,
    SUM(CASE WHEN churned THEN 1 ELSE 0 END) AS churned_users
FROM features.user_churn_features
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY 1, 2;

-- Now 200 dashboard queries/day hit the materialized view: ~$0.01/query instead of $10

Approximate Functions for Exploratory Work

-- Snowflake/BigQuery: use approximate aggregations during exploration
-- APPROX_COUNT_DISTINCT: ~2% error, 10-50x cheaper than COUNT(DISTINCT)
SELECT
    event_date,
    APPROX_COUNT_DISTINCT(user_id)          AS approx_unique_users,   -- fast, cheap
    APPROX_PERCENTILE(revenue, 0.5)         AS median_revenue,         -- fast, cheap
    APPROX_PERCENTILE(revenue, 0.95)        AS p95_revenue,
    APPROX_TOP_K(event_type, 5)             AS top_event_types
FROM events.user_events
WHERE event_date = CURRENT_DATE() - 1;

-- Exact versions for final production computations
SELECT
    event_date,
    COUNT(DISTINCT user_id)                 AS exact_unique_users,    -- slow, expensive
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY revenue) AS exact_median_revenue
FROM events.user_events
WHERE event_date = CURRENT_DATE() - 1;

Category 3 - Wrong Storage Tier

Cloud storage tiers are not all priced equally. The difference between S3 Standard ( $0.023/GB/month) and S3 Glacier Deep Archive ($ 0.00099/GB/month) is 23x. For 100 TB of data that has not been accessed in 12 months, the cost difference is $2,300/month vs.$ 99/month.

S3 Lifecycle Policies

import boto3

s3_client = boto3.client("s3")

# Comprehensive lifecycle policy for an ML data lake
lifecycle_config = {
    "Rules": [
        {
            "ID": "raw-data-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [
                {"Days": 7,   "StorageClass": "STANDARD_IA"},     # after 1 week
                {"Days": 30,  "StorageClass": "GLACIER_IR"},        # after 1 month
                {"Days": 180, "StorageClass": "DEEP_ARCHIVE"}       # after 6 months
            ]
        },
        {
            "ID": "training-data-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": "training-data/"},
            "Transitions": [
                {"Days": 14,  "StorageClass": "STANDARD_IA"},
                {"Days": 60,  "StorageClass": "GLACIER_IR"}
                # Keep in Glacier IR (instant retrieval) for potential retraining
            ],
            "Expiration": {"Days": 365}  # delete training data after 1 year
        },
        {
            "ID": "athena-results-cleanup",
            "Status": "Enabled",
            "Filter": {"Prefix": "athena-results/"},
            "Expiration": {"Days": 7}    # delete query results after 7 days
        },
        {
            "ID": "feature-store-hot",
            "Status": "Enabled",
            "Filter": {"Prefix": "features/"},
            # Keep features in Standard - accessed daily for training
            # No transition - features are hot data
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 1}
        }
    ]
}

s3_client.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration=lifecycle_config
)
print("Lifecycle policy applied")

VACUUM for Delta Lake and Iceberg

Delta Lake and Iceberg keep old file versions for time travel, but each version consumes storage. VACUUM removes old data files that are no longer needed:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.databricks.delta.retentionDurationCheck.enabled", "true") \
    .getOrCreate()

# Check how many old files a Delta table has accumulated
table_path = "s3://my-data-lake/features/user_churn_features"

detail = spark.sql(f"DESCRIBE DETAIL delta.`{table_path}`")
detail.select("numFiles", "sizeInBytes").show()

# Run VACUUM to remove files older than the retention period
# Default retention is 7 days (for time travel compatibility)
spark.sql(f"""
    VACUUM delta.`{table_path}`
    RETAIN 168 HOURS   -- 7 days
""")

# Check size reduction
detail_after = spark.sql(f"DESCRIBE DETAIL delta.`{table_path}`")
detail_after.select("numFiles", "sizeInBytes").show()

-- Snowflake: check Fail-safe and Time Travel storage costs
SELECT
    table_catalog,
    table_schema,
    table_name,
    ROUND(active_bytes       / 1e9, 2) AS active_gb,
    ROUND(time_travel_bytes  / 1e9, 2) AS time_travel_gb,
    ROUND(failsafe_bytes     / 1e9, 2) AS failsafe_gb,
    ROUND((active_bytes + time_travel_bytes + failsafe_bytes) / 1e9, 2) AS total_gb
FROM information_schema.table_storage_metrics
ORDER BY total_gb DESC
LIMIT 20;

-- Reduce time travel retention on dev/staging schemas (less history needed)
ALTER SCHEMA features_dev SET DATA_RETENTION_TIME_IN_DAYS = 1;

BigQuery Long-Term Storage

BigQuery automatically applies a 50% storage discount (from $0.020/GB to$ 0.010/GB) to tables that have not been modified for 90 consecutive days. This is automatic - no action required. But awareness is useful: splitting frequently-updated tables from archival tables maximizes the discount.

-- Identify BigQuery tables eligible for or already receiving long-term storage pricing
SELECT
    table_schema,
    table_name,
    ROUND(size_bytes / 1e9, 2)          AS size_gb,
    last_modified_time,
    DATE_DIFF(CURRENT_DATE(), DATE(last_modified_time), DAY) AS days_since_modified,
    CASE
        WHEN DATE_DIFF(CURRENT_DATE(), DATE(last_modified_time), DAY) >= 90
        THEN ROUND(size_bytes / 1e9 * 0.010, 2)   -- long-term rate
        ELSE ROUND(size_bytes / 1e9 * 0.020, 2)   -- active rate
    END AS estimated_monthly_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
ORDER BY size_bytes DESC
LIMIT 20;

Category 4 - Data Duplication

Data duplication is the least visible form of waste. It accumulates when teams independently solve the same problem: three ML teams all build a "30-day rolling revenue" feature, store it in three different tables, and maintain three separate pipelines. The total cost is 3x the storage, 3x the compute, and 3x the maintenance burden - plus inconsistency when the three pipelines diverge.

Audit Duplicate Feature Pipelines

import boto3
import json
from collections import defaultdict

# Find tables with identical or near-identical schemas in Glue Catalog
glue = boto3.client("glue", region_name="us-east-1")

paginator = glue.get_paginator("get_tables")
tables = []

for db in ["features_ml_team_a", "features_ml_team_b", "features_analytics", "features_data_platform"]:
    try:
        for page in paginator.paginate(DatabaseName=db):
            for table in page["TableList"]:
                columns = frozenset(
                    c["Name"] for c in table["StorageDescriptor"]["Columns"]
                )
                tables.append({
                    "database": db,
                    "table": table["Name"],
                    "column_set": columns,
                    "num_columns": len(columns)
                })
    except glue.exceptions.EntityNotFoundException:
        pass

# Find tables with identical column sets (potential duplicates)
column_sets: dict[frozenset, list] = defaultdict(list)
for t in tables:
    column_sets[t["column_set"]].append(f"{t['database']}.{t['table']}")

print("\n=== Potentially Duplicate Feature Tables ===")
for col_set, table_list in column_sets.items():
    if len(table_list) > 1:
        print(f"\nTables with identical schema ({len(col_set)} columns):")
        for t in table_list:
            print(f"  - {t}")

Canonical Dataset Strategy

The solution to duplication is a canonical dataset owned by the Data Platform team, with all other teams consuming it rather than building their own:

-- Before: 3 teams each compute their own 30-day revenue feature
-- ml_team_a.features.user_revenue_30d    (5 GB table, daily refresh)
-- ml_team_b.features.rolling_revenue     (4.8 GB table, daily refresh)
-- analytics.feature_store.revenue_roll   (5.1 GB table, daily refresh)
-- Total: 14.9 GB storage, 3 compute jobs, 3 different values for the same metric

-- After: 1 canonical feature table, 3 consumers with SELECT grants
-- data_platform.features.user_churn_features  (5 GB table, daily refresh)
-- ml_team_a reads via: SELECT ... FROM data_platform.features.user_churn_features
-- ml_team_b reads via: SELECT ... FROM data_platform.features.user_churn_features
-- analytics reads via: SELECT ... FROM data_platform.features.user_churn_features
-- Total: 5 GB storage, 1 compute job, 1 consistent value

Category 5 - Over-Provisioning

Over-provisioning occurs when warehouse or cluster size exceeds what the workload actually needs.

Warehouse Right-Sizing in Snowflake

The key insight: for a given query, a larger warehouse runs it faster but costs the same total credits if the time scales inversely with the size. An XL warehouse running a 2-minute query costs the same as a Medium warehouse running an 8-minute query - both consume 2 credit-minutes of XL or 4 credit-minutes of Medium... actually not equal, because XL is 4x Medium credits/hour.

The right-sizing analysis:

-- Find queries that ran on oversized warehouses (query was short, warehouse was large)
SELECT
    warehouse_name,
    warehouse_size,
    query_text,
    total_elapsed_time / 1000        AS duration_seconds,
    credits_used_cloud_services,
    -- If duration < 10 seconds on an XL warehouse, probably oversized
    CASE
        WHEN warehouse_size = 'X-Large' AND total_elapsed_time < 10000 THEN 'OVERSIZED'
        WHEN warehouse_size = 'Large'   AND total_elapsed_time < 5000  THEN 'POSSIBLY_OVERSIZED'
        ELSE 'OK'
    END AS sizing_assessment
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(DAY, -7, CURRENT_TIMESTAMP())
  AND warehouse_size IN ('Large', 'X-Large', '2X-Large')
  AND total_elapsed_time < 15000   -- queries under 15 seconds
ORDER BY warehouse_size DESC, total_elapsed_time ASC
LIMIT 50;

Spot / Preemptible Instances for Batch ML

Spot instances (AWS) and preemptible instances (GCP) offer 60-80% cost reduction for interruptible workloads. ML training jobs on large datasets - which already need checkpoint logic for fault tolerance - are ideal candidates:

import boto3
from botocore.config import Config

# EMR Serverless with spot instances for batch feature engineering
emr_serverless = boto3.client("emr-serverless", region_name="us-east-1")

job_response = emr_serverless.start_job_run(
    applicationId="app-id",
    executionRoleArn="arn:aws:iam::123456789:role/EMRServerlessRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://scripts/feature_engineering.py",
            "sparkSubmitParameters": (
                "--conf spark.dynamicAllocation.enabled=true "
                "--conf spark.dynamicAllocation.minExecutors=5 "
                "--conf spark.dynamicAllocation.maxExecutors=50"
            )
        }
    },
    configurationOverrides={
        "applicationConfiguration": [
            {
                "classification": "spark-defaults",
                "properties": {
                    # EMR Serverless uses spot capacity automatically for cost savings
                    "spark.sql.adaptive.enabled": "true",
                    "spark.sql.adaptive.coalescePartitions.enabled": "true"
                }
            }
        ]
    }
)

# Databricks: use spot instances for job clusters with fallback to on-demand
job_cluster_config = {
    "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.2xlarge",
        "num_workers": 10,
        "aws_attributes": {
            "availability": "SPOT_WITH_FALLBACK",  # try spot first, fall back to on-demand
            "spot_bid_price_percent": 100,          # bid up to 100% of on-demand price
            "first_on_demand": 1                    # keep driver on on-demand for stability
        }
    }
}

FinOps for Data Platforms

Tagging Strategy

Without tags, cost reports show total spend but not who spent it or why. A consistent tagging strategy enables per-team and per-project cost attribution:

import boto3

# Apply consistent tags to all data platform resources
resource_tags = [
    {"Key": "team",        "Value": "data-platform"},
    {"Key": "project",     "Value": "churn-model-v2"},
    {"Key": "environment", "Value": "production"},
    {"Key": "cost-center", "Value": "ml-eng"},
    {"Key": "owner",       "Value": "[email protected]"}
]

# Tag S3 buckets
s3 = boto3.client("s3")
s3.put_bucket_tagging(
    Bucket="my-data-lake",
    Tagging={"TagSet": resource_tags}
)

# Tag Glue jobs
glue = boto3.client("glue")
glue.tag_resource(
    ResourceArn="arn:aws:glue:us-east-1:123456789:job/feature-engineering",
    TagsToAdd={t["Key"]: t["Value"] for t in resource_tags}
)

-- Snowflake: tag warehouses and use query tags for attribution
ALTER WAREHOUSE ml_feature_wh SET TAG (
    team = 'ml-engineering',
    project = 'churn-model',
    cost_center = 'ml-eng'
);

-- Tag individual queries for granular attribution
ALTER SESSION SET QUERY_TAG = '{"team":"ml","project":"churn-v2","env":"prod","owner":"alice"}';

Budget Alerts

import boto3

budgets = boto3.client("budgets", region_name="us-east-1")

# Create a monthly budget with alerts
budgets.create_budget(
    AccountId="123456789",
    Budget={
        "BudgetName": "ml-data-platform-monthly",
        "BudgetLimit": {"Amount": "65000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
        "CostFilters": {
            "TagKeyValue": ["user:team$data-platform", "user:team$ml-engineering"]
        }
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80,   # alert at 80% of budget
                "ThresholdType": "PERCENTAGE"
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "[email protected]"}
            ]
        },
        {
            "Notification": {
                "NotificationType": "FORECASTED",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 100,  # alert if forecast exceeds budget
                "ThresholdType": "PERCENTAGE"
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "[email protected]"}
            ]
        }
    ]
)

Showback vs. Chargeback

Showback: report costs to teams without billing them. Teams see what they cost, but there is no financial consequence. Useful for building awareness.

Chargeback: actually bill costs to team budgets. Creates strong incentives for cost-conscious behavior but requires more governance overhead (disputed charges, budget allocation processes).

Most organizations start with showback and move to chargeback once teams have built cost awareness and optimization habits. The transition often produces a 20-30% cost reduction as teams suddenly have financial skin in the game.

Platform Consolidation

Running multiple data platforms for the same data is expensive in ways that are not obvious:

Duplicate Cost	Example	Hidden Cost
Storage duplication	Same 5 TB in Snowflake, BigQuery, and S3	3x storage billing
Compute duplication	3 feature engineering jobs for same features	3x compute cost
Operational overhead	3 teams maintaining similar pipelines	Engineering time
Governance overhead	Access control in 3 separate systems	Security risk + audit cost
Inconsistency	Three sources of truth diverge over time	ML model reliability

The lakehouse pattern (S3/GCS + Iceberg/Delta + one query engine) often costs less than a warehouse + lake + feature store running separately - even if the warehouse and lake seem "free" because they were already bought.

A consolidation analysis:

# Simple consolidation ROI calculation
class PlatformCostModel:
    def __init__(self, name: str, monthly_costs: dict):
        self.name = name
        self.costs = monthly_costs

    @property
    def total(self) -> float:
        return sum(self.costs.values())

# Current state: fragmented
current = [
    PlatformCostModel("Snowflake Warehouse", {
        "compute": 45_000, "storage": 8_000, "credits": 12_000
    }),
    PlatformCostModel("AWS Glue + S3", {
        "glue_dpu": 12_000, "s3_storage": 3_500, "data_transfer": 1_500
    }),
    PlatformCostModel("Databricks (feature store only)", {
        "dbu": 18_000, "ec2": 7_000
    }),
    PlatformCostModel("Operations overhead (engineering hours)", {
        "ops": 25_000  # 2 engineers * $12,500/month allocated
    })
]

# Proposed: unified lakehouse
proposed = [
    PlatformCostModel("Databricks Unified Lakehouse", {
        "dbu": 35_000, "ec2": 15_000, "s3_storage": 6_000
    }),
    PlatformCostModel("Operations overhead", {
        "ops": 12_500  # 1 engineer (unified platform is simpler)
    })
]

current_total  = sum(p.total for p in current)
proposed_total = sum(p.total for p in proposed)

print(f"Current monthly cost:  ${current_total:,.0f}")
print(f"Proposed monthly cost: ${proposed_total:,.0f}")
print(f"Monthly savings:       ${current_total - proposed_total:,.0f}")
print(f"Annual savings:        ${(current_total - proposed_total) * 12:,.0f}")

Cost Optimization Checklist

#	Action	Typical Savings	Effort
1	Set auto-suspend to 60 seconds on all batch warehouses	10-20%	Low
2	Add partition filters to all queries on partitioned tables	30-50% on scan costs	Medium
3	Replace `SELECT *` with explicit column selection	20-40% on scan costs	Low
4	Enable result caching (on by default in Snowflake/BigQuery)	5-15%	None
5	Create materialized views for expensive repeated queries	40-80% on those queries	Medium
6	Apply S3 lifecycle policies (IA after 30d, Glacier after 90d)	30-60% on cold storage	Low
7	Run VACUUM on Delta tables monthly	10-30% on storage	Low
8	Reduce Snowflake time travel retention in dev (1 day)	5-15%	Low
9	Audit duplicate feature pipelines	20-40% on compute	High
10	Use spot/preemptible for all batch ML jobs	60-80% on those jobs	Medium
11	Apply cluster policies to enforce auto-termination	15-25%	Low
12	Tag all resources for cost attribution	Enables #13-15	Low
13	Implement per-team cost showback reports	10-20% (awareness)	Medium
14	Set budget alerts at 80% and 100% of target	Prevents overruns	Low
15	Right-size warehouses based on query duration analysis	10-20%	Medium
16	Delete unused Athena query result files weekly	2-5%	Low
17	Move BigQuery archival tables to native tables (no DML)	Auto 50% storage discount	None
18	Consolidate redundant data platforms	20-40%	High
19	Use APPROX functions for exploratory analysis	20-50% on exploration	Low
20	Implement chargeback (after showback)	20-30% additional	High

:::danger Cost Monitoring Without Cost Accountability Changes Nothing Showback reports that nobody reads, budget alerts that go to a shared Slack channel nobody monitors, and cost dashboards that update monthly - none of these change behavior. Cost optimization requires a named owner for each cost center, a weekly cost review meeting, and a direct escalation path when costs exceed budget. Without accountability, even the best monitoring infrastructure produces no savings. :::

:::warning Premature Optimization Is As Dangerous as No Optimization Aggressively tuning query performance before a pipeline is validated wastes engineering time. Optimize queries that have been proven to run in production on real data at real scale. A query that runs once per day and costs $5 is not worth 2 days of optimization. A query that runs 200 times per day and costs$ 50 per run is worth significant effort. :::

Interview Q&A

Q1: A company's Snowflake bill doubled in 6 months. What are the most likely causes and how do you diagnose them?

Start with the WAREHOUSE_METERING_HISTORY view in snowflake.account_usage:

SELECT warehouse_name,
       SUM(credits_used_compute) AS credits,
       ROUND(SUM(credits_used_compute) * 3.0, 2) AS cost_usd
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(MONTH, -6, CURRENT_TIMESTAMP())
GROUP BY 1 ORDER BY 2 DESC;

Then look at query history to find the most expensive queries:

SELECT query_text, warehouse_name, warehouse_size, total_elapsed_time/1000 AS seconds
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(MONTH, -1, CURRENT_TIMESTAMP())
  AND credits_used_cloud_services > 0.1
ORDER BY credits_used_cloud_services DESC
LIMIT 20;

Most likely causes for a 2x increase: a new team started using a shared warehouse without auto-suspend (idle credits), a new ETL pipeline runs full-table scans (missing partition filters), auto-suspend was accidentally removed from a warehouse, a new BI tool is materializing dashboards by running queries every 5 minutes on a large warehouse, or storage costs increased because time travel retention was extended.

Q2: Explain the difference between showback and chargeback. Which do you implement first and why?

Showback gives teams visibility into what they cost without any financial consequence - it is informational. Reports show "your team spent $15,000 on Snowflake last month," but the bill still goes to a central budget.

Chargeback actually allocates costs to team budgets - the ML team's cloud budget decreases by $15,000. This creates real incentives: teams optimize because optimization saves their budget, not just the company's abstract total.

You implement showback first because: (1) you need accurate cost attribution infrastructure (tagging, resource monitors) regardless of which model you use, (2) showback reveals the data you need to make chargebacks fair and unambiguous, and (3) it gives teams time to build cost-aware habits before facing financial consequences for costs they may not yet understand how to control.

The transition to chargeback typically happens 3-6 months after showback when teams have visibility into their costs and have started optimizing. The announcement of "starting next quarter, your team's cloud budget will be charged actual usage" reliably produces another 20-30% cost reduction as teams treat cloud resources like a real budget line item.

Q3: How do spot instances work and why are they appropriate for ML training but not for BI dashboards?

AWS Spot instances are spare EC2 capacity available at 60-90% discount compared to on-demand pricing. The trade-off: AWS can reclaim spot instances with a 2-minute warning when capacity is needed for on-demand customers.

ML training jobs handle this well because: (1) training pipelines can checkpoint to S3 periodically, (2) an interruption means resuming from the last checkpoint rather than restarting from zero, (3) training jobs are long-running (hours), so the cost savings are large in absolute terms, and (4) the 2-minute warning is enough time to save a checkpoint.

BI dashboards cannot use spot instances because: (1) a query that gets interrupted mid-execution fails - there is no checkpoint, (2) users expect sub-second response times for dashboard loads, (3) interruptions happen randomly and can occur during the CEO's board presentation, and (4) the query duration is seconds (not hours), so the cost savings from spot pricing are trivial in absolute terms.

The rule: use spot for batch (training, ETL, feature engineering), use on-demand for interactive (dashboards, APIs, serving endpoints).

Q4: What is a Snowflake resource monitor and how do you use it to prevent cost overruns?

A resource monitor sets a credit quota for a warehouse, a set of warehouses, or the entire account. When usage reaches configurable thresholds, Snowflake can notify via email, suspend running queries, or suspend the warehouse entirely.

CREATE RESOURCE MONITOR ml_team_monthly
    WITH CREDIT_QUOTA = 500     -- 500 credits/month for the ML team
    FREQUENCY = MONTHLY
    START_TIMESTAMP = IMMEDIATELY
    TRIGGERS
        ON 60 PERCENT DO NOTIFY           -- email at 60%
        ON 80 PERCENT DO NOTIFY           -- email at 80%
        ON 90 PERCENT DO SUSPEND          -- suspend warehouse at 90% (new queries queue)
        ON 100 PERCENT DO SUSPEND_IMMEDIATE;  -- immediately kill running queries at 100%

ALTER WAREHOUSE ml_feature_wh SET RESOURCE_MONITOR = ml_team_monthly;
ALTER WAREHOUSE ml_dev_wh     SET RESOURCE_MONITOR = ml_team_monthly;

Resource monitors are the Snowflake equivalent of AWS Budget Alerts - they give cost control at the warehouse level rather than account level. Best practice: one resource monitor per team or cost center, with alert recipients being both the team lead and the data platform team.

Q5: A data scientist asks why their query costs $50 on BigQuery when it only reads 3 columns. How do you explain this and what do you recommend?

The question reveals a misunderstanding of how columnar charging works. BigQuery charges based on bytes scanned, not rows or columns accessed - but only for the columns included in the SELECT or WHERE clause, thanks to columnar storage.

If the query costs $50 (10 TB scanned), the issue is almost certainly one of two things:

Missing partition filter: the query does not filter on the partition column (typically a date), so BigQuery scans all partitions. Even though only 3 columns are read, 3 columns across 2 years of data is still 10 TB.

The 3 columns are very wide: if each of the 3 columns is a 1 KB JSON blob and the table has 10 billion rows, 3 columns still yields 30 TB of scan.

The fix for the partition filter case:

-- Expensive: scans all partitions
SELECT user_id, event_type, revenue
FROM events.user_events
WHERE event_type = 'purchase';  -- no date filter

-- Cheap: scans only one partition
SELECT user_id, event_type, revenue
FROM events.user_events
WHERE event_date = '2026-03-11'  -- partition filter
  AND event_type = 'purchase';

Use --dry_run mode before running any query against large tables to check bytes scanned before incurring cost.

Why This Exists - The Invisible Tax of Cloud Infrastructure​

The Five Categories of Cloud Data Waste​

Category 1 - Idle Compute​

Snowflake: Aggressive Auto-Suspend​

BigQuery: Committed vs. On-Demand Cost Modeling​

Databricks: Cluster Auto-Termination Enforcement​

Category 2 - Inefficient Queries​

Partition Filter Enforcement​

Column Selection​

Materialized Views for Expensive Repeated Queries​

Approximate Functions for Exploratory Work​

Category 3 - Wrong Storage Tier​

S3 Lifecycle Policies​

VACUUM for Delta Lake and Iceberg​

BigQuery Long-Term Storage​

Category 4 - Data Duplication​

Audit Duplicate Feature Pipelines​

Canonical Dataset Strategy​

Category 5 - Over-Provisioning​

Warehouse Right-Sizing in Snowflake​

Spot / Preemptible Instances for Batch ML​

FinOps for Data Platforms​

Tagging Strategy​

Budget Alerts​

Showback vs. Chargeback​

Platform Consolidation​

Cost Optimization Checklist​

Interview Q&A​