Skip to main content

:::tip 🎮 Interactive Playground Visualize this concept: Try the Data Lakehouse Architecture demo on the EngineersOfAI Playground - no code required. :::

Snowflake for ML

A data science team was running feature engineering queries on a shared 8-core PostgreSQL instance. The queries involved window functions over 400 million user events, self-joins to compute rolling averages, and several multi-step CTEs for labeling. Each run took 45 minutes. The instance was shared with production applications - running during peak hours caused application timeouts. Running overnight meant waking up at 6 AM to check whether the job had completed.

The team moved to Snowflake. The same query ran in 23 seconds on an X-Large virtual warehouse. They configured auto-suspend to shut the warehouse down after 60 seconds of inactivity. The total compute cost for that query was under $0.10 - they paid for 23 seconds of compute instead of maintaining an always-on server that also interfered with production traffic.

The unlock was not raw speed. It was the architecture: Snowflake separates storage from compute completely. The data lives in cloud object storage (S3, Azure Blob, or GCS). The compute is a fleet of virtual machines - called a virtual warehouse - that Snowflake spins up on demand, runs the query, and shuts down. You can have ten different teams each running their own isolated warehouse against the same underlying data, with no resource contention between them.

For ML teams, this architectural choice has profound implications. Feature engineering pipelines that had to be scheduled during off-peak hours can now run anytime. Dev, staging, and production environments can each have their own warehouse sizing. Experimental workloads can run in parallel without impacting dashboards. And with Snowflake's Python-native Snowpark API, you can run ML feature transformations and even model training directly inside Snowflake - no data movement required.

This lesson covers the Snowflake architecture in depth, explains how virtual warehouses work, walks through the Snowpark ML API for feature engineering and model training, and covers the platform features - time travel, zero-copy cloning, and data sharing - that make Snowflake particularly well-suited for ML workflows.


Why This Exists - The Shared Server Problem

Before managed cloud data warehouses, analytics and ML teams had two bad options:

Option 1: Share the production database. Run analytical queries against the same PostgreSQL or MySQL instance that powers the application. Works until it doesn't. An expensive GROUP BY locks a table and crashes your checkout flow on Black Friday.

Option 2: Build a dedicated analytics database. Stand up a separate server, replicate data from production, run queries there. Now you have two databases to maintain, replication lag to manage, ETL jobs to monitor, and a server that costs money whether you use it or not.

Both options fail to separate two fundamentally different concerns: data storage (which is static, durable, and needs to be consistent) and data computation (which is bursty, concurrent, and needs to scale independently).

Snowflake's founders recognized this in 2012 and built the platform around a simple insight: store data in cloud object storage (which is already cheap, durable, and infinitely scalable) and separate the query engines completely. Every query runs on its own elastically-allocated compute cluster. Storage costs scale with data volume. Compute costs scale with query complexity and concurrency. Neither interferes with the other.


Snowflake Architecture

The Three Layers

Cloud Storage Layer stores all data as columnar micro-partitions - typically 50–500 MB compressed files in Snowflake's internal format (based on Parquet). Data is compressed, encrypted, and stored in cloud object storage that Snowflake manages on your behalf. You never see or touch the underlying files.

Virtual Warehouse Layer is where queries actually run. A virtual warehouse is a fleet of EC2 instances (or equivalent) that Snowflake provisions when a query arrives. The warehouse reads micro-partitions from cloud storage, executes the query, and returns results. When idle, it can suspend - releasing all compute resources and stopping billing.

Cloud Services Layer is Snowflake's brain. It handles authentication, query optimization, metadata management, and result caching. Critically, it caches query results for 24 hours - identical queries served from cache are instant and cost nothing.


Virtual Warehouses - Sizing and Configuration

A virtual warehouse is sized on a T-shirt scale: XS, S, M, L, XL, 2XL, 3XL, 4XL, 5XL, 6XL. Each size doubles the compute resources and the cost per credit consumed.

SizevCores (approx)Credits/HourBest For
XS1 server1Development, small BI queries
S2 servers2Small teams, moderate query load
M4 servers4Production dashboards
L8 servers8Heavy ETL, large aggregations
XL16 servers16Feature engineering, complex joins
2XL32 servers32Large-scale ML transformations
4XL128 servers128Rare - very large batch jobs

A Snowflake credit costs approximately 2.002.00–4.00 depending on your cloud region and contract. An XL warehouse running for one hour costs roughly 3232–64.

:::tip Right-Sizing for ML Workloads For ML feature engineering, start with an XL warehouse and measure query duration. If a query takes less than 60 seconds on XL, try M or L - a smaller warehouse costs the same if the query finishes in proportionally less time. The cost formula is (credits/hour * hours_used), so a 2-minute XL query costs the same as a 4-minute L query (same credit-minutes consumed). :::

Auto-Suspend and Auto-Resume

Every warehouse should be configured with auto-suspend. The default is 10 minutes - which means you pay for 10 minutes of idle compute every time your job finishes. For bursty ML workloads, set this to 60 seconds:

-- Create a warehouse with aggressive auto-suspend
CREATE WAREHOUSE ml_feature_wh
WAREHOUSE_SIZE = 'X-LARGE'
AUTO_SUSPEND = 60 -- suspend after 60 seconds of inactivity
AUTO_RESUME = TRUE -- resume automatically when a query arrives
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 3; -- multi-cluster for concurrency

Auto-resume means the warehouse starts automatically when a query is submitted. The cold-start time for a suspended warehouse is typically 2–5 seconds.

Multi-Cluster Warehouses

A standard warehouse queues concurrent queries - the second query waits for the first. Multi-cluster warehouses solve this by spawning additional clusters when queue depth grows:

ALTER WAREHOUSE dashboard_wh SET
MIN_CLUSTER_COUNT = 1 -- one cluster at rest
MAX_CLUSTER_COUNT = 5 -- scale to 5 clusters under load
SCALING_POLICY = 'STANDARD'; -- STANDARD or ECONOMY

For ML feature pipelines that run in parallel (multiple experiments simultaneously), multi-cluster warehouses prevent bottlenecks.


Feature Engineering in Snowflake SQL

Snowflake SQL is ANSI-compliant with significant extensions for analytics. Here is a complete example of building features for a user churn prediction model:

-- User-level features for churn prediction
-- Computes rolling window features, recency, frequency, monetary value

CREATE OR REPLACE TABLE features.user_churn_features AS
WITH
-- Base event aggregations
event_agg AS (
SELECT
user_id,
DATE(event_timestamp) AS event_date,
COUNT(*) AS daily_events,
COUNT(DISTINCT session_id) AS daily_sessions,
SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS daily_purchases,
SUM(CASE WHEN event_type = 'purchase' THEN revenue ELSE 0 END) AS daily_revenue
FROM raw.user_events
WHERE event_timestamp >= DATEADD(DAY, -90, CURRENT_DATE())
GROUP BY 1, 2
),

-- Rolling window features (7, 14, 30 day)
rolling_features AS (
SELECT
user_id,
event_date,
-- 7-day rolling sum
SUM(daily_events) OVER (PARTITION BY user_id ORDER BY event_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS events_7d,
SUM(daily_revenue) OVER (PARTITION BY user_id ORDER BY event_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS revenue_7d,
-- 30-day rolling sum
SUM(daily_events) OVER (PARTITION BY user_id ORDER BY event_date
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS events_30d,
SUM(daily_revenue) OVER (PARTITION BY user_id ORDER BY event_date
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS revenue_30d,
-- Days since last purchase (recency)
DATEDIFF(DAY,
MAX(CASE WHEN daily_purchases > 0 THEN event_date END)
OVER (PARTITION BY user_id ORDER BY event_date
ROWS BETWEEN 90 PRECEDING AND CURRENT ROW),
event_date) AS days_since_purchase
FROM event_agg
),

-- User-level profile join
user_profile AS (
SELECT
user_id,
account_age_days,
subscription_tier,
country
FROM dim.users
)

-- Final feature set for the most recent date
SELECT
rf.user_id,
rf.events_7d,
rf.events_30d,
SAFE_DIVIDE(rf.events_7d, NULLIF(rf.events_30d, 0)) AS event_recency_ratio,
rf.revenue_7d,
rf.revenue_30d,
rf.days_since_purchase,
up.account_age_days,
up.subscription_tier,
up.country
FROM rolling_features rf
JOIN user_profile up USING (user_id)
WHERE rf.event_date = DATEADD(DAY, -1, CURRENT_DATE())
;

This query runs across 400 million rows in under 30 seconds on an XL warehouse.


Python UDFs and UDAFs

When SQL is not expressive enough - for example, computing custom distance metrics, parsing complex strings, or running per-row model inference - Snowflake supports Python UDFs:

# Register a Python UDF for TF-IDF feature extraction
# Run this in a Snowflake Python worksheet or via Snowpark

from snowflake.snowpark import Session
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import FloatType, StringType, ArrayType

session = Session.builder.configs({
"account": "your_account",
"user": "your_user",
"password": "your_password",
"database": "ML_DB",
"schema": "FEATURES",
"warehouse": "ML_FEATURE_WH"
}).create()

# UDF: compute cosine similarity between two vectors stored as JSON arrays
@udf(name="cosine_similarity",
is_permanent=True,
stage_location="@ml_stage",
packages=["numpy"],
return_type=FloatType(),
input_types=[StringType(), StringType()])
def cosine_similarity(vec_a_json: str, vec_b_json: str) -> float:
import json, numpy as np
a = np.array(json.loads(vec_a_json))
b = np.array(json.loads(vec_b_json))
denom = np.linalg.norm(a) * np.linalg.norm(b)
if denom == 0:
return 0.0
return float(np.dot(a, b) / denom)
-- Use the UDF in SQL
SELECT
product_id,
cosine_similarity(product_embedding, query_embedding) AS similarity_score
FROM product_embeddings
CROSS JOIN (SELECT embedding AS query_embedding FROM search_queries WHERE query_id = 42)
ORDER BY similarity_score DESC
LIMIT 20;

User-Defined Aggregate Functions (UDAFs) let you implement custom aggregations in Python - for example, a custom median with outlier clipping:

from snowflake.snowpark.functions import udaf
from snowflake.snowpark.types import FloatType, ListType

@udaf(name="clipped_median",
is_permanent=True,
stage_location="@ml_stage",
packages=["numpy"],
return_type=FloatType(),
input_types=[FloatType()])
class ClippedMedian:
def __init__(self):
self._values = []

@property
def aggregate_state(self):
return self._values

def accumulate(self, value):
if value is not None and -1e6 < value < 1e6: # clip outliers
self._values.append(value)

def merge(self, other_state):
self._values.extend(other_state)

def finish(self):
import numpy as np
return float(np.median(self._values)) if self._values else None

Snowpark - Python DataFrames Inside Snowflake

Snowpark is Snowflake's Python DataFrame API. Unlike pandas (which downloads data to your laptop), Snowpark DataFrames are lazy - they translate your Python code into Snowflake SQL and run it inside the Snowflake warehouse. Data never leaves Snowflake.

from snowflake.snowpark import Session
from snowflake.snowpark import functions as F
from snowflake.snowpark.types import DoubleType

session = Session.builder.configs(connection_params).create()

# Load a Snowflake table as a Snowpark DataFrame (no data downloaded yet)
df = session.table("raw.user_events")

# All transformations are lazy - they build a query plan
feature_df = (
df
.filter(F.col("event_timestamp") >= F.dateadd("day", F.lit(-90), F.current_date()))
.group_by("user_id")
.agg(
F.count("*").alias("total_events"),
F.sum(F.when(F.col("event_type") == "purchase", 1).otherwise(0)).alias("total_purchases"),
F.sum("revenue").alias("total_revenue"),
F.max("event_timestamp").alias("last_event_ts"),
F.countDistinct("session_id").alias("unique_sessions")
)
.with_column("avg_revenue_per_session",
F.col("total_revenue") / F.nullif(F.col("unique_sessions"), F.lit(0)))
)

# Write back to Snowflake - this triggers actual execution
feature_df.write.mode("overwrite").save_as_table("features.user_features_snowpark")
print(f"Written {feature_df.count()} rows to features.user_features_snowpark")

The Snowpark DataFrame API covers most pandas operations: filtering, grouping, joining, window functions, pivoting. For anything not in the API, you can fall back to session.sql("...").


Snowpark ML - Training Models Inside Snowflake

Snowflake's Snowpark ML library lets you train scikit-learn-compatible models inside Snowflake using Snowpark Stored Procedures. The key benefit: training data never leaves Snowflake.

# Full Snowpark ML pipeline: feature engineering → training → registration
# All runs inside a Snowflake Stored Procedure

from snowflake.snowpark import Session
from snowflake.snowpark import functions as F
from snowflake.ml.modeling.preprocessing import StandardScaler, LabelEncoder
from snowflake.ml.modeling.ensemble import GradientBoostingClassifier
from snowflake.ml.registry import Registry
import pandas as pd

def train_churn_model(session: Session) -> str:
"""
Stored procedure that trains a churn model entirely inside Snowflake.
Returns the registered model version name.
"""
# 1. Load pre-engineered features
feature_df = session.table("features.user_churn_features")

# 2. Define feature and label columns
feature_cols = [
"EVENTS_7D", "EVENTS_30D", "EVENT_RECENCY_RATIO",
"REVENUE_7D", "REVENUE_30D", "DAYS_SINCE_PURCHASE", "ACCOUNT_AGE_DAYS"
]
label_col = "CHURNED"

# 3. Train/test split (Snowpark ML handles this in-warehouse)
train_df, test_df = feature_df.random_split([0.8, 0.2], seed=42)

# 4. Scale features
scaler = StandardScaler(
input_cols=feature_cols,
output_cols=[f"{c}_SCALED" for c in feature_cols]
)
train_df = scaler.fit(train_df).transform(train_df)
test_df = scaler.transform(test_df)

scaled_cols = [f"{c}_SCALED" for c in feature_cols]

# 5. Train gradient boosting classifier
model = GradientBoostingClassifier(
input_cols=scaled_cols,
label_cols=[label_col],
output_cols=["PREDICTED_CHURN", "CHURN_PROBABILITY"],
n_estimators=200,
max_depth=6,
learning_rate=0.05,
random_state=42
)
model.fit(train_df)

# 6. Evaluate
predictions = model.predict(test_df)
metrics = predictions.select(
F.avg(
(F.col("PREDICTED_CHURN") == F.col(label_col)).cast("integer")
).alias("accuracy")
).collect()
accuracy = metrics[0]["ACCURACY"]
print(f"Model accuracy: {accuracy:.4f}")

# 7. Register model in Snowflake Model Registry
registry = Registry(session=session, database_name="ML_DB", schema_name="MODEL_REGISTRY")
mv = registry.log_model(
model=model,
model_name="churn_classifier",
version_name="v1",
metrics={"accuracy": float(accuracy)},
comment=f"GBM churn model trained on {train_df.count()} rows"
)
return mv.version_name


# Register as a stored procedure and call it
session.sproc.register(
func=train_churn_model,
name="train_churn_model",
is_permanent=True,
stage_location="@ml_stage",
packages=["snowflake-ml-python"],
replace=True
)

# Run the training job
result = session.call("train_churn_model")
print(f"Registered model version: {result}")

Time Travel - ML Experiment Reproducibility

Snowflake Time Travel lets you query any table as it existed at any point in the past (up to 90 days for Enterprise Edition). This is invaluable for ML reproducibility.

-- Query a feature table as it existed 7 days ago
SELECT * FROM features.user_churn_features
AT (OFFSET => -60*60*24*7); -- 7 days ago in seconds

-- Query as of a specific timestamp (e.g., when you started an experiment)
SELECT * FROM features.user_churn_features
AT (TIMESTAMP => '2026-03-01 09:00:00'::TIMESTAMP_TZ);

-- Query as of a specific query ID (exactly the state a previous query saw)
SELECT * FROM features.user_churn_features
AT (STATEMENT => 'your_query_id_here');

-- Restore a table that was accidentally dropped
UNDROP TABLE features.user_churn_features;

-- Restore a table to a previous state (roll back bad ETL)
CREATE OR REPLACE TABLE features.user_churn_features CLONE
features.user_churn_features AT (OFFSET => -3600); -- 1 hour ago

:::tip Time Travel for ML Experiment Logging When you start a training experiment, record the current timestamp. You can always recreate the exact training dataset from Snowflake Time Travel - no need to materialize a snapshot. This is especially useful when debugging why a model performed differently in production.

import snowflake.connector
from datetime import datetime, timezone

# Log experiment metadata
experiment_start = datetime.now(timezone.utc).isoformat()
print(f"Training started at: {experiment_start}")
# Store this in your MLflow or experiment tracking system
# Later: AT(TIMESTAMP => '{experiment_start}') to reproduce the exact data

:::


Zero-Copy Cloning - Free Dev Environments

Zero-copy cloning creates an instant copy of any database, schema, or table. The clone shares the underlying micro-partitions with the original - no data is physically copied. You only pay for new data that diverges from the original.

-- Clone the production ML database for a feature experiment
-- Instant (metadata-only operation), costs nothing upfront
CREATE DATABASE ml_db_experiment_v3 CLONE ml_db_production;

-- Clone a specific schema
CREATE SCHEMA features_dev CLONE features_production;

-- Clone a table for testing a new ETL script
CREATE TABLE user_events_test CLONE raw.user_events;

-- Now run destructive transformations without touching production
UPDATE user_events_test SET revenue = revenue * 1.1; -- safe - isolated clone

The ML use case: teams routinely need isolated copies of the feature store for:

  • Testing a new feature engineering pipeline without impacting production
  • Running a data quality audit on a snapshot
  • Providing each ML engineer their own feature store sandbox
  • Creating a "golden dataset" clone for model evaluation

With zero-copy cloning, each of these environments costs nothing until data diverges. A clone of a 10 TB feature store is instant and costs $0.00 upfront.

:::warning Clone Storage Costs Clones are free at creation, but as you modify data in the clone, those modified micro-partitions are charged as new storage. If you clone a 10 TB table and run an UPDATE that touches 100% of rows, you now have 20 TB of storage. Be intentional about when you clone vs. when you create a view. :::


Data Sharing - Cross-Team Feature Distribution

Snowflake's data sharing lets one account share live data with another account - no copying, no ETL, no latency. The consumer sees the data in real time as the provider updates it.

-- Provider side: create a share and add objects
CREATE SHARE feature_store_share;

GRANT USAGE ON DATABASE ml_db TO SHARE feature_store_share;
GRANT USAGE ON SCHEMA ml_db.features TO SHARE feature_store_share;
GRANT SELECT ON TABLE ml_db.features.user_churn_features TO SHARE feature_store_share;
GRANT SELECT ON TABLE ml_db.features.product_embeddings TO SHARE feature_store_share;

-- Add the consumer account (another Snowflake account in the org)
ALTER SHARE feature_store_share ADD ACCOUNTS = 'consumer_account_identifier';
-- Consumer side: create a database from the share (read-only, live data)
CREATE DATABASE shared_features FROM SHARE provider_account.feature_store_share;

-- Query shared features directly
SELECT * FROM shared_features.features.user_churn_features
WHERE event_date = CURRENT_DATE();

The ML pattern: a central Data Platform team owns the feature store. They share it with the ML team, the Analytics team, and external partners - each consuming live, consistent features without duplication.


Cost Optimization in Snowflake

Result Cache

Snowflake's cloud services layer caches query results for 24 hours. Identical queries (same SQL text, same warehouse) are served instantly from cache at no compute cost.

-- This query hits the cache if run within 24 hours of the first execution
SELECT user_id, events_30d, revenue_30d
FROM features.user_churn_features
WHERE event_date = '2026-03-11';

-- Check if your query used the cache
SELECT query_id, query_text, execution_status, bytes_scanned, result_from_cache
FROM TABLE(information_schema.query_history())
WHERE result_from_cache = TRUE
ORDER BY start_time DESC
LIMIT 10;

Warehouse Scheduling

For batch ML pipelines, schedule warehouse activity during off-peak hours when Snowflake credits may be cheaper (if on a variable-rate contract):

import snowflake.connector

def run_feature_pipeline(connection_params: dict) -> None:
with snowflake.connector.connect(**connection_params) as conn:
cursor = conn.cursor()

# Use a dedicated, appropriately-sized warehouse
cursor.execute("USE WAREHOUSE ml_batch_wh_xl")

# Run feature engineering
cursor.execute("""
INSERT OVERWRITE INTO features.user_churn_features
SELECT ... FROM raw.user_events
WHERE event_timestamp >= DATEADD(DAY, -90, CURRENT_DATE())
""")

# Immediately suspend the warehouse after the job
cursor.execute("ALTER WAREHOUSE ml_batch_wh_xl SUSPEND")

Storage Cost Tiers

Snowflake charges 2323–46/TB/month for active storage and $4/TB/month for Fail-safe storage (a non-queryable backup for disaster recovery). To reduce costs:

-- Set short data retention for development schemas (reduces time travel storage cost)
ALTER SCHEMA features_dev SET DATA_RETENTION_TIME_IN_DAYS = 1; -- dev: 1 day

-- Set full retention for production ML data
ALTER SCHEMA features_production SET DATA_RETENTION_TIME_IN_DAYS = 14;

-- Check the largest tables in your database
SELECT table_schema, table_name,
ROUND(active_bytes / 1e9, 2) AS active_gb,
ROUND(time_travel_bytes / 1e9, 2) AS time_travel_gb,
ROUND(failsafe_bytes / 1e9, 2) AS failsafe_gb
FROM information_schema.table_storage_metrics
ORDER BY active_bytes DESC
LIMIT 20;

Production Notes

Connection pooling: Snowflake connections are expensive to establish (~1 second). Use snowflake-connector-python with connection pooling for applications that make many short queries:

from snowflake.connector.connection import SnowflakeConnection
from contextlib import contextmanager
import queue, threading

# Simple connection pool for high-frequency feature serving
class SnowflakePool:
def __init__(self, connection_params: dict, pool_size: int = 5):
self._params = connection_params
self._pool = queue.Queue(maxsize=pool_size)
for _ in range(pool_size):
self._pool.put(snowflake.connector.connect(**connection_params))

@contextmanager
def get_connection(self):
conn = self._pool.get()
try:
yield conn
finally:
self._pool.put(conn)

pool = SnowflakePool(connection_params, pool_size=10)

def fetch_user_features(user_ids: list[int]) -> list[dict]:
with pool.get_connection() as conn:
cursor = conn.cursor(snowflake.connector.DictCursor)
cursor.execute(
"SELECT * FROM features.user_churn_features WHERE user_id IN (%s)"
% ",".join(str(i) for i in user_ids)
)
return cursor.fetchall()

**Avoid SELECT ***: Always specify columns. Snowflake's columnar storage skips columns that are not referenced - SELECT * defeats this optimization and increases bytes scanned.

Use LIMIT with query_tag for cost attribution:

-- Tag queries for cost attribution in Snowflake query history
ALTER SESSION SET QUERY_TAG = '{"team": "ml", "project": "churn-v2", "env": "prod"}';

:::danger Never Auto-Resume Production Warehouses with No Size Limit Auto-resume is convenient, but make sure you have resource monitors in place. A runaway query or an infinite loop in a Snowpark job can consume thousands of credits before anyone notices.

-- Create a resource monitor that alerts at $500 and suspends at $1000
CREATE RESOURCE MONITOR ml_cost_guard
WITH CREDIT_QUOTA = 1000
TRIGGERS
ON 50 PERCENT DO NOTIFY
ON 80 PERCENT DO NOTIFY
ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE ml_feature_wh SET RESOURCE_MONITOR = ml_cost_guard;

:::

:::warning Time Travel Does Not Protect Against Schema Changes If you DROP a column and re-add it with a different type, Time Travel will restore the rows but the schema of the historical query reflects the current schema, not the historical one. For true point-in-time schema reproducibility, use zero-copy clones at key experiment milestones. :::


Interview Q&A

Q1: Explain Snowflake's architecture. How is it different from traditional data warehouses like Teradata or Redshift?

Traditional warehouses (Teradata, early Redshift) use a shared-disk or shared-nothing MPP architecture - compute and storage are tightly coupled. Adding more query throughput requires adding more storage nodes; adding more storage requires adding more compute nodes. The two scale together whether you need them to or not.

Snowflake separates storage, compute, and services into three independent layers. Storage is cloud object storage (S3/GCS/Azure Blob) - it scales to petabytes automatically at commodity prices. Compute is virtual warehouses - elastically provisioned, independently sized, auto-suspending. The cloud services layer (metadata, auth, query optimization) is multi-tenant and always available.

The practical implications: multiple teams can run isolated workloads against the same data simultaneously with zero resource contention. Compute scales to zero when idle. Storage and compute are billed independently. This architecture is fundamentally impossible in a shared-nothing MPP system.


Q2: A query takes 8 minutes on a Medium warehouse. Will upgrading to X-Large always make it faster? Explain the trade-offs.

Not necessarily. Snowflake warehouse sizing follows a rule of thumb: upgrading one size doubles compute resources, which typically halves query duration for CPU-bound or data-movement-bound queries.

However, some queries are bottlenecked at the storage I/O layer - they scan large volumes of micro-partitions, and the bottleneck is storage bandwidth, not compute. Doubling compute does not increase storage bandwidth linearly.

Other queries are bottlenecked by poorly written SQL - Cartesian joins, SELECT * on wide tables, missing partition filters. A larger warehouse will execute a bad query faster, but not proportionally. The right fix is query optimization first (add partition filter, eliminate full scans), then right-size the warehouse.

Also consider cost: upgrading from M to XL is 4x the credits per hour. If the query drops from 8 minutes to 2 minutes, the cost is the same (32 credit-minutes). If it only drops to 4 minutes, you just spent twice as much.


Q3: What is Snowpark and why would an ML team use it instead of pandas?

Snowpark is Snowflake's Python DataFrame API that executes computation inside the Snowflake warehouse rather than on the client machine. Snowpark DataFrames are lazy - transformations build a query plan that is sent to Snowflake and executed in the warehouse.

The fundamental advantage for ML teams is eliminating data movement. A typical ML workflow with pandas requires: download training data from Snowflake (30 GB, several minutes), process it in pandas on a local machine or EC2 instance (requires enough RAM to fit the data), write results back to Snowflake. With Snowpark, the entire transformation runs inside Snowflake - no download, no RAM requirements on the client, no network transfer costs.

Additional benefits: Snowpark code is version-controlled Python, easier to test than raw SQL, supports custom UDFs and UDAFs, and integrates with Snowflake Model Registry for end-to-end ML lifecycle management.


Q4: How does Snowflake Time Travel support ML reproducibility? What are its limitations?

Time Travel lets you query any Snowflake table as it existed at any point in the past (up to 90 days for Enterprise, 1 day for Standard). For ML reproducibility, this means: if you record the timestamp when training started, you can always reconstruct the exact dataset that was used - no need to materialize and store training snapshots.

This solves a real problem: feature stores are continuously updated. If a model is retrained daily, the training data on Day 1 and Day 30 may differ significantly. Time Travel lets you audit exactly what data a given model version trained on, supporting debugging and regulatory compliance.

Limitations:

  • Maximum 90 days - historical reproducibility beyond that window requires explicit snapshots
  • Time Travel adds storage costs (you pay for historical versions of micro-partitions)
  • Schema changes are not fully tracked - querying historical data through a current schema may not accurately reflect the original schema
  • DROP TABLE followed by UNDROP restores the table but TIME TRAVEL window resets

Q5: When would you use Snowflake Data Sharing vs. replicating data between databases?

Data Sharing is preferred when: both parties are Snowflake accounts, the data needs to be live and up-to-date, and you want zero ETL overhead. The provider does not copy data - the consumer reads directly from the provider's micro-partitions through a metadata reference. No replication lag, no pipeline maintenance, no duplicate storage cost.

Replication makes sense when: the consumer is not on Snowflake, the consumer needs the data in a different region for latency reasons, or the consumer needs to run workloads that modify the data (Data Sharing is read-only). Snowflake also supports cross-region and cross-cloud account replication for disaster recovery - this is different from Data Sharing (it physically copies data).

For a central feature store pattern, Data Sharing is almost always the right answer: the Data Platform team owns one authoritative feature table, and ML teams, Analytics teams, and downstream services consume it as a live share. This ensures all consumers see the same data without the consistency problems that arise when each team maintains their own copy.


Q6: How do you prevent runaway costs in a Snowflake environment used by multiple ML teams?

Three layers of control:

Resource monitors: set credit quotas at the account, warehouse, or user level. When a threshold is reached, Snowflake can alert, suspend the warehouse, or suspend all queries. Set at the warehouse level so each team's warehouse is independently governed.

Query tags and cost attribution: tag every query with team/project metadata using ALTER SESSION SET QUERY_TAG. Use the QUERY_HISTORY view to produce per-team cost reports and identify expensive queries.

Warehouse policies: each team gets their own warehouse with size limits and auto-suspend. Disable auto-resume for development warehouses - require explicit ALTER WAREHOUSE RESUME before running jobs. This prevents forgotten background jobs from consuming credits indefinitely.

Additionally, enforce column-selective queries (no SELECT *), require partition filters on large tables, and use Snowflake's Query Profile to detect full table scans before they recur.

© 2026 EngineersOfAI. All rights reserved.