Skip to main content

:::tip 🎮 Interactive Playground Visualize this concept: Try the Spark Batch Processing demo on the EngineersOfAI Playground - no code required. :::

Google BigQuery

A startup ran a query scanning 2 TB of raw events. Their Redshift cluster would have taken 8 minutes and cost roughly 0.11thatistheEC2costfor8minutesontheirra3.xlplusnodes.BigQueryranitin4secondsandcharged0.11 - that is the EC2 cost for 8 minutes on their ra3.xlplus nodes. BigQuery ran it in 4 seconds and charged 10. Both numbers are correct. BigQuery's serverless model shines for infrequent large scans run by teams that cannot afford to babysit a cluster - but for predictable, heavy, always-on workloads, the math flips entirely.

That tension - instant, zero-management queries at a premium per-scan price versus efficient reserved capacity for predictable workloads - is the central trade-off of BigQuery. Understanding it is the difference between a 200/monthBigQuerybillanda200/month BigQuery bill and a 20,000/month BigQuery bill for the same data and queries.

This lesson covers BigQuery's internal architecture (the Dremel engine, Colossus storage, and Jupiter network), its pricing models, BigQuery ML for in-SQL model training, advanced query optimization through partitioning and clustering, its deep integration with Vertex AI, and the operational patterns that separate teams spending efficiently from those spending recklessly.


Why This Exists - The On-Demand Analytics Problem

The traditional analytics workflow required provisioning capacity in advance. You estimated peak query load, bought servers or rented cloud VMs sized for that peak, and ran queries against that fixed fleet. The servers ran 24/7. On Monday morning when analysts arrived and fired up 40 concurrent queries, they competed for the same fixed resources. On Saturday night, the servers sat idle at full cost.

Google faced this problem at a scale no other company had encountered. In 2006, internal teams were running ad hoc queries across petabytes of web crawl data, ad click logs, and search indices. There was no commercial product that worked at that scale. The solution was Dremel - a distributed query engine that could fan out a query across thousands of machines, execute it in seconds, and release those machines immediately.

BigQuery is the managed, public version of Dremel. It abstracts away all infrastructure - there are no clusters to provision, no nodes to configure, no auto-scaling groups to set up. You write SQL, submit it, and get results. The infrastructure layer is entirely Google's problem.


BigQuery Architecture

Dremel - The Query Engine

Dremel executes queries using a multi-level serving tree. A root server receives the query and distributes it to intermediate servers, which further distribute to leaf servers. Leaf servers scan actual Colossus data files. Results aggregate back up the tree.

The key insight: Dremel decomposes a query into thousands of smaller operations and assigns them to slots - the unit of compute in BigQuery. Each slot is roughly one vCPU. On-demand queries get up to 2,000 slots. Reservations let you purchase dedicated slot capacity.

Colossus - The Storage Layer

BigQuery stores data in Colossus, Google's distributed file system, in a format called Capacitor - a columnar format similar to Parquet. Colossus automatically handles replication, encryption, and geographic distribution. You never interact with it directly.

The columnar format is critical for BigQuery's pricing model: BigQuery charges for bytes scanned per query. Columnar storage means a query that only touches 3 out of 100 columns scans approximately 3% of the data - a 97% cost reduction compared to row-oriented storage.

Jupiter - The Network

Jupiter is Google's custom network fabric connecting Dremel and Colossus. Its bisection bandwidth exceeds 1 petabit per second. This is what makes it possible for a query to read 2 TB from storage in under 10 seconds - the network is fast enough to stream all that data to compute nodes simultaneously.


Pricing Models

This is the most practically important thing to understand about BigQuery.

On-Demand Pricing

You pay $5 per TB of data scanned by queries. There are no hourly charges for having a dataset - you only pay when queries run.

When it wins: infrequent, ad hoc queries by analysts. A team running 50 queries per day across 100 GB partitions spends roughly $25/day. No cluster to maintain.

When it loses: always-on ETL pipelines scanning terabytes per hour. A pipeline scanning 10 TB per hour, 24 hours a day, costs 1,200/day=1,200/day = 36,000/month on-demand.

Flat-Rate Reservations (Legacy)

Purchase a fixed number of slots (100 slots = ~$2,000/month). Queries consume slots from your reservation rather than on-demand capacity.

Break-even: if your on-demand spend exceeds the flat-rate cost for equivalent slots, switch.

BigQuery Editions (Current)

BigQuery's current pricing model uses three editions - Standard, Enterprise, and Enterprise Plus - each with different slot autoscaling and commitment options.

EditionAutoscaleCommitmentBest For
StandardYes, per-second billingNone requiredModerate, variable workloads
EnterpriseYes, per-second billing1-year or 3-year for discountPredictable heavy workloads
Enterprise PlusYes3-year commitmentLargest workloads, compliance

:::tip Choosing a Pricing Model Start with on-demand. Measure your monthly query costs. If costs are predictable and exceed $2,000–3,000/month, model out whether a reservation pays off. Use the BigQuery Reservations API to set baseline + autoscaling capacity - you get predictable minimums with burst capacity for peaks. :::


Partitioning and Clustering - The Most Impactful Optimization

Partitioning

Partitioned tables split data into segments based on a column value. BigQuery only reads partitions that match the query filter - directly reducing bytes scanned.

-- Create a partitioned table (partition by event date)
CREATE TABLE events.user_events_partitioned (
event_id STRING,
user_id INT64,
event_type STRING,
revenue FLOAT64,
event_date DATE
)
PARTITION BY event_date
OPTIONS (
partition_expiration_days = 365 -- auto-expire old partitions
);

-- This query scans ONLY the 2026-03-11 partition (~5 GB vs 2 TB full table)
SELECT user_id, COUNT(*) AS events
FROM events.user_events_partitioned
WHERE event_date = '2026-03-11' -- partition filter: BigQuery prunes all other partitions
GROUP BY user_id;

-- Always use _PARTITIONTIME or the partition column in WHERE clauses
-- This query accidentally scans ALL partitions (no pruning):
SELECT * FROM events.user_events_partitioned
WHERE DATE(event_timestamp) = '2026-03-11'; -- WRONG: computed column, not the partition col

Partition types:

  • PARTITION BY DATE(column) - time-unit column partitioning (most common)
  • PARTITION BY RANGE_BUCKET(column, GENERATE_ARRAY(0, 100, 10)) - integer range partitioning
  • PARTITION BY _PARTITIONDATE - ingestion-time partitioning (automatic)

Clustering

Clustering sorts data within each partition by the specified columns. BigQuery can skip data blocks that don't match cluster column filters - reducing bytes scanned further within a partition.

-- Partition by date, cluster by user_id and event_type
-- BigQuery will physically sort data within each partition by these columns
CREATE TABLE events.user_events_optimized
PARTITION BY event_date
CLUSTER BY user_id, event_type
AS SELECT * FROM events.user_events_raw;

-- This query benefits from BOTH partition pruning AND cluster pruning:
SELECT revenue, event_type
FROM events.user_events_optimized
WHERE event_date BETWEEN '2026-01-01' AND '2026-03-11' -- partition filter
AND user_id = 12345 -- cluster filter
AND event_type = 'purchase'; -- cluster filter

Cost before vs. after (illustrative):

QueryScanned (no partition/cluster)Scanned (partitioned + clustered)Cost Reduction
Single-day user events2 TB8 GB99.6%
User-specific 30-day history2 TB500 MB99.97%
All events for an event type2 TB50 GB97.5%

BigQuery ML - Training Models in SQL

BigQuery ML lets you train, evaluate, and serve ML models using SQL syntax - no Python required, no data export. Training data stays in BigQuery. This is powerful for SQL-native teams and for models that need to score data at BigQuery scale.

Creating and Training a Model

-- Train a logistic regression model for user churn prediction
CREATE OR REPLACE MODEL `ml_models.churn_classifier`
OPTIONS (
model_type = 'LOGISTIC_REG',
input_label_cols = ['churned'],
l2_reg = 0.1,
max_iterations = 100,
learn_rate_strategy = 'line_search',
data_split_method = 'RANDOM',
data_split_eval_fraction = 0.2
) AS
SELECT
events_7d,
events_30d,
revenue_30d,
days_since_purchase,
account_age_days,
subscription_tier,
churned
FROM `features.user_churn_features`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY);
-- Evaluate model performance
SELECT *
FROM ML.EVALUATE(MODEL `ml_models.churn_classifier`,
(SELECT * FROM `features.user_churn_features`
WHERE event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)));

-- Returns: precision, recall, accuracy, f1_score, roc_auc, log_loss
-- Generate predictions at scale (scores the entire feature table)
SELECT
user_id,
predicted_churned,
predicted_churned_probs[OFFSET(1)].prob AS churn_probability
FROM ML.PREDICT(MODEL `ml_models.churn_classifier`,
(SELECT * FROM `features.user_churn_features`
WHERE event_date = CURRENT_DATE()))
ORDER BY churn_probability DESC;

Supported Model Types in BigQuery ML

Model TypeSQL optionUse Case
Linear regressionLINEAR_REGRevenue prediction, continuous targets
Logistic regressionLOGISTIC_REGBinary/multiclass classification
Boosted trees (XGBoost)BOOSTED_TREE_CLASSIFIER / BOOSTED_TREE_REGRESSORTabular ML, robust baseline
Deep neural networkDNN_CLASSIFIER / DNN_REGRESSORComplex non-linear patterns
AutoML (Vertex AI)AUTOML_CLASSIFIERAutomated model selection
K-means clusteringKMEANSUnsupervised segmentation
Matrix factorizationMATRIX_FACTORIZATIONCollaborative filtering / recommendations
Time series (ARIMA+)ARIMA_PLUSDemand forecasting, anomaly detection
Imported TensorFlowTENSORFLOWCustom TF model served in BigQuery

When to Use BigQuery ML vs. Vertex AI

ConsiderationBigQuery MLVertex AI
Data stays in BigQueryYesNo (must export)
SQL-native teamsExcellentRequires Python
Custom model architecturesLimitedFull flexibility
Training computeBigQuery slotsDedicated GPU/TPU VMs
Production servingBigQuery (high latency)Low-latency REST endpoint
Model explainabilityBuilt-in (ML.GLOBAL_EXPLAIN)Explainable AI on Vertex

Feature Engineering for ML in BigQuery SQL

ARRAY and STRUCT for Nested Features

BigQuery natively supports nested and repeated fields - arrays and structs - without flattening. This is ideal for representing sequence features:

-- User-level features with ARRAY aggregations (session sequences)
SELECT
user_id,
-- Aggregate last 10 session durations as an array (for sequence models)
ARRAY_AGG(session_duration_seconds ORDER BY session_start DESC LIMIT 10)
AS recent_session_durations,
-- Struct of aggregated stats
STRUCT(
COUNT(*) AS total_sessions,
AVG(session_duration_seconds) AS avg_duration,
MAX(session_duration_seconds) AS max_duration
) AS session_stats
FROM events.sessions
WHERE session_start >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY user_id;

Window Functions for Time-Series Features

-- Rolling features with BigQuery window functions
WITH daily_events AS (
SELECT
user_id,
DATE(event_timestamp) AS event_date,
COUNT(*) AS daily_events,
SUM(revenue) AS daily_revenue
FROM events.user_events
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY 1, 2
)

SELECT
user_id,
event_date,
-- Rolling 7-day sum
SUM(daily_events) OVER (
PARTITION BY user_id
ORDER BY event_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS events_7d,

-- Rolling 30-day revenue
SUM(daily_revenue) OVER (
PARTITION BY user_id
ORDER BY event_date
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
) AS revenue_30d,

-- Lag features (yesterday's activity)
LAG(daily_events, 1) OVER (PARTITION BY user_id ORDER BY event_date) AS events_yesterday,

-- Days since last purchase
DATE_DIFF(
event_date,
LAST_VALUE(CASE WHEN daily_revenue > 0 THEN event_date END IGNORE NULLS)
OVER (PARTITION BY user_id ORDER BY event_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
DAY
) AS days_since_purchase

FROM daily_events;

Approximate Aggregations for Scale

For very large datasets where exact counts are too expensive:

-- APPROX functions trade tiny accuracy loss for massive speed gains
SELECT
event_date,
COUNT(DISTINCT user_id) AS exact_unique_users, -- expensive on 10B rows
APPROX_COUNT_DISTINCT(user_id) AS approx_unique_users, -- ~2% error, 10x faster

APPROX_QUANTILES(revenue, 100)[OFFSET(50)] AS median_revenue,
APPROX_QUANTILES(revenue, 100)[OFFSET(95)] AS p95_revenue,
APPROX_QUANTILES(revenue, 100)[OFFSET(99)] AS p99_revenue,

APPROX_TOP_COUNT(event_type, 5) AS top_5_event_types
FROM events.user_events
WHERE event_date = CURRENT_DATE()
GROUP BY event_date;

BigQuery + Vertex AI Integration

For ML workflows that outgrow BigQuery ML's model types, BigQuery integrates natively with Vertex AI.

Pattern 1: Export Training Data from BigQuery to Vertex AI

from google.cloud import bigquery, aiplatform

# Step 1: Export training data from BigQuery to GCS
bq_client = bigquery.Client(project="my-project")

export_job = bq_client.extract_table(
"my-project.features.user_churn_features",
"gs://my-bucket/training-data/churn-features-*.csv",
job_config=bigquery.ExtractJobConfig(
destination_format=bigquery.DestinationFormat.CSV,
print_header=True
)
)
export_job.result()
print("Export complete")

# Step 2: Trigger Vertex AI training job
aiplatform.init(project="my-project", location="us-central1")

job = aiplatform.CustomTrainingJob(
display_name="churn-classifier-v2",
script_path="trainer/train.py",
container_uri="us-docker.pkg.dev/vertex-ai/training/scikit-learn-cpu.1-0:latest",
requirements=["lightgbm==4.3.0"],
model_serving_container_image_uri=(
"us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
)
)

model = job.run(
dataset=None,
model_display_name="churn-classifier-v2",
args=[
"--training-data-uri", "gs://my-bucket/training-data/churn-features-*.csv",
"--n-estimators", "500",
"--learning-rate", "0.05"
],
replica_count=1,
machine_type="n1-standard-8"
)

Pattern 2: Import Vertex AI Predictions Back to BigQuery

# After model is deployed, batch-score from BigQuery and write predictions back

from google.cloud import bigquery, aiplatform
import pandas as pd

# Export scoring data
bq_client = bigquery.Client(project="my-project")
score_df = bq_client.query("""
SELECT user_id, events_7d, events_30d, revenue_30d,
days_since_purchase, account_age_days, subscription_tier
FROM features.user_churn_features
WHERE event_date = CURRENT_DATE()
""").to_dataframe()

# Batch predict via Vertex AI endpoint
endpoint = aiplatform.Endpoint("projects/my-project/locations/us-central1/endpoints/123")
predictions = endpoint.predict(instances=score_df.drop("user_id", axis=1).to_dict("records"))

# Write predictions back to BigQuery
score_df["churn_probability"] = [p["churn_probability"] for p in predictions.predictions]
score_df["scored_at"] = pd.Timestamp.now(tz="UTC")

bq_client.load_table_from_dataframe(
score_df[["user_id", "churn_probability", "scored_at"]],
"my-project.predictions.churn_scores",
job_config=bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")
).result()

BigQuery ML with Remote Models (Vertex AI Integration)

BigQuery now supports remote models - you register a Vertex AI endpoint as a BigQuery model and call it from SQL:

-- Register a Vertex AI endpoint as a BigQuery remote model
CREATE OR REPLACE MODEL `ml_models.churn_vertex_remote`
REMOTE WITH CONNECTION `us.vertex-connection`
OPTIONS (endpoint = 'https://us-central1-aiplatform.googleapis.com/v1/projects/my-project/locations/us-central1/endpoints/123456');

-- Score directly from SQL using the Vertex AI model
SELECT user_id, predicted_label, score
FROM ML.PREDICT(MODEL `ml_models.churn_vertex_remote`,
(SELECT user_id, events_7d, events_30d, revenue_30d
FROM features.user_churn_features
WHERE event_date = CURRENT_DATE()));

Streaming Inserts vs. Storage Write API

Legacy Streaming Inserts

The original BigQuery streaming API inserted rows in real time but was expensive ($0.01 per 200 MB) and had availability limitations. Data appeared within seconds but was not available for DML operations immediately.

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.events.user_events"

# Legacy streaming insert (still works, more expensive)
errors = client.insert_rows_json(table_id, [
{"user_id": 123, "event_type": "purchase", "revenue": 49.99,
"event_timestamp": "2026-03-12T10:30:00Z"}
])
if errors:
raise Exception(f"Insert errors: {errors}")

The Storage Write API is the modern approach - lower cost ($0.025 per GB), transactional semantics, and exactly-once delivery:

from google.cloud.bigquery_storage_v1 import BigQueryWriteClient, types
from google.protobuf import descriptor_pool, descriptor_pb2
import json

# Storage Write API: higher throughput, lower cost, exactly-once semantics
write_client = BigQueryWriteClient()

# Create a write stream
parent = write_client.table_path("my-project", "events", "user_events")
write_stream = types.WriteStream()
write_stream.type_ = types.WriteStream.Type.COMMITTED # immediately visible

response = write_client.create_write_stream(
parent=parent, write_stream=write_stream
)
stream_name = response.name

# Append rows
# (In practice, use the bigquery-storage Python library's AppendRowsStream)
print(f"Stream created: {stream_name}")
FeatureLegacy StreamingStorage Write API
Cost$0.01/200 MB$0.025/GB
ThroughputLowHigh
Exactly-onceNoYes
TransactionsNoYes (pending streams)
RecommendedNoYes

BigQuery Omni - Federated Queries Across Clouds

BigQuery Omni lets you query data in AWS S3 or Azure Blob Storage without copying it to Google Cloud. Compute runs in the other cloud (BigQuery uses Google-managed infrastructure in AWS/Azure regions).

-- Create a BigQuery Omni connection to S3
-- (configured via BigQuery Connection API or Console)

-- Query S3 data from BigQuery (data stays in AWS)
SELECT
event_date,
COUNT(*) AS event_count,
SUM(revenue) AS total_revenue
FROM `my-project.aws_dataset.s3_user_events` -- external table pointing to S3
WHERE event_date >= '2026-01-01'
GROUP BY event_date
ORDER BY event_date;

Use case: a company runs applications on AWS (data in S3) but uses BigQuery and Looker for analytics. BigQuery Omni eliminates the $0.09/GB egress cost of copying data to GCS while still using BigQuery's query engine and BI tools.

Limitations: Omni queries are slower than native BigQuery (network between clouds), do not support all BigQuery features, and are not suitable for latency-sensitive workloads.


Cost Optimization

Check Query Cost Before Running (--dry_run)

from google.cloud import bigquery

client = bigquery.Client()

# Estimate bytes scanned BEFORE running a query
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
query = """
SELECT user_id, COUNT(*) AS events
FROM events.user_events
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY user_id
"""

dry_run_job = client.query(query, job_config=job_config)
bytes_scanned = dry_run_job.total_bytes_processed
cost_usd = bytes_scanned / 1e12 * 5 # $5 per TB

print(f"Bytes scanned: {bytes_scanned / 1e9:.2f} GB")
print(f"Estimated cost: ${cost_usd:.4f}")

Materialized Views for Expensive Repeated Queries

-- Create a materialized view that BigQuery refreshes automatically
CREATE MATERIALIZED VIEW features.daily_user_stats
PARTITION BY event_date
CLUSTER BY user_id
OPTIONS (enable_refresh = true, refresh_interval_minutes = 60)
AS
SELECT
user_id,
DATE(event_timestamp) AS event_date,
COUNT(*) AS total_events,
SUM(revenue) AS total_revenue,
COUNT(DISTINCT session_id) AS unique_sessions
FROM events.user_events
WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
GROUP BY 1, 2;

-- Queries that match this pattern automatically use the materialized view
-- (BigQuery rewrites the query transparently)
SELECT user_id, SUM(total_events)
FROM events.user_events
WHERE DATE(event_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY user_id;
-- ^ BigQuery rewrites this to scan features.daily_user_stats instead

Practical Cost Rules

  1. Always filter on partition column - never compute the date from a non-partition column in the WHERE clause
  2. Select only needed columns - BigQuery charges by bytes scanned, not rows; SELECT * scans all columns
  3. Use approximate functions for exploratory work - APPROX_COUNT_DISTINCT vs COUNT(DISTINCT) can be 10x cheaper
  4. Cache query results - identical queries within 24 hours are free (result caching is on by default)
  5. Use clustering for secondary filters - after partition pruning, clustering further reduces scan within partitions

:::danger SELECT * on BigQuery Is Expensive BigQuery charges per byte scanned. A table with 50 columns and 1 TB of data charges $5 for SELECT * even if you only use 2 columns. Always explicitly name the columns you need. This single habit can reduce BigQuery costs by 30–70% for typical analytical workloads. :::

:::warning Streaming Inserts Are Not Immediately DML-Eligible Data inserted via the legacy streaming API has a small window (minutes to hours) during which it cannot be updated or deleted via DML (UPDATE, DELETE, MERGE). If your pipeline reads newly-inserted data and immediately tries to UPDATE or MERGE it, use the Storage Write API's pending stream mode or add a delay before DML operations. :::


Interview Q&A

Q1: Explain BigQuery's architecture. Why can it run a 2 TB query in 4 seconds?

BigQuery is built on three infrastructure components: Dremel (the distributed query engine), Colossus (the distributed storage layer), and Jupiter (Google's petabit-speed internal network).

When a query arrives, Dremel decomposes it into thousands of smaller operations and distributes them across hundreds to thousands of worker nodes (slots). Each slot reads a small chunk of Colossus-stored data and processes it. Because Jupiter's network bandwidth between compute and storage is massive (petabits per second bisection bandwidth), data can be streamed to thousands of compute workers simultaneously. The result aggregates back up the Dremel tree.

Traditional databases bottleneck on shared storage or network bandwidth. BigQuery's Jupiter network eliminates the network bottleneck. Dremel's parallelism eliminates the compute bottleneck. The result is near-linear horizontal scalability - a query that takes 4 seconds at 2 TB might take 5 seconds at 20 TB (not 40 seconds), because more partitions can be processed in parallel.


Q2: A team complains their BigQuery costs doubled in 3 months without any architecture changes. How do you diagnose and fix this?

Start with the INFORMATION_SCHEMA views:

-- Find the most expensive queries in the last 30 days
SELECT
user_email,
query,
total_bytes_processed / 1e12 AS tb_scanned,
total_bytes_processed / 1e12 * 5 AS estimated_cost_usd,
creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
AND job_type = 'QUERY'
AND statement_type != 'SCRIPT'
ORDER BY total_bytes_processed DESC
LIMIT 20;

Typical root causes: (1) a new analyst running SELECT * without partition filters, (2) a BI tool changed its query template and lost partition pruning, (3) a new table was created without partitioning, (4) a pipeline started re-scanning historical data instead of incremental partitions.

Fix checklist: add partition filters to expensive queries, remove SELECT *, add/rebuild partitioning on hot tables, create materialized views for expensive recurring aggregations.


Q3: When should you use BigQuery ML instead of training a model in Python with scikit-learn or LightGBM?

BigQuery ML makes sense when: your team is SQL-native with limited Python expertise, training data is already in BigQuery and exporting it would take significant time, you need to score data at BigQuery scale without setting up a serving infrastructure, and you need a rapid baseline model for stakeholder buy-in.

BigQuery ML is the wrong choice when: you need custom architectures (transformers, custom loss functions), hyperparameter tuning at scale (BigQuery ML's tuning is limited), sub-second online prediction latency (BigQuery ML scoring has seconds-level latency), or full control over training infrastructure (GPU/TPU, distributed training).

The practical heuristic: use BigQuery ML for SQL analysts building analytical models, use Vertex AI custom training for ML engineers building production models. BigQuery ML's value is eliminating Python and data movement - if you already have a Python workflow and a data movement pipeline, its advantages are smaller.


Q4: Explain partitioning vs. clustering in BigQuery. When does each help?

Partitioning divides a table into physical segments based on a column value - typically a date. BigQuery skips entire partitions that don't match the WHERE clause filter. This is metadata-level pruning - BigQuery doesn't read any data from pruned partitions.

Clustering sorts data within each partition by the specified columns. BigQuery can skip data blocks within a partition that don't contain matching values. This is block-level pruning - it reduces bytes scanned within the partitions that partitioning already selected.

They are complementary. The partition column should be your primary time-based filter (almost always a date or timestamp). Cluster columns should be the additional filters you apply most often - typically user_id, event_type, or similar high-cardinality categoricals.

When partitioning alone helps: queries always filter by date but have no secondary filters. When clustering adds value: queries filter by date AND by secondary columns - the secondary filter further reduces scan within the partition. When clustering alone (without partitioning) makes sense: integer range partitioning + cluster by user_id for user-centric data without a strong time dimension.


Q5: A BigQuery ML model has 85% accuracy in evaluation but performs poorly in production. What are the likely causes?

Training-serving skew is the most common cause. The training query pulls historical labeled data from BigQuery - but in production, the same features may be computed differently, have different null rates, or be computed at a different lag.

In BigQuery ML, the model is defined by a CREATE MODEL ... AS SELECT statement. If the production scoring query (ML.PREDICT) selects features from a different table, with different transformations, or at a different point in time (e.g., training on end-of-day features but scoring on real-time features), the model sees a different distribution than it trained on.

Diagnostic steps: compare the feature distributions of training data vs. production scoring data using APPROX_QUANTILES and COUNT(DISTINCT). Check for columns that are null in production but populated in training. Check for time-based features that are computed relative to event timestamps that shift between training and serving. Use ML.FEATURE_INFO to inspect expected feature statistics and compare to production.


Q6: What is BigQuery Omni and when would you recommend it to a team?

BigQuery Omni is BigQuery's multi-cloud capability - it lets you run BigQuery SQL queries against data stored in AWS S3 or Azure Blob Storage, without copying the data to Google Cloud. The compute runs in Google-managed infrastructure deployed in the same cloud region as your data.

Recommend Omni when: a team's data primarily lives in AWS or Azure (they can't or won't migrate), but they want BigQuery and its connected BI tools (Looker, Connected Sheets). The alternative - replicating data to GCS for BigQuery - incurs data transfer costs and pipeline complexity. Omni eliminates both.

Do not recommend Omni when: query latency is critical (Omni queries are slower than native BigQuery due to cross-cloud network), the workload is high-frequency (Omni adds per-query overhead), or the team needs full BigQuery feature parity (some BigQuery ML and advanced features don't work with Omni external tables).

The business case for Omni is strongest when a company uses Google Workspace and Looker for analytics but runs applications on AWS - common in enterprises that adopted G Suite but built infrastructure on AWS before GCP caught up.

© 2026 EngineersOfAI. All rights reserved.