BigQuery architecture, ML built-in functions, and BigQuery ML.

How does bigquery ml work in practice?

Google BigQuery covers bigquery, bigquery ml, dremel from first principles with code examples. Free lesson at https://engineersofai.com/docs/data-engineering/cloud-data-platforms/google-bigquery

What is the difference between bigquery and dremel?

See the full breakdown at https://engineersofai.com/docs/data-engineering/cloud-data-platforms/google-bigquery

:::tip 🎮 Interactive Playground Visualize this concept: Try the Spark Batch Processing demo on the EngineersOfAI Playground - no code required. :::

Google BigQuery

A startup ran a query scanning 2 TB of raw events. Their Redshift cluster would have taken 8 minutes and cost roughly $0.11 - that is the EC2 cost for 8 minutes on their ra3.xlplus nodes. BigQuery ran it in 4 seconds and charged$ 10. Both numbers are correct. BigQuery's serverless model shines for infrequent large scans run by teams that cannot afford to babysit a cluster - but for predictable, heavy, always-on workloads, the math flips entirely.

That tension - instant, zero-management queries at a premium per-scan price versus efficient reserved capacity for predictable workloads - is the central trade-off of BigQuery. Understanding it is the difference between a $200/month BigQuery bill and a$ 20,000/month BigQuery bill for the same data and queries.

This lesson covers BigQuery's internal architecture (the Dremel engine, Colossus storage, and Jupiter network), its pricing models, BigQuery ML for in-SQL model training, advanced query optimization through partitioning and clustering, its deep integration with Vertex AI, and the operational patterns that separate teams spending efficiently from those spending recklessly.

Why This Exists - The On-Demand Analytics Problem

The traditional analytics workflow required provisioning capacity in advance. You estimated peak query load, bought servers or rented cloud VMs sized for that peak, and ran queries against that fixed fleet. The servers ran 24/7. On Monday morning when analysts arrived and fired up 40 concurrent queries, they competed for the same fixed resources. On Saturday night, the servers sat idle at full cost.

Google faced this problem at a scale no other company had encountered. In 2006, internal teams were running ad hoc queries across petabytes of web crawl data, ad click logs, and search indices. There was no commercial product that worked at that scale. The solution was Dremel - a distributed query engine that could fan out a query across thousands of machines, execute it in seconds, and release those machines immediately.

BigQuery is the managed, public version of Dremel. It abstracts away all infrastructure - there are no clusters to provision, no nodes to configure, no auto-scaling groups to set up. You write SQL, submit it, and get results. The infrastructure layer is entirely Google's problem.

BigQuery Architecture

Dremel - The Query Engine

Dremel executes queries using a multi-level serving tree. A root server receives the query and distributes it to intermediate servers, which further distribute to leaf servers. Leaf servers scan actual Colossus data files. Results aggregate back up the tree.

The key insight: Dremel decomposes a query into thousands of smaller operations and assigns them to slots - the unit of compute in BigQuery. Each slot is roughly one vCPU. On-demand queries get up to 2,000 slots. Reservations let you purchase dedicated slot capacity.

Colossus - The Storage Layer

BigQuery stores data in Colossus, Google's distributed file system, in a format called Capacitor - a columnar format similar to Parquet. Colossus automatically handles replication, encryption, and geographic distribution. You never interact with it directly.

The columnar format is critical for BigQuery's pricing model: BigQuery charges for bytes scanned per query. Columnar storage means a query that only touches 3 out of 100 columns scans approximately 3% of the data - a 97% cost reduction compared to row-oriented storage.

Jupiter - The Network

Jupiter is Google's custom network fabric connecting Dremel and Colossus. Its bisection bandwidth exceeds 1 petabit per second. This is what makes it possible for a query to read 2 TB from storage in under 10 seconds - the network is fast enough to stream all that data to compute nodes simultaneously.

Pricing Models

This is the most practically important thing to understand about BigQuery.

On-Demand Pricing

You pay $5 per TB of data scanned by queries. There are no hourly charges for having a dataset - you only pay when queries run.

When it wins: infrequent, ad hoc queries by analysts. A team running 50 queries per day across 100 GB partitions spends roughly $25/day. No cluster to maintain.

When it loses: always-on ETL pipelines scanning terabytes per hour. A pipeline scanning 10 TB per hour, 24 hours a day, costs $1,200/day =$ 36,000/month on-demand.

Flat-Rate Reservations (Legacy)

Purchase a fixed number of slots (100 slots = ~$2,000/month). Queries consume slots from your reservation rather than on-demand capacity.

Break-even: if your on-demand spend exceeds the flat-rate cost for equivalent slots, switch.

BigQuery Editions (Current)

BigQuery's current pricing model uses three editions - Standard, Enterprise, and Enterprise Plus - each with different slot autoscaling and commitment options.

Edition	Autoscale	Commitment	Best For
Standard	Yes, per-second billing	None required	Moderate, variable workloads
Enterprise	Yes, per-second billing	1-year or 3-year for discount	Predictable heavy workloads
Enterprise Plus	Yes	3-year commitment	Largest workloads, compliance

:::tip Choosing a Pricing Model Start with on-demand. Measure your monthly query costs. If costs are predictable and exceed $2,000–3,000/month, model out whether a reservation pays off. Use the BigQuery Reservations API to set baseline + autoscaling capacity - you get predictable minimums with burst capacity for peaks. :::

Partitioning and Clustering - The Most Impactful Optimization

Partitioning

Partitioned tables split data into segments based on a column value. BigQuery only reads partitions that match the query filter - directly reducing bytes scanned.

-- Create a partitioned table (partition by event date)
CREATE TABLE events.user_events_partitioned (
    event_id      STRING,
    user_id       INT64,
    event_type    STRING,
    revenue       FLOAT64,
    event_date    DATE
)
PARTITION BY event_date
OPTIONS (
    partition_expiration_days = 365  -- auto-expire old partitions
);

-- This query scans ONLY the 2026-03-11 partition (~5 GB vs 2 TB full table)
SELECT user_id, COUNT(*) AS events
FROM events.user_events_partitioned
WHERE event_date = '2026-03-11'      -- partition filter: BigQuery prunes all other partitions
GROUP BY user_id;

-- Always use _PARTITIONTIME or the partition column in WHERE clauses
-- This query accidentally scans ALL partitions (no pruning):
SELECT * FROM events.user_events_partitioned
WHERE DATE(event_timestamp) = '2026-03-11';  -- WRONG: computed column, not the partition col

Partition types:

PARTITION BY DATE(column) - time-unit column partitioning (most common)
PARTITION BY RANGE_BUCKET(column, GENERATE_ARRAY(0, 100, 10)) - integer range partitioning
PARTITION BY _PARTITIONDATE - ingestion-time partitioning (automatic)

Clustering

Clustering sorts data within each partition by the specified columns. BigQuery can skip data blocks that don't match cluster column filters - reducing bytes scanned further within a partition.

-- Partition by date, cluster by user_id and event_type
-- BigQuery will physically sort data within each partition by these columns
CREATE TABLE events.user_events_optimized
PARTITION BY event_date
CLUSTER BY user_id, event_type
AS SELECT * FROM events.user_events_raw;

-- This query benefits from BOTH partition pruning AND cluster pruning:
SELECT revenue, event_type
FROM events.user_events_optimized
WHERE event_date BETWEEN '2026-01-01' AND '2026-03-11'  -- partition filter
  AND user_id = 12345                                     -- cluster filter
  AND event_type = 'purchase';                            -- cluster filter

Cost before vs. after (illustrative):

Query	Scanned (no partition/cluster)	Scanned (partitioned + clustered)	Cost Reduction
Single-day user events	2 TB	8 GB	99.6%
User-specific 30-day history	2 TB	500 MB	99.97%
All events for an event type	2 TB	50 GB	97.5%

BigQuery ML - Training Models in SQL

BigQuery ML lets you train, evaluate, and serve ML models using SQL syntax - no Python required, no data export. Training data stays in BigQuery. This is powerful for SQL-native teams and for models that need to score data at BigQuery scale.

Creating and Training a Model

-- Train a logistic regression model for user churn prediction
CREATE OR REPLACE MODEL `ml_models.churn_classifier`
OPTIONS (
    model_type = 'LOGISTIC_REG',
    input_label_cols = ['churned'],
    l2_reg = 0.1,
    max_iterations = 100,
    learn_rate_strategy = 'line_search',
    data_split_method = 'RANDOM',
    data_split_eval_fraction = 0.2
) AS
SELECT
    events_7d,
    events_30d,
    revenue_30d,
    days_since_purchase,
    account_age_days,
    subscription_tier,
    churned
FROM `features.user_churn_features`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY);

-- Evaluate model performance
SELECT *
FROM ML.EVALUATE(MODEL `ml_models.churn_classifier`,
    (SELECT * FROM `features.user_churn_features`
     WHERE event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)));

-- Returns: precision, recall, accuracy, f1_score, roc_auc, log_loss

-- Generate predictions at scale (scores the entire feature table)
SELECT
    user_id,
    predicted_churned,
    predicted_churned_probs[OFFSET(1)].prob AS churn_probability
FROM ML.PREDICT(MODEL `ml_models.churn_classifier`,
    (SELECT * FROM `features.user_churn_features`
     WHERE event_date = CURRENT_DATE()))
ORDER BY churn_probability DESC;

Supported Model Types in BigQuery ML

Model Type	SQL option	Use Case
Linear regression	`LINEAR_REG`	Revenue prediction, continuous targets
Logistic regression	`LOGISTIC_REG`	Binary/multiclass classification
Boosted trees (XGBoost)	`BOOSTED_TREE_CLASSIFIER` / `BOOSTED_TREE_REGRESSOR`	Tabular ML, robust baseline
Deep neural network	`DNN_CLASSIFIER` / `DNN_REGRESSOR`	Complex non-linear patterns
AutoML (Vertex AI)	`AUTOML_CLASSIFIER`	Automated model selection
K-means clustering	`KMEANS`	Unsupervised segmentation
Matrix factorization	`MATRIX_FACTORIZATION`	Collaborative filtering / recommendations
Time series (ARIMA+)	`ARIMA_PLUS`	Demand forecasting, anomaly detection
Imported TensorFlow	`TENSORFLOW`	Custom TF model served in BigQuery

When to Use BigQuery ML vs. Vertex AI

Consideration	BigQuery ML	Vertex AI
Data stays in BigQuery	Yes	No (must export)
SQL-native teams	Excellent	Requires Python
Custom model architectures	Limited	Full flexibility
Training compute	BigQuery slots	Dedicated GPU/TPU VMs
Production serving	BigQuery (high latency)	Low-latency REST endpoint
Model explainability	Built-in (`ML.GLOBAL_EXPLAIN`)	Explainable AI on Vertex

Feature Engineering for ML in BigQuery SQL

ARRAY and STRUCT for Nested Features

BigQuery natively supports nested and repeated fields - arrays and structs - without flattening. This is ideal for representing sequence features:

-- User-level features with ARRAY aggregations (session sequences)
SELECT
    user_id,
    -- Aggregate last 10 session durations as an array (for sequence models)
    ARRAY_AGG(session_duration_seconds ORDER BY session_start DESC LIMIT 10)
        AS recent_session_durations,
    -- Struct of aggregated stats
    STRUCT(
        COUNT(*)            AS total_sessions,
        AVG(session_duration_seconds) AS avg_duration,
        MAX(session_duration_seconds) AS max_duration
    ) AS session_stats
FROM events.sessions
WHERE session_start >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY user_id;

Window Functions for Time-Series Features

-- Rolling features with BigQuery window functions
WITH daily_events AS (
    SELECT
        user_id,
        DATE(event_timestamp)  AS event_date,
        COUNT(*)               AS daily_events,
        SUM(revenue)           AS daily_revenue
    FROM events.user_events
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
    GROUP BY 1, 2
)

SELECT
    user_id,
    event_date,
    -- Rolling 7-day sum
    SUM(daily_events) OVER (
        PARTITION BY user_id
        ORDER BY event_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS events_7d,

    -- Rolling 30-day revenue
    SUM(daily_revenue) OVER (
        PARTITION BY user_id
        ORDER BY event_date
        ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
    ) AS revenue_30d,

    -- Lag features (yesterday's activity)
    LAG(daily_events, 1) OVER (PARTITION BY user_id ORDER BY event_date) AS events_yesterday,

    -- Days since last purchase
    DATE_DIFF(
        event_date,
        LAST_VALUE(CASE WHEN daily_revenue > 0 THEN event_date END IGNORE NULLS)
            OVER (PARTITION BY user_id ORDER BY event_date
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
        DAY
    ) AS days_since_purchase

FROM daily_events;

Approximate Aggregations for Scale

For very large datasets where exact counts are too expensive:

-- APPROX functions trade tiny accuracy loss for massive speed gains
SELECT
    event_date,
    COUNT(DISTINCT user_id)            AS exact_unique_users,    -- expensive on 10B rows
    APPROX_COUNT_DISTINCT(user_id)     AS approx_unique_users,   -- ~2% error, 10x faster

    APPROX_QUANTILES(revenue, 100)[OFFSET(50)]  AS median_revenue,
    APPROX_QUANTILES(revenue, 100)[OFFSET(95)]  AS p95_revenue,
    APPROX_QUANTILES(revenue, 100)[OFFSET(99)]  AS p99_revenue,

    APPROX_TOP_COUNT(event_type, 5)    AS top_5_event_types
FROM events.user_events
WHERE event_date = CURRENT_DATE()
GROUP BY event_date;

BigQuery + Vertex AI Integration

For ML workflows that outgrow BigQuery ML's model types, BigQuery integrates natively with Vertex AI.

Pattern 1: Export Training Data from BigQuery to Vertex AI

from google.cloud import bigquery, aiplatform

# Step 1: Export training data from BigQuery to GCS
bq_client = bigquery.Client(project="my-project")

export_job = bq_client.extract_table(
    "my-project.features.user_churn_features",
    "gs://my-bucket/training-data/churn-features-*.csv",
    job_config=bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.CSV,
        print_header=True
    )
)
export_job.result()
print("Export complete")

# Step 2: Trigger Vertex AI training job
aiplatform.init(project="my-project", location="us-central1")

job = aiplatform.CustomTrainingJob(
    display_name="churn-classifier-v2",
    script_path="trainer/train.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/scikit-learn-cpu.1-0:latest",
    requirements=["lightgbm==4.3.0"],
    model_serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    )
)

model = job.run(
    dataset=None,
    model_display_name="churn-classifier-v2",
    args=[
        "--training-data-uri", "gs://my-bucket/training-data/churn-features-*.csv",
        "--n-estimators", "500",
        "--learning-rate", "0.05"
    ],
    replica_count=1,
    machine_type="n1-standard-8"
)

Pattern 2: Import Vertex AI Predictions Back to BigQuery

# After model is deployed, batch-score from BigQuery and write predictions back

from google.cloud import bigquery, aiplatform
import pandas as pd

# Export scoring data
bq_client = bigquery.Client(project="my-project")
score_df = bq_client.query("""
    SELECT user_id, events_7d, events_30d, revenue_30d,
           days_since_purchase, account_age_days, subscription_tier
    FROM features.user_churn_features
    WHERE event_date = CURRENT_DATE()
""").to_dataframe()

# Batch predict via Vertex AI endpoint
endpoint = aiplatform.Endpoint("projects/my-project/locations/us-central1/endpoints/123")
predictions = endpoint.predict(instances=score_df.drop("user_id", axis=1).to_dict("records"))

# Write predictions back to BigQuery
score_df["churn_probability"] = [p["churn_probability"] for p in predictions.predictions]
score_df["scored_at"] = pd.Timestamp.now(tz="UTC")

bq_client.load_table_from_dataframe(
    score_df[["user_id", "churn_probability", "scored_at"]],
    "my-project.predictions.churn_scores",
    job_config=bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")
).result()

BigQuery ML with Remote Models (Vertex AI Integration)

BigQuery now supports remote models - you register a Vertex AI endpoint as a BigQuery model and call it from SQL:

-- Register a Vertex AI endpoint as a BigQuery remote model
CREATE OR REPLACE MODEL `ml_models.churn_vertex_remote`
REMOTE WITH CONNECTION `us.vertex-connection`
OPTIONS (endpoint = 'https://us-central1-aiplatform.googleapis.com/v1/projects/my-project/locations/us-central1/endpoints/123456');

-- Score directly from SQL using the Vertex AI model
SELECT user_id, predicted_label, score
FROM ML.PREDICT(MODEL `ml_models.churn_vertex_remote`,
    (SELECT user_id, events_7d, events_30d, revenue_30d
     FROM features.user_churn_features
     WHERE event_date = CURRENT_DATE()));

Streaming Inserts vs. Storage Write API

Legacy Streaming Inserts

The original BigQuery streaming API inserted rows in real time but was expensive ($0.01 per 200 MB) and had availability limitations. Data appeared within seconds but was not available for DML operations immediately.

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.events.user_events"

# Legacy streaming insert (still works, more expensive)
errors = client.insert_rows_json(table_id, [
    {"user_id": 123, "event_type": "purchase", "revenue": 49.99,
     "event_timestamp": "2026-03-12T10:30:00Z"}
])
if errors:
    raise Exception(f"Insert errors: {errors}")

Storage Write API (Recommended)

The Storage Write API is the modern approach - lower cost ($0.025 per GB), transactional semantics, and exactly-once delivery:

from google.cloud.bigquery_storage_v1 import BigQueryWriteClient, types
from google.protobuf import descriptor_pool, descriptor_pb2
import json

# Storage Write API: higher throughput, lower cost, exactly-once semantics
write_client = BigQueryWriteClient()

# Create a write stream
parent = write_client.table_path("my-project", "events", "user_events")
write_stream = types.WriteStream()
write_stream.type_ = types.WriteStream.Type.COMMITTED  # immediately visible

response = write_client.create_write_stream(
    parent=parent, write_stream=write_stream
)
stream_name = response.name

# Append rows
# (In practice, use the bigquery-storage Python library's AppendRowsStream)
print(f"Stream created: {stream_name}")

Feature	Legacy Streaming	Storage Write API
Cost	$0.01/200 MB	$0.025/GB
Throughput	Low	High
Exactly-once	No	Yes
Transactions	No	Yes (pending streams)
Recommended	No	Yes

BigQuery Omni - Federated Queries Across Clouds

BigQuery Omni lets you query data in AWS S3 or Azure Blob Storage without copying it to Google Cloud. Compute runs in the other cloud (BigQuery uses Google-managed infrastructure in AWS/Azure regions).

-- Create a BigQuery Omni connection to S3
-- (configured via BigQuery Connection API or Console)

-- Query S3 data from BigQuery (data stays in AWS)
SELECT
    event_date,
    COUNT(*)    AS event_count,
    SUM(revenue) AS total_revenue
FROM `my-project.aws_dataset.s3_user_events`   -- external table pointing to S3
WHERE event_date >= '2026-01-01'
GROUP BY event_date
ORDER BY event_date;

Use case: a company runs applications on AWS (data in S3) but uses BigQuery and Looker for analytics. BigQuery Omni eliminates the $0.09/GB egress cost of copying data to GCS while still using BigQuery's query engine and BI tools.

Limitations: Omni queries are slower than native BigQuery (network between clouds), do not support all BigQuery features, and are not suitable for latency-sensitive workloads.

Cost Optimization

Check Query Cost Before Running (--dry_run)

from google.cloud import bigquery

client = bigquery.Client()

# Estimate bytes scanned BEFORE running a query
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
query = """
    SELECT user_id, COUNT(*) AS events
    FROM events.user_events
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    GROUP BY user_id
"""

dry_run_job = client.query(query, job_config=job_config)
bytes_scanned = dry_run_job.total_bytes_processed
cost_usd = bytes_scanned / 1e12 * 5  # $5 per TB

print(f"Bytes scanned: {bytes_scanned / 1e9:.2f} GB")
print(f"Estimated cost: ${cost_usd:.4f}")

Materialized Views for Expensive Repeated Queries

-- Create a materialized view that BigQuery refreshes automatically
CREATE MATERIALIZED VIEW features.daily_user_stats
PARTITION BY event_date
CLUSTER BY user_id
OPTIONS (enable_refresh = true, refresh_interval_minutes = 60)
AS
SELECT
    user_id,
    DATE(event_timestamp)   AS event_date,
    COUNT(*)                AS total_events,
    SUM(revenue)            AS total_revenue,
    COUNT(DISTINCT session_id) AS unique_sessions
FROM events.user_events
WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
GROUP BY 1, 2;

-- Queries that match this pattern automatically use the materialized view
-- (BigQuery rewrites the query transparently)
SELECT user_id, SUM(total_events)
FROM events.user_events
WHERE DATE(event_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY user_id;
-- ^ BigQuery rewrites this to scan features.daily_user_stats instead

Practical Cost Rules

Always filter on partition column - never compute the date from a non-partition column in the WHERE clause
Select only needed columns - BigQuery charges by bytes scanned, not rows; SELECT * scans all columns
Use approximate functions for exploratory work - APPROX_COUNT_DISTINCT vs COUNT(DISTINCT) can be 10x cheaper
Cache query results - identical queries within 24 hours are free (result caching is on by default)
Use clustering for secondary filters - after partition pruning, clustering further reduces scan within partitions

:::danger SELECT * on BigQuery Is Expensive BigQuery charges per byte scanned. A table with 50 columns and 1 TB of data charges $5 for SELECT * even if you only use 2 columns. Always explicitly name the columns you need. This single habit can reduce BigQuery costs by 30–70% for typical analytical workloads. :::

:::warning Streaming Inserts Are Not Immediately DML-Eligible Data inserted via the legacy streaming API has a small window (minutes to hours) during which it cannot be updated or deleted via DML (UPDATE, DELETE, MERGE). If your pipeline reads newly-inserted data and immediately tries to UPDATE or MERGE it, use the Storage Write API's pending stream mode or add a delay before DML operations. :::

Interview Q&A

Q1: Explain BigQuery's architecture. Why can it run a 2 TB query in 4 seconds?

BigQuery is built on three infrastructure components: Dremel (the distributed query engine), Colossus (the distributed storage layer), and Jupiter (Google's petabit-speed internal network).

When a query arrives, Dremel decomposes it into thousands of smaller operations and distributes them across hundreds to thousands of worker nodes (slots). Each slot reads a small chunk of Colossus-stored data and processes it. Because Jupiter's network bandwidth between compute and storage is massive (petabits per second bisection bandwidth), data can be streamed to thousands of compute workers simultaneously. The result aggregates back up the Dremel tree.

Traditional databases bottleneck on shared storage or network bandwidth. BigQuery's Jupiter network eliminates the network bottleneck. Dremel's parallelism eliminates the compute bottleneck. The result is near-linear horizontal scalability - a query that takes 4 seconds at 2 TB might take 5 seconds at 20 TB (not 40 seconds), because more partitions can be processed in parallel.

Q2: A team complains their BigQuery costs doubled in 3 months without any architecture changes. How do you diagnose and fix this?

Start with the INFORMATION_SCHEMA views:

-- Find the most expensive queries in the last 30 days
SELECT
    user_email,
    query,
    total_bytes_processed / 1e12 AS tb_scanned,
    total_bytes_processed / 1e12 * 5 AS estimated_cost_usd,
    creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
  AND statement_type != 'SCRIPT'
ORDER BY total_bytes_processed DESC
LIMIT 20;

Typical root causes: (1) a new analyst running SELECT * without partition filters, (2) a BI tool changed its query template and lost partition pruning, (3) a new table was created without partitioning, (4) a pipeline started re-scanning historical data instead of incremental partitions.

Fix checklist: add partition filters to expensive queries, remove SELECT *, add/rebuild partitioning on hot tables, create materialized views for expensive recurring aggregations.

Q3: When should you use BigQuery ML instead of training a model in Python with scikit-learn or LightGBM?

BigQuery ML makes sense when: your team is SQL-native with limited Python expertise, training data is already in BigQuery and exporting it would take significant time, you need to score data at BigQuery scale without setting up a serving infrastructure, and you need a rapid baseline model for stakeholder buy-in.

BigQuery ML is the wrong choice when: you need custom architectures (transformers, custom loss functions), hyperparameter tuning at scale (BigQuery ML's tuning is limited), sub-second online prediction latency (BigQuery ML scoring has seconds-level latency), or full control over training infrastructure (GPU/TPU, distributed training).

The practical heuristic: use BigQuery ML for SQL analysts building analytical models, use Vertex AI custom training for ML engineers building production models. BigQuery ML's value is eliminating Python and data movement - if you already have a Python workflow and a data movement pipeline, its advantages are smaller.

Q4: Explain partitioning vs. clustering in BigQuery. When does each help?

Partitioning divides a table into physical segments based on a column value - typically a date. BigQuery skips entire partitions that don't match the WHERE clause filter. This is metadata-level pruning - BigQuery doesn't read any data from pruned partitions.

Clustering sorts data within each partition by the specified columns. BigQuery can skip data blocks within a partition that don't contain matching values. This is block-level pruning - it reduces bytes scanned within the partitions that partitioning already selected.

They are complementary. The partition column should be your primary time-based filter (almost always a date or timestamp). Cluster columns should be the additional filters you apply most often - typically user_id, event_type, or similar high-cardinality categoricals.

When partitioning alone helps: queries always filter by date but have no secondary filters. When clustering adds value: queries filter by date AND by secondary columns - the secondary filter further reduces scan within the partition. When clustering alone (without partitioning) makes sense: integer range partitioning + cluster by user_id for user-centric data without a strong time dimension.

Q5: A BigQuery ML model has 85% accuracy in evaluation but performs poorly in production. What are the likely causes?

Training-serving skew is the most common cause. The training query pulls historical labeled data from BigQuery - but in production, the same features may be computed differently, have different null rates, or be computed at a different lag.

In BigQuery ML, the model is defined by a CREATE MODEL ... AS SELECT statement. If the production scoring query (ML.PREDICT) selects features from a different table, with different transformations, or at a different point in time (e.g., training on end-of-day features but scoring on real-time features), the model sees a different distribution than it trained on.

Diagnostic steps: compare the feature distributions of training data vs. production scoring data using APPROX_QUANTILES and COUNT(DISTINCT). Check for columns that are null in production but populated in training. Check for time-based features that are computed relative to event timestamps that shift between training and serving. Use ML.FEATURE_INFO to inspect expected feature statistics and compare to production.

Q6: What is BigQuery Omni and when would you recommend it to a team?

BigQuery Omni is BigQuery's multi-cloud capability - it lets you run BigQuery SQL queries against data stored in AWS S3 or Azure Blob Storage, without copying the data to Google Cloud. The compute runs in Google-managed infrastructure deployed in the same cloud region as your data.

Recommend Omni when: a team's data primarily lives in AWS or Azure (they can't or won't migrate), but they want BigQuery and its connected BI tools (Looker, Connected Sheets). The alternative - replicating data to GCS for BigQuery - incurs data transfer costs and pipeline complexity. Omni eliminates both.

Do not recommend Omni when: query latency is critical (Omni queries are slower than native BigQuery due to cross-cloud network), the workload is high-frequency (Omni adds per-query overhead), or the team needs full BigQuery feature parity (some BigQuery ML and advanced features don't work with Omni external tables).

The business case for Omni is strongest when a company uses Google Workspace and Looker for analytics but runs applications on AWS - common in enterprises that adopted G Suite but built infrastructure on AWS before GCP caught up.

Why This Exists - The On-Demand Analytics Problem​

BigQuery Architecture​

Dremel - The Query Engine​

Colossus - The Storage Layer​

Jupiter - The Network​

Pricing Models​

On-Demand Pricing​

Flat-Rate Reservations (Legacy)​

BigQuery Editions (Current)​

Partitioning and Clustering - The Most Impactful Optimization​

Partitioning​

Clustering​

BigQuery ML - Training Models in SQL​

Creating and Training a Model​

Supported Model Types in BigQuery ML​

When to Use BigQuery ML vs. Vertex AI​

Feature Engineering for ML in BigQuery SQL​

ARRAY and STRUCT for Nested Features​

Window Functions for Time-Series Features​

Approximate Aggregations for Scale​

BigQuery + Vertex AI Integration​

Pattern 1: Export Training Data from BigQuery to Vertex AI​

Pattern 2: Import Vertex AI Predictions Back to BigQuery​

BigQuery ML with Remote Models (Vertex AI Integration)​

Streaming Inserts vs. Storage Write API​

Legacy Streaming Inserts​

Storage Write API (Recommended)​

BigQuery Omni - Federated Queries Across Clouds​

Cost Optimization​

Check Query Cost Before Running (--dry_run)​

Materialized Views for Expensive Repeated Queries​

Practical Cost Rules​

Interview Q&A​