Multi-Cloud Data Strategies for AI Workloads

A company kept their data in AWS because that is where it had always lived. When they started using Google's Vertex AI for ML - it was demonstrably better than SageMaker for their use case, with superior AutoML, better GPU pricing, and a more capable Feature Store - they hit a problem nobody had anticipated during the evaluation: data egress costs.

AWS charges $0.09/GB for data egress from AWS to GCP. Moving 5 TB of training data for each experiment cost $450 per experiment run. The ML team ran 200 experiments per year. That is $90,000 per year in data transfer fees - for doing nothing more than moving bytes from one cloud to another.

A multi-cloud data architecture with data colocation would have saved more than $80,000 per year. The training data would have been replicated once to GCS - a $450 one-time cost - and subsequent experiment runs would read locally from GCP at zero egress cost. The architectural decision to keep all data in AWS, made before Vertex AI was chosen, was costing more than $80,000 per year.

This lesson covers why multi-cloud happens, the data gravity problem, how to model egress costs, the open format strategies that enable multi-cloud without vendor lock-in, and when single-cloud with open formats is actually the right answer.


Why This Exists - Multi-Cloud Is Not a Choice, It Is a Consequence

Organizations rarely choose multi-cloud deliberately. Multi-cloud happens because:

Acquisitions: Company A runs on AWS. Company B (just acquired) runs on Azure. Migrating Company B's infrastructure takes 2 years and costs $5 million. The combined entity is multi-cloud for the foreseeable future.

Best-of-breed services: GCP has better ML (Vertex AI, TPUs). AWS has better networking (CloudFront, PrivateLink). Azure has better Active Directory integration. Companies that evaluate services on merit end up using multiple clouds.

Regulatory requirements: Data must stay in the EU (GDPR). The ML training infrastructure is in US East. The data center that serves EU customers is in Frankfurt on Azure. The ML services are in GCP. Three clouds, not by preference but by legal requirement.

Organizational independence: different business units made different cloud decisions before there was centralized governance.

The question is never "should we be multi-cloud" - it is "how do we manage data across the clouds we already use, as cheaply and correctly as possible?"


The Multi-Cloud Data Problem

The multi-cloud data problem has four dimensions:

Data gravity: data is attracted to compute. When you have 10 TB of data in AWS S3, the natural choice is to process it using AWS compute (Glue, EMR, Athena). Moving it to GCP for ML training means 10 TB of egress at $0.09/GB = $900 per move. If you move it 100 times a year, that is $90,000 in egress costs that appear nowhere in the original architecture decision.

Consistency: data updated in AWS must be reflected in GCP. Real-time consistency requires synchronization - either continuous replication (expensive) or eventual consistency (stale training data).

Governance: access control in AWS uses IAM. Access control in GCP uses IAM (different IAM). Access control in Azure uses Azure AD and RBAC. Implementing a unified governance policy across three incompatible identity systems requires either a layer of abstraction (Unity Catalog, Immuta) or constant manual reconciliation.

Operational complexity: debugging a pipeline that spans three clouds requires observability tools in three different environments. A failed data transfer that looks like a network timeout could be AWS throttling, GCP authentication, or a Kinesis shard issue - the debugging surface is 3x larger.


Data Egress Cost Modeling

Before designing a multi-cloud architecture, quantify the egress cost of every option:

# Cloud egress cost model - build this before making any multi-cloud architecture decision

class EgressCostModel:
    # Egress pricing ($/GB) as of 2026 - verify current pricing at cloud provider sites
    EGRESS_PRICES = {
        ("aws", "gcp"): 0.09,
        ("aws", "azure"): 0.08,
        ("gcp", "aws"): 0.08,
        ("gcp", "azure"): 0.08,
        ("azure", "aws"): 0.087,
        ("azure", "gcp"): 0.087,
        # Same-cloud inter-region egress (lower, but non-zero)
        ("aws", "aws"): 0.02,      # cross-region within AWS
        ("gcp", "gcp"): 0.01,      # cross-region within GCP
        ("azure", "azure"): 0.02,  # cross-region within Azure
    }

    # Same-cloud, same-region: free
    # (informational only - egress_cost() below treats same-cloud moves as cross-region)
    FREE_ROUTES = {("aws", "aws", "same-region"), ("gcp", "gcp", "same-region")}

    def egress_cost(self, source_cloud: str, dest_cloud: str, gb: float) -> float:
        if source_cloud == dest_cloud:
            rate = self.EGRESS_PRICES.get((source_cloud, dest_cloud), 0.0)
        else:
            # Unknown cross-cloud pair: fall back to a conservative $0.09/GB
            rate = self.EGRESS_PRICES.get((source_cloud, dest_cloud), 0.09)
        return gb * rate

    def annual_experiment_cost(
        self,
        data_gb: float,
        experiments_per_year: int,
        source_cloud: str,
        dest_cloud: str,
        cache_strategy: str = "none"
    ) -> dict:
        """
        Calculate annual egress cost for ML experiments.

        cache_strategy:
            'none'      - move data fresh for every experiment
            'replicate' - replicate once, pay one-time cost
            'federate'  - query data remotely, no egress (but slower)
        """
        per_move = self.egress_cost(source_cloud, dest_cloud, data_gb)

        if cache_strategy == "none":
            annual_cost = per_move * experiments_per_year
            one_time_cost = 0
        elif cache_strategy == "replicate":
            one_time_cost = per_move  # pay once to replicate
            # No recurring egress, but the copy in the destination cloud costs storage
            dest_storage_cost = data_gb * 0.02 * 12  # ~$0.02/GB/month avg
            annual_cost = dest_storage_cost
        elif cache_strategy == "federate":
            annual_cost = 0  # federated query, no data movement
            one_time_cost = 0
        else:
            raise ValueError(f"Unknown cache_strategy: {cache_strategy!r}")

        return {
            "strategy": cache_strategy,
            "per_experiment_egress": per_move,
            "annual_egress_cost": annual_cost,  # recurring cost (storage, for 'replicate')
            "one_time_cost": one_time_cost,
            "total_year_1": annual_cost + one_time_cost,
            "total_year_2": annual_cost  # one-time cost already paid
        }


model = EgressCostModel()

# The opening example: 5 TB training data, AWS → GCP, 200 experiments/year
scenarios = [
    ("none", "Move data fresh each time"),
    ("replicate", "Replicate to GCS once, keep current"),
    ("federate", "Use BigQuery Omni or Vertex AI with cross-cloud connector"),
]

print("=== 5 TB Training Data, 200 Experiments/Year, AWS → GCP ===\n")
for strategy, description in scenarios:
    result = model.annual_experiment_cost(
        data_gb=5_000,
        experiments_per_year=200,
        source_cloud="aws",
        dest_cloud="gcp",
        cache_strategy=strategy
    )
    print(f"{description}")
    print(f"  Year 1 total: ${result['total_year_1']:,.0f}")
    print(f"  Year 2+:      ${result['total_year_2']:,.0f}\n")

Output:

=== 5 TB Training Data, 200 Experiments/Year, AWS → GCP ===

Move data fresh each time
Year 1 total: $90,000
Year 2+: $90,000

Replicate to GCS once, keep current
Year 1 total: $1,650 (one-time $450 egress + $1,200/year GCS storage)
Year 2+: $1,200

Use BigQuery Omni or Vertex AI with cross-cloud connector
Year 1 total: $0 (but query latency is higher)
Year 2+: $0
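
To see where replication starts paying for itself, the comparison can be reduced to a break-even count. This is a minimal sketch, assuming the same illustrative rates as the model above ($0.09/GB egress, roughly $0.02/GB-month destination storage); `break_even_experiments` is a hypothetical helper, not part of the lesson's `EgressCostModel`.

```python
import math


def break_even_experiments(
    data_gb: float,
    egress_per_gb: float = 0.09,         # assumed AWS -> GCP rate
    storage_per_gb_month: float = 0.02,  # assumed destination storage rate
) -> int:
    """Smallest number of experiments per year at which replicating once
    (one-time egress plus 12 months of storage) costs less than moving the
    data fresh for every experiment."""
    per_move = data_gb * egress_per_gb
    replicate_year_1 = per_move + data_gb * storage_per_gb_month * 12
    return math.floor(replicate_year_1 / per_move) + 1


print(break_even_experiments(5_000))
```

Under these assumed rates the dataset size cancels out, and replication is already cheaper from the fourth experiment of the year onward.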

Apache Iceberg as the Multi-Cloud Format

The most powerful architectural choice for multi-cloud data is to standardize on an open, cloud-agnostic table format. Apache Iceberg is the leading choice.

Iceberg stores data in open Parquet or ORC files plus a metadata layer (JSON table-metadata files and Avro manifest files) that any compatible engine can read. There is no proprietary binary format, no vendor lock-in, no conversion required to move between clouds. The same Iceberg table can be read by:

  • Athena (AWS)
  • BigQuery (GCP, via BigLake with Iceberg)
  • Spark on Databricks (any cloud)
  • Trino/Presto (self-hosted or managed)
  • Snowflake External Tables

# Write an Iceberg table to S3 with Spark (runs on AWS EMR, Databricks, or GCP Dataproc)
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.glue_catalog",
            "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.warehouse",
            "s3://my-data-lake/iceberg/") \
    .getOrCreate()

# Write Iceberg table to S3 - format is identical regardless of which cloud runs this
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.features.user_churn_features (
        user_id BIGINT,
        events_7d INT,
        events_30d INT,
        revenue_30d DOUBLE,
        days_since_purchase INT,
        event_date DATE
    )
    USING iceberg
    PARTITIONED BY (days(event_date))
    LOCATION 's3://my-data-lake/iceberg/features/user_churn_features'
""")

# Write feature data
# (features_df is assumed to be an existing DataFrame whose schema matches the table)
features_df.write \
    .format("iceberg") \
    .mode("append") \
    .save("glue_catalog.features.user_churn_features")
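
# Read the same Iceberg table from GCP Dataproc (Google's managed Spark)
# Data is still in S3 - no replication, no format conversion
# (requires cross-cloud IAM access from GCP service account to S3;
#  every scan still pays AWS egress for the bytes it reads)
# Note: a Hadoop catalog only discovers tables whose metadata was committed
# through a Hadoop catalog. For the Glue-managed table created above, configure
# the Glue catalog (or a shared REST catalog) here instead.

spark_gcp = SparkSession.builder \
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.s3_catalog",
            "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.s3_catalog.type", "hadoop") \
    .config("spark.sql.catalog.s3_catalog.warehouse",
            "s3://my-data-lake/iceberg/") \
    .getOrCreate()

# Query the S3-hosted Iceberg table from GCP
features = spark_gcp.table("s3_catalog.features.user_churn_features")
features.filter("event_date >= '2026-01-01'").show()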

Iceberg Multi-Cloud Replication Strategy

When you need data available in multiple clouds with low latency (not just federation), Iceberg's metadata design enables efficient replication:

# Replicate Iceberg table metadata to GCS for GCP consumers
# Data files stay in S3; only metadata is duplicated (typically < 1% of total data size)

import boto3
from google.cloud import storage

def replicate_iceberg_metadata(
    s3_bucket: str,
    s3_prefix: str,
    gcs_bucket: str,
    gcs_prefix: str
) -> None:
    """
    Replicate Iceberg metadata files (not data files) from S3 to GCS.
    GCP Spark jobs can then read metadata from GCS but data from S3.
    """
    s3 = boto3.client("s3")
    gcs_client = storage.Client()
    gcs_bkt = gcs_client.bucket(gcs_bucket)

    # List all Iceberg metadata files (small JSON/Avro files, not the Parquet data).
    # list_objects_v2 returns at most 1,000 keys per call, so paginate.
    paginator = s3.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=s3_bucket, Prefix=f"{s3_prefix}/metadata/")

    for page in pages:
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith((".json", ".avro")):  # only metadata, not data files
                data = s3.get_object(Bucket=s3_bucket, Key=key)["Body"].read()
                blob = gcs_bkt.blob(f"{gcs_prefix}/{key}")
                blob.upload_from_string(data)
                print(f"Replicated: {key}")

# Run after each write to the Iceberg table
replicate_iceberg_metadata(
    s3_bucket="my-data-lake",
    s3_prefix="iceberg/features/user_churn_features",
    gcs_bucket="my-gcs-bucket",
    gcs_prefix="iceberg-metadata/features/user_churn_features"
)

Delta Sharing - Cross-Cloud Live Data Sharing

Delta Sharing is an open protocol for sharing live Delta Lake or Iceberg data across organizations and clouds without copying data. A provider exposes a share endpoint; a consumer reads from it as if it were a local table.

# Provider side: configure Delta Sharing in Databricks Unity Catalog
# (or standalone Delta Sharing server)

# In Databricks SQL:
# CREATE SHARE feature_store_cross_cloud;
# ALTER SHARE feature_store_cross_cloud ADD TABLE ml_platform.features.user_churn_features;
# CREATE RECIPIENT gcp_ml_team;  -- open sharing: an activation link is generated for the recipient
# GRANT SELECT ON SHARE feature_store_cross_cloud TO RECIPIENT gcp_ml_team;

# Consumer side: read a Delta Share from Python (any cloud)
from delta_sharing import SharingClient, load_as_spark

# The profile file contains the Delta Sharing endpoint URL and credentials
profile_file = "my-share.profile"
client = SharingClient(profile_file)

# List available shares, then the tables under each schema of each share
shares = client.list_shares()
for share in shares:
    print(f"Share: {share.name}")
    for schema in client.list_schemas(share):
        for table in client.list_tables(schema):
            print(f"  Table: {schema.name}.{table.name}")

# Load the shared table as a Spark DataFrame (works on GCP Dataproc or Azure Databricks)
# URL format: <profile-file>#<share>.<schema>.<table>
table_url = f"{profile_file}#feature_store_cross_cloud.features.user_churn_features"
df = load_as_spark(table_url)
df.show(5)

Delta Sharing data is read-only on the consumer side. The sharing server authorizes each request and returns short-lived pre-signed URLs for the relevant data files, so the consumer never holds long-lived credentials to the underlying S3/ADLS/GCS storage. This means no data replication and live data (reads see the latest version). It does not guarantee zero egress: the files are still pulled over HTTPS from the provider's storage, so a consumer in a different cloud or region typically incurs egress charges on the provider's bill.
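
For reference, the profile file the consumer code loads is a small JSON document. The sketch below writes one using the field names from the open Delta Sharing protocol (`shareCredentialsVersion`, `endpoint`, `bearerToken`); the endpoint URL and token are placeholders, and `write_profile` is a hypothetical helper. In practice the provider issues this file through the activation link, and it should be treated as a secret.

```python
import json
import tempfile
from pathlib import Path


def write_profile(path: Path, endpoint: str, bearer_token: str) -> dict:
    """Write a Delta Sharing profile file and return its contents."""
    profile = {
        "shareCredentialsVersion": 1,
        "endpoint": endpoint,
        "bearerToken": bearer_token,
    }
    path.write_text(json.dumps(profile, indent=2))
    return profile


# Placeholder endpoint and token, written to a temp directory for illustration
profile_path = Path(tempfile.mkdtemp()) / "my-share.profile"
write_profile(
    profile_path,
    endpoint="https://sharing.example.com/delta-sharing/",
    bearer_token="<bearer-token>",
)
print(profile_path.read_text())
```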


Query Federation - Query Without Moving Data

Federation lets you run queries against remote data sources without copying data. The query engine connects to the remote source, pushes down predicates, and retrieves only the result set - typically much smaller than the raw data.

BigQuery Omni - Query S3 from GCP

-- In BigQuery: create an external connection to AWS (configured via Console)
-- Then create an external table pointing to S3

CREATE EXTERNAL TABLE `my-project.ml_features.user_churn_s3`
WITH CONNECTION `us.my-aws-connection`
OPTIONS (
format = 'PARQUET',
uris = ['s3://my-data-lake/iceberg/features/user_churn_features/data/*.parquet']
);

-- Query S3 data from BigQuery (compute runs in AWS region, no GCP egress)
SELECT
event_date,
COUNT(DISTINCT user_id) AS users,
AVG(events_30d) AS avg_events
FROM `my-project.ml_features.user_churn_s3`
WHERE event_date >= '2026-01-01'
GROUP BY event_date
ORDER BY event_date DESC;

Athena Federated Queries - Query GCS or Azure from AWS

# Athena federated queries use Lambda connectors to reach external sources
# The GCS connector reads data from GCS using a Lambda function in the AWS account

import boto3

athena = boto3.client("athena")

# Query data in GCS using Athena's GCS connector
response = athena.start_query_execution(
QueryString="""
SELECT *
FROM "gcs-connector"."my-gcs-bucket"."features/user_churn"
WHERE event_date >= date '2026-01-01'
LIMIT 1000
""",
QueryExecutionContext={"Database": "default"},
ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"}
)

Trino - Multi-Cloud SQL Federation

Trino (formerly PrestoSQL) is a distributed SQL engine that natively supports federating queries across multiple data sources - S3, GCS, Azure ADLS, Hive, Iceberg, Delta Lake, BigQuery, and traditional databases:

-- Trino: federate a query across S3 (AWS) and GCS (GCP) in a single SQL statement
-- This runs against the Trino coordinator; data is pulled from both clouds and joined in Trino memory

SELECT
a.user_id,
a.events_30d, -- from S3 (AWS Iceberg table)
b.vertex_score -- from GCS (Vertex AI prediction output)
FROM
iceberg.aws_features.user_churn_features a
JOIN gcs.gcp_predictions.vertex_churn_scores b ON a.user_id = b.user_id
WHERE
a.event_date = CURRENT_DATE - INTERVAL '1' DAY
AND b.scored_date = CURRENT_DATE - INTERVAL '1' DAY;

When federation beats replication:

  • Query returns a small result set (the WHERE clause filters aggressively)
  • Data is read infrequently (daily or weekly, not hourly)
  • Data changes frequently (replication lag would cause stale results)
  • The schema evolves rapidly (maintaining replicas requires schema sync)

When replication beats federation:

  • Query scans a large percentage of the data (federation pulls everything over the network)
  • Low latency is required (federation adds round-trip network overhead)
  • The data is queried many times per day (replication pays off with volume)

Multi-Cloud Governance

The hardest part of multi-cloud is not data movement - it is consistent governance. Access control, encryption, audit logging, and data classification policies that work in AWS do not automatically apply in GCP.

# Approach: centralized policy definitions deployed to each cloud
# using infrastructure-as-code (Terraform)

# terraform/data_governance/main.tf (illustrative)
"""
# AWS Lake Formation table permissions
resource "aws_lakeformation_permissions" "ml_team_features" {
principal = "arn:aws:iam::123456789:role/MLEngineerRole"
permissions = ["SELECT"]
table {
database_name = "data_lake"
name = "user_churn_features"
}
}

# GCP BigQuery IAM binding (equivalent permission in GCP)
resource "google_bigquery_table_iam_binding" "ml_team_features" {
dataset_id = "features"
table_id = "user_churn_features"
role = "roles/bigquery.dataViewer"
members = ["group:[email protected]"]
}

# Azure: RBAC binding for ADLS (equivalent in Azure)
resource "azurerm_role_assignment" "ml_team_storage" {
scope = azurerm_storage_account.data_lake.id
role_definition_name = "Storage Blob Data Reader"
principal_id = data.azuread_group.ml_engineers.id
}
"""

The challenge: "ML Engineer" in AWS is an IAM role, in GCP it is a Google group, in Azure it is an AAD group. Keeping these in sync requires an identity federation solution (Azure AD as the identity provider for all three clouds, or a product like Okta/HashiCorp Vault for cloud-agnostic identity).
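
One lightweight way to keep the three identity systems honest is to maintain a single mapping from each logical role to its principal in every cloud and check it for gaps before applying any policy. This is a minimal sketch, not a replacement for identity federation; `RoleMapping`, `missing_bindings`, and all principal values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RoleMapping:
    logical_role: str
    aws_role_arn: Optional[str] = None  # IAM role in AWS
    gcp_group: Optional[str] = None     # Google group in GCP
    azure_group: Optional[str] = None   # directory group in Azure


def missing_bindings(mappings: List[RoleMapping]) -> Dict[str, List[str]]:
    """Return, per logical role, the clouds where no principal is defined."""
    gaps: Dict[str, List[str]] = {}
    for m in mappings:
        principals = (
            ("aws", m.aws_role_arn),
            ("gcp", m.gcp_group),
            ("azure", m.azure_group),
        )
        missing = [cloud for cloud, principal in principals if not principal]
        if missing:
            gaps[m.logical_role] = missing
    return gaps


# Hypothetical mappings: the analyst role has no Azure group yet
mappings = [
    RoleMapping("ml-engineer", "arn:aws:iam::111122223333:role/MLEngineerRole",
                "ml-engineers@example.com", "ml-engineers"),
    RoleMapping("analyst", "arn:aws:iam::111122223333:role/AnalystRole",
                "analysts@example.com", None),
]
print(missing_bindings(mappings))
```

A check like this can run in CI ahead of the Terraform plan, so a role that exists in only two of three clouds is caught before it becomes an access-control gap.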


Data Replication Strategies

When federation is not sufficient, you need to replicate data. Three patterns:

Active-Active - Full Sync Both Ways

Both clouds have the current state at all times. Any write to either cloud is replicated to the other.

Use case: two regional deployments that must each be self-sufficient for disaster recovery.

Cost: 2x storage + 2x the replication compute + egress for every write.

Complexity: conflict resolution when both clouds receive writes simultaneously.
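
To make the conflict-resolution problem concrete, here is a minimal sketch of last-writer-wins, one common (and lossy) policy. It is an illustration rather than a recommendation: it assumes reasonably synchronized clocks across clouds and silently discards the losing write. `Versioned` and `merge_lww` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Versioned:
    value: str
    updated_at: float  # epoch seconds when the write was accepted
    origin: str        # cloud that accepted the write


def merge_lww(a: Dict[str, Versioned], b: Dict[str, Versioned]) -> Dict[str, Versioned]:
    """Merge two replicas key by key, keeping the most recent write.
    Ties on timestamp are broken by origin name so that both clouds
    converge on the same result regardless of merge order."""
    merged = dict(a)
    for key, record in b.items():
        current = merged.get(key)
        if current is None or (record.updated_at, record.origin) > (
            current.updated_at,
            current.origin,
        ):
            merged[key] = record
    return merged


aws_replica = {"user:1": Versioned("tier=gold", 100.0, "aws")}
gcp_replica = {
    "user:1": Versioned("tier=silver", 105.0, "gcp"),
    "user:2": Versioned("tier=bronze", 90.0, "gcp"),
}
print(merge_lww(aws_replica, gcp_replica))
```

The important property is that the merge gives the same answer in either direction; without a deterministic tie-break the two clouds can disagree indefinitely.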

Active-Passive - Replicate for DR Only

Primary cloud receives all writes. Secondary cloud is a read-only replica for disaster recovery - it may be hours behind.

Use case: disaster recovery with RTO measured in hours, not seconds. No active ML workloads on the secondary cloud.

Cost: 2x storage + egress for batch replication jobs (typically daily).

Complexity: low - the secondary is always read-only.

Selective Replication - Copy Only What Each Cloud Needs

Each cloud receives only the data that workloads in that cloud actually need. AWS gets raw events and feature engineering pipelines. GCP gets only the ML training data (a subset, after feature engineering). Azure gets only audit logs and reporting aggregates.

Use case: heterogeneous cloud usage patterns where each cloud serves a different function.

Cost: lowest - only the data each cloud actually consumes is replicated.

Complexity: must define and maintain data contracts between clouds.

# Selective replication: AWS → GCP, only the feature data needed for Vertex AI training
import boto3
from google.cloud import storage

def replicate_features_to_gcs(
    source_table: str,
    target_gcs_path: str,
    date_filter: str
) -> int:
    """
    Replicate only the latest feature partition from S3 (Iceberg/Parquet) to GCS.
    Called once after each daily feature engineering run.
    """
    # Read from AWS Athena (feature table in S3)
    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=f"""
            SELECT user_id, events_7d, events_30d, revenue_30d, days_since_purchase, churned
            FROM {source_table}
            WHERE event_date = DATE '{date_filter}'
        """,
        ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"}
    )

    # Wait for completion and download result
    # ... (polling code omitted for brevity)

    # Upload result to GCS for Vertex AI consumption
    gcs_client = storage.Client()
    bucket = gcs_client.bucket("my-gcs-training-data")
    blob = bucket.blob(f"{target_gcs_path}/{date_filter}/features.parquet")

    # In production: stream the Athena result directly to GCS
    # This example: simplified path
    with open(f"/tmp/features_{date_filter}.parquet", "rb") as f:
        blob.upload_from_file(f)

    print(f"Replicated features for {date_filter} to GCS")
    return 1
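
The polling step elided in the example could look roughly like the sketch below. It relies on Athena's `get_query_execution` call and its documented states (`QUEUED`, `RUNNING`, `SUCCEEDED`, `FAILED`, `CANCELLED`); `wait_for_query`, the poll interval, and the timeout are hypothetical choices. The client is passed in, so the helper itself does not depend on how credentials are configured.

```python
import time


def wait_for_query(
    athena,
    query_execution_id: str,
    poll_seconds: float = 2.0,
    timeout_seconds: float = 600.0,
) -> str:
    """Poll Athena until the query finishes; return the S3 location of the result."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        execution = athena.get_query_execution(
            QueryExecutionId=query_execution_id
        )["QueryExecution"]
        state = execution["Status"]["State"]
        if state == "SUCCEEDED":
            return execution["ResultConfiguration"]["OutputLocation"]
        if state in ("FAILED", "CANCELLED"):
            reason = execution["Status"].get("StateChangeReason", "unknown reason")
            raise RuntimeError(f"Athena query {query_execution_id} {state}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Athena query {query_execution_id} still {state}")
        time.sleep(poll_seconds)


# Usage inside replicate_features_to_gcs (assumes the `athena` client and `response` from above):
# result_location = wait_for_query(athena, response["QueryExecutionId"])
```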

When to Go Single-Cloud

Multi-cloud has real costs: operational complexity, governance overhead, egress charges, and the cognitive overhead of context-switching between three different cloud consoles, three different billing systems, and three different security models.

The honest assessment:

Multi-cloud adds value when:
- You are legally required to (data residency regulations)
- You acquired a company on a different cloud and migration cost > 3-year multi-cloud cost
- A best-of-breed service on another cloud saves significantly more than the complexity costs
- You have genuine DR requirements that a single-cloud multi-region setup cannot meet

Multi-cloud is expensive when:
- You are doing it to "avoid vendor lock-in" without concrete data showing lock-in risk
- The services you use on each cloud largely overlap (you could use one cloud's equivalent)
- Your team does not have deep expertise in all the clouds you're using
- Egress costs between clouds exceed the premium-service savings that drove you there

Open formats as the single-cloud answer: If you store all data in Apache Iceberg or Delta Lake on S3, you are not locked in - any query engine can read your data. You can migrate to GCP by pointing a Dataproc cluster at your S3 Iceberg tables while you stand up the new infrastructure. You can add Databricks or Athena without re-ingesting data. Open format is the real vendor lock-in protection, not multi-cloud.

-- Check: can you read your data with 3 different engines?
-- If yes, you have open format portability without multi-cloud complexity

-- Engine 1: Athena (AWS serverless SQL)
SELECT COUNT(*) FROM iceberg_catalog.features.user_churn_features;

-- Engine 2: Databricks (any cloud)
SELECT COUNT(*) FROM iceberg.`s3://my-data-lake/iceberg/features/user_churn_features`;

-- Engine 3: Trino (self-hosted, cloud-agnostic)
SELECT COUNT(*) FROM iceberg.features.user_churn_features;

If all three work - you have de facto multi-cloud portability for your data without paying multi-cloud egress costs or operational complexity.


:::danger Egress Costs Are Not in the Architecture Document The biggest failure mode in multi-cloud architecture is that egress costs appear on the cloud bill 30-60 days after an architecture decision is made. Nobody listed "data transfer" as a line item in the cost model. The solution: build the egress cost model (as shown earlier in this lesson) before any multi-cloud architecture is approved. Make the egress cost explicit in every architecture review document. :::

:::warning Cross-Cloud Latency Affects ML Inference If your ML model is served on GCP (Vertex AI) but needs to read features from an AWS DynamoDB online store at inference time, the cross-cloud round-trip adds 30-100ms to every prediction. For latency-sensitive applications (recommendation, fraud detection, ad serving), this is unacceptable. Features must be co-located with the serving infrastructure. Design for this from the start - either replicate features to the serving cloud or choose a single serving cloud. :::


Interview Q&A

Q1: What is data gravity and why is it the central problem in multi-cloud data architectures?

Data gravity is the phenomenon where large datasets attract compute, applications, and services toward themselves - similar to how mass creates gravitational pull. The larger the dataset, the more expensive and disruptive it becomes to move it.

In cloud architectures, data gravity manifests as egress costs. When you have 50 TB of user event data in S3, moving it to GCS for ML training costs $4,500 in egress. Running the training job in AWS (where the data lives) costs nothing in egress. So even if GCP's GPU pricing is 20% cheaper than AWS, the egress cost of frequent data transfers may more than offset the compute savings.

Data gravity creates lock-in through economic friction: it is not that you cannot move the data, it is that moving the data costs enough that you keep choosing the cloud where the data already lives. The solution is either: (1) co-locate compute with data (accept that ML training happens in AWS because data is there), (2) replicate data once and keep both copies current (pay once, then save on per-experiment egress), (3) use federation to query data remotely without moving it, or (4) use open formats and accept that your architecture is designed for a future migration, not for daily cross-cloud movement.


Q2: A company wants to use Vertex AI (GCP) for ML but keeps raw data in AWS S3. Design the architecture to minimize cost and complexity.

The right architecture depends on how often training data changes and how many experiments run per month.

If training data is relatively static and updated monthly: replicate the training dataset to GCS once per month after each update. Monthly replication of 5 TB costs $450/month in egress - far cheaper than per-experiment egress. Use GCS as the authoritative ML training data store for GCP workloads. Keep raw data in S3 as the system of record.

If training data changes daily and experiments run frequently: run feature engineering in AWS (Glue or EMR on S3 data), produce a feature dataset, and replicate only the latest feature snapshot to GCS daily. The feature dataset is much smaller than raw data - typically 1-10% of the size. Daily replication of a 100 GB feature snapshot costs $9/day = $270/month - manageable.
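
The two cadences can be compared with a few lines of arithmetic. This is a minimal sketch assuming the $0.09/GB AWS-to-GCP rate used throughout this lesson and a 30-day month; `monthly_replication_egress` is a hypothetical helper.

```python
def monthly_replication_egress(
    snapshot_gb: float,
    transfers_per_month: int,
    egress_per_gb: float = 0.09,  # assumed AWS -> GCP rate
) -> float:
    """Egress cost per month for replicating a snapshot on a fixed cadence."""
    return round(snapshot_gb * transfers_per_month * egress_per_gb, 2)


full_monthly = monthly_replication_egress(5_000, transfers_per_month=1)
daily_features = monthly_replication_egress(100, transfers_per_month=30)

print(f"Monthly copy of the 5 TB training set:      ${full_monthly:,.0f}/month")
print(f"Daily copy of a 100 GB feature snapshot:    ${daily_features:,.0f}/month")
```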

For both patterns: standardize on Apache Iceberg as the table format. Iceberg metadata can be replicated to GCS while data files remain in S3. GCP Spark jobs read metadata from GCS and data files directly from S3 (requires S3 access from GCP service accounts via IAM). This avoids full data replication - only the metadata layer (tiny) is synchronized.

The governance layer: use Unity Catalog (if Databricks is in the stack) or a policy-as-code approach (Terraform-managed IAM policies in both clouds) to maintain consistent access control.


Q3: Explain Apache Iceberg's role in enabling multi-cloud without vendor lock-in.

Apache Iceberg is an open table format - it defines how data files and metadata are organized, not which engine reads them or which cloud stores them. An Iceberg table stored on S3 is readable by Athena, Spark on EMR, Databricks, Trino, BigQuery (via BigLake), Snowflake (external tables), and Flink - without any conversion or proprietary translation layer.

Iceberg's architecture has two layers: data files (Parquet or ORC in object storage) and a metadata layer (JSON files describing the table schema, partition spec, and which data files constitute each snapshot). The metadata is the key innovation - it enables time travel (each snapshot references the exact set of data files that constituted the table at that moment), ACID transactions (atomic commits write a new metadata file), and schema evolution (tracked in the metadata without rewriting data files).
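
The snapshot idea is easier to see in a toy model. The sketch below is a deliberately simplified illustration, not Iceberg's actual metadata layout: each commit records the full list of data files that make up the table at that point, and time travel means reading the file list of an older snapshot. `ToyTable` and the file names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyTable:
    # snapshot_id -> immutable list of data files visible in that snapshot
    snapshots: Dict[int, Tuple[str, ...]] = field(default_factory=dict)
    current_snapshot_id: Optional[int] = None

    def commit_append(self, new_files: List[str]) -> int:
        """Atomically publish a new snapshot: old files plus the new ones.
        Existing data files are never rewritten."""
        previous = self.snapshots.get(self.current_snapshot_id, ())
        snapshot_id = len(self.snapshots) + 1
        self.snapshots[snapshot_id] = previous + tuple(new_files)
        self.current_snapshot_id = snapshot_id  # the single "pointer swap"
        return snapshot_id

    def files(self, snapshot_id: Optional[int] = None) -> Tuple[str, ...]:
        """Files to scan for the current snapshot, or for an older one (time travel)."""
        if snapshot_id is None:
            snapshot_id = self.current_snapshot_id
        return self.snapshots.get(snapshot_id, ())


table = ToyTable()
first = table.commit_append(["data/2026-01-01.parquet"])
second = table.commit_append(["data/2026-01-02.parquet"])

print(table.files())       # current state: both files
print(table.files(first))  # time travel: only the first file
```

Because a reader only needs the metadata to know which files to scan, any engine in any cloud that can read the metadata and reach the object store sees the same table.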

For multi-cloud, the practical implication: you can store all data in S3 (where it is cheapest and where most raw data arrives), define it as an Iceberg table, and query it from GCP Dataproc, Azure Databricks, or your laptop via Trino without moving any data. The portability comes from the open format - not from physically putting data in multiple clouds.


Q4: When should you use query federation versus data replication in a multi-cloud architecture?

The decision comes down to four factors: query selectivity, query frequency, result size, and acceptable latency.

Use federation when: the query is highly selective (WHERE clause filters to less than 1% of data), query frequency is low (daily or weekly), acceptable latency is seconds (not milliseconds), and data changes frequently (replication lag would cause stale results). Federation moves only the query result set across the network - if the result is 1 MB from a 1 TB source, you paid for 1 MB of network, not 1 TB.

Use replication when: the query scans a large percentage of the source data (federation would stream most of the data across the network anyway), query frequency is high (hundreds of queries per day), low latency is required (federation adds 50-200ms network overhead), and data changes infrequently (weekly or monthly). Replication pays the egress cost once, then serves all subsequent queries locally.

The break-even analysis: compare what each option costs over the same period. Federation costs roughly (number of queries) x (bytes transferred per query) x (egress rate); replication costs the egress for each refresh plus storage for the copy. If the federated total is higher, replicate; if it is lower, federate. For most ML training scenarios (large datasets, read in bulk on every training run), replication wins. For ad hoc analytics (small result sets, variable access patterns), federation wins.
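
The comparison can be sketched as a small cost function. It is a rough model under stated assumptions: $0.09/GB egress, roughly $0.02/GB-month storage, and federation billed only for the bytes that actually cross clouds. `cheaper_strategy` and the example workloads are hypothetical.

```python
from typing import Tuple


def cheaper_strategy(
    queries_per_month: int,
    gb_transferred_per_query: float,
    dataset_gb: float,
    refreshes_per_month: int,
    egress_per_gb: float = 0.09,         # assumed cross-cloud egress rate
    storage_per_gb_month: float = 0.02,  # assumed storage rate for the replica
) -> Tuple[str, float, float]:
    """Return (cheaper option, federation $/month, replication $/month)."""
    federate = queries_per_month * gb_transferred_per_query * egress_per_gb
    replicate = (
        dataset_gb * refreshes_per_month * egress_per_gb
        + dataset_gb * storage_per_gb_month
    )
    choice = "federate" if federate <= replicate else "replicate"
    return choice, federate, replicate


# Ad hoc analytics: 100 queries/month, ~1 MB result each, against a 1 TB source
print(cheaper_strategy(100, 0.001, dataset_gb=1_000, refreshes_per_month=1))

# ML training: 17 runs/month, each reading the full 5 TB dataset
print(cheaper_strategy(17, 5_000, dataset_gb=5_000, refreshes_per_month=1))
```

The model ignores latency and engineering effort, so it is a first filter on cost rather than a complete decision procedure.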


Q5: A company is experiencing $90,000/year in AWS-to-GCP egress costs for ML experiments. What are the three solutions and their trade-offs?

Solution 1: Selective replication. After each feature engineering run (daily), replicate the latest feature partition from S3 to GCS. The feature partition is roughly 100 GB (after aggregation from raw data). Replication cost: $9/day, or about $3,300/year. GCS storage for 90 days of features (about 9 TB at roughly $0.02/GB-month): about $180/month, or $2,160/year. Total: ~$5,500/year. Savings: ~$84,500/year. Trade-off: 24-hour data lag (training data is always yesterday's features, not real-time).

Solution 2: Move ML training to AWS. Use SageMaker instead of Vertex AI. No egress cost (data stays in AWS). Trade-off: if Vertex AI's ML capabilities were the reason for GCP (better AutoML, better TPU access, better pre-trained model ecosystem), this sacrifices the capability advantage that motivated the multi-cloud decision in the first place.

Solution 3: Federated query. Use BigQuery Omni or Vertex AI's cross-cloud connectors to query S3 data from GCP without copying it. Egress cost: $0 (or minimal, depending on configuration). Trade-off: federated query latency is 2-5x higher than reading from GCS. For interactive exploration this is acceptable; for production training pipelines that must complete in a fixed window, the latency may be a blocker.

The practical recommendation: Solution 1 (selective replication) is almost always the right answer. It eliminates about 94% of the egress cost ($90K → $5.5K), introduces only a 24-hour lag that most ML workflows can tolerate, and requires a single 50-line script to implement. Solutions 2 and 3 involve larger architectural changes or capability trade-offs.


Q6: What is Delta Sharing and how does it differ from replicating Delta Lake data across clouds?

Delta Sharing is an open protocol for sharing live Delta Lake (or Iceberg) data across organizations and clouds without copying data. The data remains in the provider's storage (S3, ADLS, GCS). The consumer reads from the provider's Delta Sharing endpoint - they see the current data without any replication lag.

The mechanism: the provider runs a Delta Sharing server (managed by Databricks Unity Catalog, or self-hosted). The server exposes a REST API. Consumers authenticate with a token and issue queries through the API. The server checks access and returns short-lived pre-signed URLs for the matching data files, which the consumer then reads directly from the provider's object storage.

Compared to replication: replication creates a full copy of the data in the consumer's cloud - high egress cost upfront, then zero-latency local queries. Delta Sharing has no upfront egress cost but adds API call overhead (30-100ms) for every read. For infrequent reads (training a model once a week), Delta Sharing is cheaper than replication. For frequent reads (BI tools querying every minute), replication is cheaper because the per-call overhead compounds.

Delta Sharing also provides governance advantages: the provider controls access at the row and column level through the sharing server. Revoke access by removing the recipient - no data to delete from the consumer's cloud, no stale copies left behind. Replication gives the consumer a copy they control indefinitely - you cannot "un-share" replicated data.

© 2026 EngineersOfAI. All rights reserved.