What is aws data engineering?

S3, Glue, Athena, EMR, and the AWS data engineering ecosystem.

How does amazon s3 work in practice?

AWS Data Services covers aws data engineering, amazon s3, aws glue from first principles with code examples. Free lesson at https://engineersofai.com/docs/data-engineering/cloud-data-platforms/aws-data-services

What is the difference between aws data engineering and aws glue?

See the full breakdown at https://engineersofai.com/docs/data-engineering/cloud-data-platforms/aws-data-services

:::tip 🎮 Interactive Playground Visualize this concept: Try the Data Lakehouse Architecture demo on the EngineersOfAI Playground - no code required. :::

AWS Data Services

An AWS-native company used 11 different data services: S3, Glue, Athena, EMR, Redshift, Kinesis, MSK, DynamoDB, SageMaker, Lake Formation, and Step Functions. Understanding how they fit together took a new hire three months. The first month was just figuring out that Glue, Athena, and EMR could all run the same SQL query but for completely different use cases, price points, and operational models. The second month was understanding that Redshift Spectrum and Athena both query S3 but are designed for entirely different workloads. The third month was Lake Formation - which governs all of it.

This lesson is that three-month education in one hour. It maps the AWS data ecosystem from a data engineer's perspective, explains what each service is for, when to use it, and how these services compose into a complete ML data platform.

Why This Exists - The Cloud Data Puzzle

AWS built data services organically over 15 years - each solving a specific problem that existed at a specific point in time. S3 (2006) solved cheap, durable storage. EMR (2009) made Hadoop accessible without managing clusters. Redshift (2012) brought columnar MPP warehousing to the cloud. Glue (2017) added serverless ETL. Athena (2016) added serverless SQL on S3. SageMaker (2017) added managed ML. Lake Formation (2019) added governance.

The result is a powerful ecosystem where the right combination of services can handle any data and ML workload - but the breadth is also confusing. There is no single AWS "data platform" - there is a set of composable services that you must architect correctly.

The mental model: S3 is the universal data substrate. Everything else connects to S3. Glue reads from S3 and writes to S3. Athena queries S3. Redshift Spectrum queries S3. EMR processes data in S3. SageMaker reads training data from S3 and writes model artifacts to S3. Lake Formation governs access to S3. Once you understand S3 as the center of gravity, the rest of the ecosystem makes sense.

The AWS Data Landscape

Amazon S3 - The Lakehouse Foundation

S3 is not a database or a query engine. It is an object storage system - infinitely scalable, 99.999999999% (11 nines) durable, and priced at $0.023/GB/month for standard storage. It is the foundation of every AWS data architecture.

Storage Classes and Lifecycle Policies

Not all data needs the same storage tier:

Storage Class	Access Pattern	Cost ($/GB/month)	Retrieval
S3 Standard	Frequent access	$0.023	Instant
S3 Intelligent-Tiering	Unknown pattern	$0.023 + monitoring fee	Instant
S3 Standard-IA	Infrequent access	$0.0125	Instant
S3 Glacier Instant Retrieval	Archives, quarterly	$0.004	Instant
S3 Glacier Flexible Retrieval	Archives	$0.0036	Minutes–hours
S3 Glacier Deep Archive	Long-term compliance	$0.00099	Hours

import boto3

s3_client = boto3.client("s3")

# Set a lifecycle policy: automatically tier data by age
lifecycle_config = {
    "Rules": [
        {
            "ID": "ml-data-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": "training-data/"},
            "Transitions": [
                {"Days": 30,  "StorageClass": "STANDARD_IA"},      # after 30 days
                {"Days": 90,  "StorageClass": "GLACIER_IR"},         # after 90 days
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}        # after 1 year
            ],
            "Expiration": {"Days": 2555}  # delete after 7 years
        }
    ]
}

s3_client.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration=lifecycle_config
)

S3 as an Iceberg Lakehouse

S3 + Apache Iceberg is the most common modern lakehouse pattern on AWS. Iceberg metadata lives in S3 alongside data files, enabling ACID transactions and schema evolution on raw S3 storage:

# Writing Iceberg tables to S3 using PyIceberg
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, LongType, TimestampType
import pyarrow as pa

catalog = load_catalog("glue", **{
    "type": "glue",
    "s3.region": "us-east-1"
})

# Create an Iceberg table backed by S3
schema = Schema(
    NestedField(1, "user_id", LongType(), required=True),
    NestedField(2, "event_type", StringType(), required=True),
    NestedField(3, "revenue", pa.float64(), required=False),
    NestedField(4, "event_timestamp", TimestampType(), required=True)
)

table = catalog.create_table(
    "data_lake.user_events",
    schema=schema,
    location="s3://my-data-lake/user-events/"
)

AWS Glue - Serverless ETL and the Data Catalog

AWS Glue is two things that happen to share a name: a serverless ETL service (run Spark jobs without managing clusters) and a Data Catalog (a centralized metadata repository compatible with the Hive metastore).

The Glue Data Catalog

The Glue Data Catalog stores table schemas, partition information, and connection properties for all your data sources. Athena, EMR, and Redshift Spectrum all read from the same catalog - define a table once, query it from any service.

import boto3

glue_client = boto3.client("glue")

# Register a table in the Glue Catalog pointing to Iceberg data in S3
glue_client.create_table(
    DatabaseName="data_lake",
    TableInput={
        "Name": "user_events",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "user_id",          "Type": "bigint"},
                {"Name": "event_type",       "Type": "string"},
                {"Name": "revenue",          "Type": "double"},
                {"Name": "event_timestamp",  "Type": "timestamp"}
            ],
            "Location": "s3://my-data-lake/user-events/",
            "InputFormat": "org.apache.iceberg.mr.mapred.IcebergInputFormat",
            "OutputFormat": "org.apache.iceberg.mr.mapred.IcebergOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.iceberg.mr.hive.HiveIcebergSerDe"
            }
        },
        "PartitionKeys": [],
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "table_type": "ICEBERG",
            "metadata_location": "s3://my-data-lake/user-events/metadata/v1.metadata.json"
        }
    }
)

Glue ETL Jobs - Serverless PySpark

Glue ETL jobs run PySpark code without you managing any cluster. You specify worker type (G.1X = 4 vCPU, G.2X = 8 vCPU), worker count, and Glue provisions, runs, and terminates the cluster automatically:

# Glue ETL script: raw events → feature table
# This runs inside a Glue job - DynamicFrames are Glue's schema-flexible alternative to DataFrames

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import functions as F
from pyspark.sql.window import Window

args = getResolvedOptions(sys.argv, ["JOB_NAME", "input_database", "output_path"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read from Glue Catalog (Iceberg table on S3)
raw_events = glueContext.create_dynamic_frame.from_catalog(
    database=args["input_database"],
    table_name="user_events"
).toDF()

# Feature engineering
window_7d = Window.partitionBy("user_id").orderBy("event_date").rowsBetween(-6, 0)
window_30d = Window.partitionBy("user_id").orderBy("event_date").rowsBetween(-29, 0)

features = (
    raw_events
    .groupBy("user_id", F.to_date("event_timestamp").alias("event_date"))
    .agg(
        F.count("*").alias("daily_events"),
        F.sum("revenue").alias("daily_revenue")
    )
    .withColumn("events_7d",  F.sum("daily_events").over(window_7d))
    .withColumn("events_30d", F.sum("daily_events").over(window_30d))
    .withColumn("revenue_30d", F.sum("daily_revenue").over(window_30d))
)

# Write to S3 in Parquet format, partitioned by date
(
    features.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet(args["output_path"])
)

job.commit()

Glue DataBrew - No-Code Data Quality

AWS Glue DataBrew provides a visual interface for data profiling and transformation - no code required. For data quality checks before feature engineering:

import boto3

databrew = boto3.client("databrew")

# Create a DataBrew profile job to assess data quality
databrew.create_profile_job(
    Name="user-events-quality-profile",
    DatasetName="user-events-raw",
    OutputLocation={"Bucket": "my-data-lake", "Key": "quality-reports/"},
    RoleArn="arn:aws:iam::123456789:role/GlueDataBrewRole",
    JobSample={"Mode": "CUSTOM_ROWS", "Size": 1000000},
    Configuration={
        "DatasetStatisticsConfiguration": {
            "IncludedStatistics": ["MEAN", "MEDIAN", "STD", "MIN", "MAX",
                                   "UNIQUE_VALUE_COUNT", "MISSING_VALUE_COUNT"]
        }
    }
)

Amazon Athena - Serverless SQL on S3

Athena is a serverless query service. You point it at S3, write SQL, and get results. No cluster to start, no capacity to provision. You pay $5 per TB of data scanned.

import boto3, time

athena = boto3.client("athena", region_name="us-east-1")

# Run an Athena query
response = athena.start_query_execution(
    QueryString="""
        SELECT
            user_id,
            SUM(daily_events)  AS total_events_30d,
            SUM(daily_revenue) AS total_revenue_30d
        FROM data_lake.user_features
        WHERE event_date >= DATE_ADD('day', -30, CURRENT_DATE)
        GROUP BY user_id
        HAVING SUM(daily_revenue) > 0
    """,
    QueryExecutionContext={"Database": "data_lake"},
    ResultConfiguration={
        "OutputLocation": "s3://my-data-lake/athena-results/"
    },
    WorkGroup="primary"
)

query_execution_id = response["QueryExecutionId"]

# Poll until complete
while True:
    status = athena.get_query_execution(QueryExecutionId=query_execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    result = athena.get_query_results(QueryExecutionId=query_execution_id)
    print(f"Rows returned: {len(result['ResultSet']['Rows']) - 1}")  # -1 for header
else:
    reason = status["QueryExecution"]["Status"]["StateChangeReason"]
    raise Exception(f"Query failed: {reason}")

Athena Federated Queries

Athena can query sources beyond S3 - RDS, DynamoDB, Redshift, and custom data sources - using Lambda-based data source connectors:

-- Query DynamoDB from Athena (via the DynamoDB connector)
SELECT
    u.user_id,
    u.subscription_tier,
    COUNT(e.event_id) AS recent_events
FROM "lambda:dynamodb".users u
JOIN data_lake.user_events e ON e.user_id = u.user_id
WHERE e.event_date >= DATE_ADD('day', -7, CURRENT_DATE)
GROUP BY u.user_id, u.subscription_tier;

Athena for ML Feature Exploration

Athena is ideal for exploratory feature analysis before committing to a feature engineering pipeline:

-- Explore feature distributions before building the pipeline
SELECT
    subscription_tier,
    COUNT(*)                                        AS user_count,
    APPROX_PERCENTILE(events_30d, 0.5)              AS median_events,
    APPROX_PERCENTILE(events_30d, 0.95)             AS p95_events,
    APPROX_PERCENTILE(revenue_30d, 0.5)             AS median_revenue,
    AVG(CAST(churned AS DOUBLE))                    AS churn_rate
FROM data_lake.user_features
WHERE event_date = DATE_ADD('day', -1, CURRENT_DATE)
GROUP BY subscription_tier
ORDER BY churn_rate DESC;

Amazon Redshift - Columnar MPP Warehouse

Redshift is AWS's managed columnar data warehouse. Unlike Athena (serverless per-query), Redshift runs always-on clusters (or Serverless) optimized for fast, repeated queries by analyst teams and BI tools.

When Redshift vs. Athena

Consideration	Athena	Redshift
Cluster management	None	Managed (or Serverless)
Query latency	Seconds–minutes	Sub-second for warm queries
Always-on cost	None (pay per query)	Hourly cluster cost
Concurrent users	Scales automatically	Concurrency scaling slots
Best for	Ad hoc, infrequent large scans	BI dashboards, many concurrent users
Format support	Any (Parquet, ORC, CSV, Iceberg)	Redshift-native + Spectrum for S3

Redshift Spectrum - Query S3 from Redshift

Redshift Spectrum extends Redshift queries to S3 data without loading it into Redshift:

-- Create an external schema pointing to the Glue Catalog
CREATE EXTERNAL SCHEMA data_lake_external
FROM DATA CATALOG
DATABASE 'data_lake'
IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftS3Role'
REGION 'us-east-1';

-- Join Redshift-native tables with S3/Iceberg data via Spectrum
SELECT
    r.user_id,
    r.segment,               -- from Redshift internal table
    s.events_30d,            -- from S3 via Spectrum
    s.revenue_30d
FROM redshift_schema.user_segments r
JOIN data_lake_external.user_features s ON r.user_id = s.user_id
WHERE s.event_date = DATEADD(day, -1, CURRENT_DATE);

Redshift ML - SageMaker Autopilot Integration

Redshift ML lets you train and run models using SQL, delegating to SageMaker Autopilot behind the scenes:

-- Train a model using SageMaker Autopilot from Redshift SQL
CREATE MODEL churn_predictor
FROM (
    SELECT events_7d, events_30d, revenue_30d, days_since_purchase,
           account_age_days, subscription_tier, churned
    FROM features.user_churn_features
    WHERE event_date >= DATEADD(day, -90, CURRENT_DATE)
)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftSageMakerRole'
AUTO ON
SETTINGS (
    S3_BUCKET 'my-data-lake'
);

-- Use the model for inference
SELECT user_id, predict_churn(events_7d, events_30d, revenue_30d,
                               days_since_purchase, account_age_days, subscription_tier)
       AS churn_prediction
FROM features.user_churn_features
WHERE event_date = CURRENT_DATE;

Amazon EMR - Managed Spark and Flink

EMR is managed Spark, Flink, Hive, and Presto - you get the full open-source ecosystem on AWS-managed infrastructure without manually configuring clusters.

EMR Serverless - No Cluster Management

EMR Serverless removes cluster management entirely. Submit a Spark job, EMR provisions workers, runs the job, terminates workers:

import boto3

emr_serverless = boto3.client("emr-serverless", region_name="us-east-1")

# Create a serverless application
response = emr_serverless.create_application(
    name="ml-feature-pipeline",
    releaseLabel="emr-6.15.0",
    type="SPARK",
    autoStartConfiguration={"enabled": True},
    autoStopConfiguration={"enabled": True, "idleTimeoutMinutes": 5},
    initialCapacity={
        "DRIVER": {
            "workerCount": 1,
            "workerConfiguration": {"cpu": "4vCPU", "memory": "16GB"}
        },
        "EXECUTOR": {
            "workerCount": 10,
            "workerConfiguration": {"cpu": "4vCPU", "memory": "16GB"}
        }
    }
)
application_id = response["applicationId"]

# Submit a Spark job
job_response = emr_serverless.start_job_run(
    applicationId=application_id,
    executionRoleArn="arn:aws:iam::123456789:role/EMRServerlessRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-data-lake/scripts/feature_engineering.py",
            "entryPointArguments": [
                "--input-path", "s3://my-data-lake/raw/user-events/",
                "--output-path", "s3://my-data-lake/features/",
                "--feature-date", "2026-03-11"
            ],
            "sparkSubmitParameters": (
                "--conf spark.executor.cores=4 "
                "--conf spark.executor.memory=14g "
                "--conf spark.executor.instances=20 "
                "--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
            )
        }
    },
    configurationOverrides={
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://my-data-lake/emr-logs/"
            }
        }
    }
)
print(f"Job run ID: {job_response['jobRunId']}")

AWS Lake Formation - Centralized Governance

Lake Formation is the governance layer for the AWS data lake. It sits on top of S3 and the Glue Catalog, providing fine-grained access control, column masking, row-level filtering, and cross-account data sharing.

import boto3

lf_client = boto3.client("lakeformation", region_name="us-east-1")

# Grant column-level access (ML team can see user_id and features but not PII)
lf_client.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789:role/MLEngineerRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "data_lake",
            "Name": "user_features",
            "ColumnNames": [
                "user_id", "events_7d", "events_30d",
                "revenue_30d", "days_since_purchase", "churned"
            ]
            # PII columns (email, phone, name) are NOT included
        }
    },
    Permissions=["SELECT"]
)

# Grant row-level filter (Analyst team can only see free-tier users)
lf_client.create_data_cells_filter(
    TableData={
        "TableCatalogId": "123456789",
        "DatabaseName": "data_lake",
        "TableName": "user_features",
        "Name": "free_tier_only",
        "RowFilter": {
            "FilterExpression": "subscription_tier = 'free'"
        },
        "ColumnWildcard": {}  # all columns
    }
)

# Share a table with another AWS account (similar to Snowflake Data Sharing)
lf_client.create_lake_formation_opt_in(
    Principal={"DataLakePrincipalIdentifier": "111122223333"},  # consumer AWS account
    Resource={
        "Table": {
            "DatabaseName": "data_lake",
            "Name": "user_features"
        }
    }
)

SageMaker Feature Store

SageMaker Feature Store is a managed feature store with both an online store (low-latency, DynamoDB-backed for real-time serving) and an offline store (S3-backed for training):

import boto3
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.feature_definition import (
    FeatureDefinition, FeatureTypeEnum
)
import pandas as pd
from datetime import datetime, timezone

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Define feature group
feature_group = FeatureGroup(
    name="user-churn-features",
    sagemaker_session=session
)

# Define features
feature_definitions = [
    FeatureDefinition(feature_name="user_id",              feature_type=FeatureTypeEnum.INTEGRAL),
    FeatureDefinition(feature_name="events_7d",            feature_type=FeatureTypeEnum.INTEGRAL),
    FeatureDefinition(feature_name="events_30d",           feature_type=FeatureTypeEnum.INTEGRAL),
    FeatureDefinition(feature_name="revenue_30d",          feature_type=FeatureTypeEnum.FRACTIONAL),
    FeatureDefinition(feature_name="days_since_purchase",  feature_type=FeatureTypeEnum.INTEGRAL),
    FeatureDefinition(feature_name="event_time",           feature_type=FeatureTypeEnum.STRING),  # ISO 8601
]

feature_group.feature_definitions = feature_definitions

# Create the feature group (online + offline)
feature_group.create(
    s3_uri=f"s3://my-data-lake/sagemaker-feature-store/",
    record_identifier_name="user_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,
    online_store_config={
        "EnableOnlineStore": True,
        "TtlDuration": {"Unit": "Days", "Value": 7}  # evict features older than 7 days
    }
)

# Ingest features into the feature group
features_df = pd.DataFrame({
    "user_id": [1001, 1002, 1003],
    "events_7d": [45, 12, 0],
    "events_30d": [180, 55, 3],
    "revenue_30d": [249.97, 49.99, 0.0],
    "days_since_purchase": [2, 15, 120],
    "event_time": [datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")] * 3
})

feature_group.ingest(
    data_frame=features_df,
    max_workers=4,
    wait=True
)
print("Features ingested successfully")

# Real-time feature retrieval for online serving (from the online store)
featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

response = featurestore_runtime.get_record(
    FeatureGroupName="user-churn-features",
    RecordIdentifierValueAsString="1001"
)
features = {f["FeatureName"]: f["ValueAsString"]
            for f in response["Record"]}
print(f"User 1001 features: {features}")

Reference Architecture - End-to-End ML Pipeline on AWS

:::danger Athena and Glue Catalog Are Eventually Consistent Glue Catalog metadata updates can take a few seconds to propagate. If you write data to S3 and immediately query it via Athena, you may see stale partition metadata - especially after adding new partitions. Always call MSCK REPAIR TABLE or use Glue's batch_create_partition API after writing new partitions.

-- After writing new data partitions to S3, repair the table
MSCK REPAIR TABLE data_lake.user_features;

:::

:::warning EMR Cluster Startup Time EMR cluster startup takes 3–8 minutes. For latency-sensitive jobs, use EMR Serverless with pre-initialized capacity or keep a persistent cluster warm. Scheduling a 10-minute Spark job every 15 minutes on a cold EMR cluster wastes 50% of the runtime on cluster provisioning. :::

Interview Q&A

Q1: A new data engineer joins a company using AWS. They see Athena, Glue, and Redshift all querying the same S3 data. Why would you use three different services for the same data?

Each service has a different optimization target.

Athena is serverless SQL - no cluster, no setup, pay per TB scanned. It is optimal for ad hoc queries by analysts who run queries infrequently and cannot justify keeping a cluster warm. A data scientist who runs 5 queries per week on terabytes of raw data pays cents and maintains zero infrastructure.

Glue is serverless PySpark ETL - not a query service, but a transformation service. You use it to run Spark jobs that transform raw S3 data into feature tables, then write the results back to S3. Glue uses the same Catalog as Athena - both see the same table definitions.

Redshift is a persistent MPP data warehouse. It is optimal for concurrent BI users running dashboards - 50 analysts all hitting a Looker dashboard simultaneously would overwhelm Athena's per-query queuing and exceed practical cost targets. Redshift keeps hot data in local SSD storage for sub-second query times. Redshift Spectrum extends to S3 when you need to join with data that is not loaded into Redshift.

The short answer: Athena for ad hoc exploration (no ops, per-query cost), Glue for transformation (serverless Spark), Redshift for concurrent BI users (persistent warehouse, fast repeated queries).

Q2: What is the Glue Data Catalog and why is it important for the AWS data ecosystem?

The Glue Data Catalog is a Hive-compatible metadata store that holds database and table definitions for your S3 data. It stores schema information (column names, types), partition metadata (which partition keys exist), and connection properties.

Its importance is interoperability: the same table definition in the Glue Catalog can be read by Athena, EMR, Redshift Spectrum, and Glue ETL jobs. Define a table once - query it from any compute service. Without the Catalog, each service would need its own schema management, creating schema drift and consistency problems.

Lake Formation sits on top of the Catalog to add access control - you grant permissions at the Catalog level, and they apply uniformly across all services that read from the Catalog. This is the foundation of the AWS data governance model.

Q3: Explain the difference between the SageMaker Feature Store's online store and offline store. When would you use each?

The offline store is S3-backed - it stores all historical feature records, partitioned by event time. It is optimized for high-throughput batch reads: reading millions of records for model training. Latency is seconds. Used for: generating training datasets, backfilling features, batch scoring jobs.

The online store is DynamoDB-backed - it stores the most recent feature value for each entity. It is optimized for low-latency single-record reads: retrieving features for one user during a real-time prediction request. Latency is milliseconds. Used for: online serving endpoints that need features at prediction time.

The key insight: you write features once (using the Feature Group's put_record or ingest API) and they are automatically persisted to both stores. At training time, you use the offline store to build a consistent, point-in-time correct training dataset. At serving time, the prediction endpoint reads from the online store for sub-10ms feature retrieval.

Q4: A Spark job running on EMR takes 3 hours. A new engineer suggests switching to Glue. What questions would you ask before making that recommendation?

First: what is the startup overhead? If the job is already running 3 hours on an EMR cluster, the 3-7 minute cluster startup is negligible. If the job ran every 30 minutes on EMR Serverless and Glue both have cold start penalties, the comparison changes.

Second: does the job use any EMR-specific configurations? EMR supports custom Spark configurations, specific Spark versions, and integration with other open-source tools (Hudi, Delta, Iceberg) that Glue may lag behind on version-wise.

Third: what is the actual bottleneck? If the job is I/O-bound (reading 10 TB from S3), the execution engine matters less than the cluster size. Moving to Glue without addressing the I/O pattern will not change the 3-hour runtime.

Fourth: what is the cost comparison? EMR on-demand clusters cost more per compute-hour than Glue's DPU pricing for short jobs, but for a 3-hour job with a large cluster, EMR Spot instances might be cheaper than Glue.

The recommendation is not always "use Glue." Glue is simpler to operate but less flexible. EMR gives you full Spark configurability and access to spot instances. For a 3-hour production job, EMR Serverless with spot instances is often the right balance.

Q5: How does Lake Formation improve on S3 bucket policies and IAM roles for data access control?

S3 bucket policies and IAM roles operate at the file level - you can grant or deny access to entire prefixes (folders). This is too coarse for data governance: granting a data scientist access to s3://bucket/features/ gives them access to all columns in all tables, including PII.

Lake Formation operates at the table and column level through the Glue Catalog. You can grant SELECT access to specific columns of a specific table, independently of the underlying S3 file structure. You can apply row-level filters so different teams see different subsets of the same table. You can grant access to one table in a database without exposing any other tables.

The mechanism: Lake Formation wraps Glue Catalog resources with its own permission model. When Athena or EMR queries a table, it goes through Lake Formation's access control before the Catalog reveals partition metadata and S3 locations. If permission is denied, the query never even learns where the data lives in S3.

This enables a data mesh model: a central Platform team manages the Glue Catalog and Lake Formation policies; domain teams are granted precisely-scoped access to only the data they need; cross-account sharing is handled by Lake Formation's resource sharing model rather than complex cross-account S3 bucket policies.

Why This Exists - The Cloud Data Puzzle​

The AWS Data Landscape​

Amazon S3 - The Lakehouse Foundation​

Storage Classes and Lifecycle Policies​

S3 as an Iceberg Lakehouse​

AWS Glue - Serverless ETL and the Data Catalog​

The Glue Data Catalog​

Glue ETL Jobs - Serverless PySpark​

Glue DataBrew - No-Code Data Quality​

Amazon Athena - Serverless SQL on S3​

Athena Federated Queries​

Athena for ML Feature Exploration​

Amazon Redshift - Columnar MPP Warehouse​

When Redshift vs. Athena​

Redshift Spectrum - Query S3 from Redshift​

Redshift ML - SageMaker Autopilot Integration​

Amazon EMR - Managed Spark and Flink​

EMR Serverless - No Cluster Management​

AWS Lake Formation - Centralized Governance​

Cross-Account Data Sharing with Lake Formation​

SageMaker Feature Store​

Reference Architecture - End-to-End ML Pipeline on AWS​

Interview Q&A​