Cloud vs On-Prem GPU Infrastructure
The $50 Million Question
It is budget season. Your team has been running training jobs on AWS p4d.24xlarge instances - 8 A100 40GB GPUs per instance - and has spent roughly $4.7 million over the last 18 months. A vendor just presented you with an alternative: a 10-node DGX A100 cluster, 8 GPUs per node, 80 GPUs total - $3.8 million for the hardware, $800k for networking (InfiniBand HDR), and $200k for installation, or $4.8 million in year one. Year two and beyond: $500k/year.
The vendor's pitch is compelling. By year three, you have spent the $4.8M plus $500K in maintenance - about $5.3M - and own the cluster outright. Your AWS equivalent over the same three years: $9.3M. The math seems obvious.
But your CFO pushes back. "What is our GPU utilization on that cluster?" You look at the monitoring data. Your AWS utilization is effectively 100% - you provision exactly what you need and shut it down. An on-prem cluster needs to handle your peak load (800 GPU-hours/day), which means most days it is sitting at 50% utilization. Idle GPUs do not save you money - they cost you money in power and operations overhead.
Then your infrastructure lead raises another concern. "What if H100s become the standard and A100s are obsolete in 18 months? We are locked into A100 hardware. AWS can just migrate us to p5 instances." This is the hardware generation lock-in problem. Cloud providers absorb the capital risk of technology refreshes. On-prem owners eat the depreciation.
This decision plays out at every AI company, at every scale, every budget cycle. There is no universal right answer - but there is a rigorous framework for finding the right answer for your specific situation. This lesson builds that framework from the ground up: cost models, break-even analysis, utilization economics, and the operational realities that spreadsheets do not capture.
The stakes are high. The GPU infrastructure decisions made in 2024-2026 will define which AI companies survive to 2030. Getting this wrong does not just waste money - it constrains your ability to train competitive models and moves your competitive timeline out by months or years.
Why This Exists - The Economics of Scale Changed Everything
Before the deep learning era, most ML compute was small enough that on-premise servers handled it trivially. A team of researchers could run experiments on a rack of workstations. The decision was easy: buy hardware, run it for years.
The GPU compute explosion changed everything. Training costs went from thousands of dollars to millions. The hardware refresh cycle accelerated - GPU generations now turn over every 18-24 months. The staffing requirements for running a large GPU cluster went from a part-time sysadmin to a dedicated platform engineering team.
Cloud computing emerged as the flexible alternative. AWS launched EC2 GPU instances in 2010. The initial value proposition was simplicity: no hardware to buy, no data center to manage, scale up and down on demand. For early deep learning workloads, this was transformative. A researcher could spin up a cluster overnight and delete it when done.
The problem is that "pay as you go" pricing is convenient but expensive at scale. AWS charges a substantial premium for the flexibility. Owning and operating an A100 yourself works out to roughly $0.80-1.00/GPU-hour all-in (hardware cost ~$10,000-15,000 per GPU, 3-year depreciation, ~26,000 GPU-hours per GPU over that period, plus power and operations), while the same GPU on-demand on AWS in 2023 runs roughly $4.10/GPU-hour. Cloud pricing includes cloud provider margin, the cost of flexibility, management overhead, networking, storage, and support. But at sufficient scale, if your utilization is high and predictable, the premium is hard to justify.
The result is a bifurcated market: small teams and variable-workload teams use cloud; large teams with stable high utilization build on-prem or use specialty cloud providers (CoreWeave, Lambda Labs) that offer cloud flexibility at near-on-prem economics.
Historical Context - How the Decision Evolved
2012-2016: GPU compute was still manageable. AlexNet trained on two GTX 580 GPUs. ResNet-50 trained on 8 GPUs. The cost of training was in the hundreds to thousands of dollars. On-premise lab machines with a few GPUs handled most research.
2017-2019: The scale inflection point. Transformer (Vaswani et al., 2017) introduced architectures that scaled with compute. BERT (2018) and GPT-2 (2019) pushed training costs into the tens of thousands. Cloud started making sense for teams without hardware.
2020-2021: The frontier model era. GPT-3 (2020) reportedly cost $4-12 million in compute. At this scale, the build-vs-buy analysis became serious business. OpenAI's exclusive Microsoft Azure deal and Google's internal TPU investments reflected two different answers to the same question. On-premise investments in DGX clusters became common at well-funded AI labs.
2022-2023: GPU shortage changes everything. The post-ChatGPT demand surge created a GPU supply crisis. Companies could not buy enough on-prem hardware even if they wanted to. Cloud providers and specialty providers (CoreWeave, Lambda Labs, Together) emerged as the only option for teams that needed GPUs immediately. CoreWeave in particular built its business around renting out GPU capacity at a moment when the hyperscalers could not meet demand.
2024-2025: The market matures. GPU supply normalized somewhat. The on-prem vs cloud decision became nuanced: some workloads (long training runs, stable demand, data sovereignty requirements) favor on-prem or reserved cloud. Others (inference, experimentation, variable demand) favor on-demand cloud. Multi-cloud strategies emerged to optimize across providers.
The "aha moment" for the industry: utilization rate is the single most important variable in the build-vs-buy decision. A cluster at 90% utilization pays for itself much faster than the same cluster at 40% utilization. Teams that centralized their GPU clusters (serving multiple teams from a shared pool) dramatically improved utilization and made on-prem economics work.
Core Concepts - The TCO Framework
What Goes Into Total Cost of Ownership (TCO)
TCO analysis for GPU infrastructure requires capturing all costs, not just hardware. Many teams make the mistake of comparing AWS per-hour pricing against only hardware cost. The full picture:
Cloud costs:
- On-demand instance pricing (list price)
- Reserved instance savings (1-year or 3-year commitment, up to 40-60% discount)
- Spot instance pricing (60-90% discount, with interruption risk)
- Data egress fees (can be significant for large model artifact transfers)
- Managed service overhead (EKS for Kubernetes, S3 for storage, CloudWatch for monitoring)
- Support plan costs (Enterprise support on AWS adds up to 10% of monthly spend)
On-premises costs:
- Hardware: GPUs, servers, networking (InfiniBand or Ethernet), storage
- Data center: rack space, power, cooling (PUE overhead)
- Power: ongoing OpEx, at current US commercial rates ($0.08-0.15/kWh)
- Networking: ISP links for internet-facing services
- Staff: platform engineers, SREs, on-call rotation
- Software: licenses, monitoring tools, scheduler (Slurm, Kubernetes)
- Maintenance contracts: hardware warranties, support
- Hardware refresh: what happens after 3 years when hardware is obsolete
The Cost Model
Let's build a per-GPU-hour cost model for both options.
On-premises (DGX H100 SXM5 node):
A DGX H100 system (8x H100 80GB SXM5) costs approximately $375,000, plus roughly $50,000 per node for networking, storage, and rack infrastructure. Total per-node hardware cost: ~$425,000.
Per GPU hardware cost: $425,000 / 8 = $53,125.
Amortize over 3 years (typical depreciation): $17,708/GPU/year.
Operating hours per year: 24 * 365 = 8,760 GPU-hours.
At 80% utilization (realistic for a well-managed shared cluster): 7,008 productive GPU-hours/year.
Hardware amortized cost: $17,708 / 7,008 hours = $2.53/productive GPU-hour.
Add operational costs. A 64-GPU cluster (8 DGX nodes) requires approximately 1-2 FTE for operations. At $250k/year fully loaded (salary + benefits + overhead):
Operational cost per productive GPU-hour: 1.5 FTE * $250k = $375,000/year; $375,000 / 448,512 productive GPU-hours (64 GPUs * 7,008 hours) = $0.84/GPU-hour.
Power cost. Each H100 draws up to 700W (SXM5 max TDP). At 80% utilization, assume average 560W per GPU. Data center PUE of 1.3x.
Annual power cost per GPU: 0.560 kW * 8,760 hours * 1.3 PUE * $0.10/kWh ≈ $637/GPU/year.
At 7,008 productive hours: ~$0.09/productive GPU-hour.
Networking overhead (InfiniBand fabric amortized): ~$0.10/GPU-hour.
Total on-prem cost: ~$3.56/productive GPU-hour at 80% utilization.
AWS equivalent - p5.48xlarge (8x H100 SXM5):
- On-demand: ~$98.32/hour ($12.29/GPU-hour)
- 1-year reserved (no upfront): ~$8.05/GPU-hour
- 3-year reserved (partial upfront): ~$5.25/GPU-hour
- Spot (H100): varies, typically $5-8/GPU-hour depending on availability
Even the 3-year reserved rate of $5.25/GPU-hour is roughly 1.5x the on-prem figure of ~$3.56/GPU-hour.
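To keep this arithmetic auditable, here is the same cost model as a small Python sketch. Every input is an assumption carried over from the figures above rather than a quote, so swap in your own numbers.
```python
# On-prem cost model using the assumptions above (illustrative, not a vendor quote)
HOURS_PER_YEAR = 24 * 365                 # 8,760

node_hw_cost = 425_000                    # DGX H100 node incl. networking/storage share
gpus_per_node = 8
amortization_years = 3

num_gpus = 64                             # 8-node cluster
ops_cost_per_year = 375_000               # ~1.5 FTE fully loaded
utilization = 0.80

avg_gpu_kw = 0.560                        # 80% of the 700W SXM5 TDP
pue = 1.3
power_price_per_kwh = 0.10                # assumed US commercial rate
network_per_hour = 0.10                   # amortized InfiniBand fabric (assumption)

productive_hours = HOURS_PER_YEAR * utilization                                       # 7,008
hw_per_hour = node_hw_cost / gpus_per_node / amortization_years / productive_hours    # ~$2.53
ops_per_hour = ops_cost_per_year / (num_gpus * productive_hours)                      # ~$0.84
power_per_hour = avg_gpu_kw * HOURS_PER_YEAR * pue * power_price_per_kwh / productive_hours  # ~$0.09

total = hw_per_hour + ops_per_hour + power_per_hour + network_per_hour
print(f"On-prem: ${total:.2f}/productive GPU-hour at {utilization:.0%} utilization")
print("AWS 3-year reserved (assumed): $5.25/GPU-hour")
```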
The on-prem advantage is real but only materializes at 80%+ utilization. Let's see what happens at lower utilization.
Break-Even Utilization Analysis
At utilization rate u (the fraction of time the GPU is doing productive work):
On-prem effective cost per productive GPU-hour:
cost_on_prem(u) = (annual cost per GPU) / (8,760 hours * u)
As u decreases, the denominator shrinks (fewer productive hours), but the numerator stays the same (you still pay for the cluster whether it's busy or idle). Cost per productive hour rises.
Cloud cost per productive GPU-hour (reserved): fixed at the reserved rate regardless of utilization (you pay per hour used, not per hour you own the hardware).
Break-even occurs when cost_on_prem(u) = cloud rate.
Solving for u in the H100 case above (on-prem total annual cost per GPU ~$18,628, at 8,760 hours/year) against the 3-year reserved AWS rate of $5.25/GPU-hour:
u* = 18,628 / (8,760 * 5.25) ≈ 0.405
Break-even is at approximately 40.5% utilization. Above 40% utilization, on-prem is cheaper. Below 40%, cloud is cheaper.
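The same relationship as a short helper, using the figures assumed above (~$18,628/GPU/year on-prem, $5.25/GPU-hour reserved cloud):
```python
def break_even_utilization(annual_cost_per_gpu: float, cloud_rate_per_hour: float,
                           hours_per_year: int = 8_760) -> float:
    """Utilization at which on-prem cost per productive GPU-hour equals the cloud rate."""
    return annual_cost_per_gpu / (hours_per_year * cloud_rate_per_hour)

print(f"{break_even_utilization(18_628, 5.25):.1%}")   # ~40.5%
```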
This is the key insight: on-prem wins if and only if you can keep the cluster busy. Teams that pool multiple workloads (training, fine-tuning, evaluation, inference) on a shared cluster can achieve 70-85% utilization and capture the full on-prem cost advantage. Teams with bursty or unpredictable workloads will struggle to reach 40% and should use cloud.
Cloud Provider Landscape
AWS GPU Instances
AWS offers several GPU instance families relevant to ML training:
p4d.24xlarge - 8x A100 40GB SXM, 96 vCPUs, 1.1TB RAM, 400 Gbps EFA networking.
- On-demand: ~$32.77/hour ($4.10/GPU-hour)
- Best for: A100 training workloads, multi-node via EFA
p5.48xlarge - 8x H100 80GB SXM5, 192 vCPUs, 2TB RAM, 3.2 Tbps EFA networking.
- On-demand: ~$98.32/hour ($12.29/GPU-hour)
- Best for: frontier model training, H100-optimized code (FlashAttention-3, FP8)
g5.48xlarge - 8x A10G 24GB, 192 vCPUs, 768GB RAM, 25 Gbps networking.
- On-demand: ~$16.29/hour ($2.04/GPU-hour)
- Best for: inference, fine-tuning smaller models
trn1.32xlarge - 16x AWS Trainium, 512GB HBM2e, 800 Gbps EFA.
- Competitive pricing vs p4d, but requires rewriting training code for Neuron SDK (not PyTorch-native).
UltraClusters: AWS assembles p5 instances into high-density clusters with the full 3.2 Tbps of EFA bandwidth per node and an optimized network topology. Capacity must be requested in advance. This is how you get 4096-H100 training clusters on AWS.
GCP GPU Instances
a2-megagpu-16g - 16x A100 40GB, 96 vCPUs, 1.4TB RAM.
- On-demand: ~$73/hour
- Supports multi-node via Google's RDMA network (RoCE v2)
a3-highgpu-8g - 8x H100 80GB SXM5, 208 vCPUs, 1.87TB RAM, 3.2 Tbps NIC.
- On-demand: ~$98/hour
- Best for frontier training on GCP
TPU v4 pods - Google's custom accelerator, up to 4096 chips per pod.
- Price-performance competitive with H100 for JAX workloads
- Not compatible with PyTorch-native code (requires JAX or TensorFlow)
Azure GPU Instances
Standard_ND96amsr_A100_v4 - 8x A100 80GB SXM4, 96 vCPUs, 1.9TB RAM, 800 Gbps InfiniBand HDR.
- On-demand: ~$32/hour
- Notable: InfiniBand (not EFA) - better NCCL performance than EFA for some workloads
Standard_ND96isr_H100_v5 - 8x H100 80GB SXM5, 96 vCPUs, 900GB RAM, 3.2 Tbps NDR InfiniBand.
- On-demand: ~$98/hour
Specialty Cloud Providers
CoreWeave - Built specifically for GPU workloads. H100 pricing runs at a fraction of AWS's $12.29/GPU-hour on-demand rate - roughly $2.00-2.50/GPU-hour with reserved capacity contracts (see the comparison below). CoreWeave operates on wholesale pricing and passes the savings to customers, with InfiniBand-connected H100 clusters. The go-to choice for teams that want cloud flexibility at near-on-prem costs.
Lambda Labs - ML-focused cloud. H100 clusters at ~$1.99-2.49/GPU-hour. Simpler interface than AWS. Good for teams that do not need AWS ecosystem integration.
Together AI, Vast.ai, RunPod - Cheaper options for smaller workloads ($0.80-1.50/H100-hour). Lower SLAs, more variability in hardware quality and network performance. Fine for experiments, not for multi-week production training runs.
The specialty provider landscape has materially changed the economics. For a team running 1000 GPU-hours/day on H100s:
- AWS on-demand: $12,290/day
- AWS 3-year reserved: $5,250/day
- CoreWeave reserved: ~$2,000-2,500/day
- On-prem (at 80% utilization): ~$3,560/day
CoreWeave's pricing narrows the on-prem advantage considerably. The on-prem vs cloud decision is increasingly an on-prem vs specialty-cloud-provider decision.
Reserved vs Spot vs On-Demand Pricing Strategy
Spot Instance Economics
AWS Spot and GCP Preemptible instances offer 60-90% discounts in exchange for potential interruptions. Historical spot pricing for H100 instances on AWS averages 40-70% of on-demand pricing (varies by region, availability zone, and demand).
Spot instances make economic sense when:
- Your training code is fault-tolerant (checkpointing + TorchElastic, as covered in the previous lesson)
- Your job can tolerate interruptions and restarts
- Your workload can be completed in reasonable wall-clock time despite potential interruptions
The expected cost savings from spot vs on-demand:
If spot discount is 70% and interruption probability per hour is 2%, you need to account for the wasted compute on interruptions. With async checkpointing and 30-minute checkpoint intervals, an interruption wastes at most 30 minutes of compute. Expected wasted compute per GPU-hour: 0.02 * 0.5 hours = 0.01 GPU-hours. Cost of waste: 0.01 * on-demand_price = negligible compared to 70% discount.
Spot is almost always the right choice for fault-tolerant training workloads.
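A quick sketch of that expected-cost argument - the 70% discount, 2%/hour interruption rate, 30-minute checkpoint interval, and $12.29/GPU-hour on-demand rate are the assumptions used in this section:
```python
def effective_spot_cost(on_demand_price: float,
                        spot_discount: float = 0.70,
                        interrupt_prob_per_hour: float = 0.02,
                        checkpoint_interval_hours: float = 0.5) -> float:
    """Expected spot cost per hour of useful work, assuming each interruption
    wastes at most one checkpoint interval of compute (the worst case)."""
    spot_price = on_demand_price * (1 - spot_discount)
    wasted_fraction = interrupt_prob_per_hour * checkpoint_interval_hours   # 0.01
    return spot_price / (1 - wasted_fraction)

on_demand = 12.29   # assumed H100 on-demand $/GPU-hour from earlier in the lesson
print(f"Effective spot cost: ${effective_spot_cost(on_demand):.2f}/GPU-hour vs ${on_demand} on-demand")
```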
```bash
# AWS CLI: launch a spot fleet for training
aws ec2 request-spot-fleet --spot-fleet-request-config '{
  "IamFleetRole": "arn:aws:iam::123456789:role/aws-ec2-spot-fleet-role",
  "AllocationStrategy": "lowestPrice",
  "TargetCapacity": 8,
  "SpotPrice": "6.00",
  "LaunchSpecifications": [
    {
      "InstanceType": "p5.48xlarge",
      "ImageId": "ami-0abcdef1234567890",
      "SubnetId": "subnet-0123456789abcdef0",
      "IamInstanceProfile": {"Arn": "arn:aws:iam::123456789:instance-profile/TrainingRole"},
      "UserData": "base64-encoded-startup-script"
    }
  ],
  "Type": "request"
}'
```
Savings Plans and Reserved Instances
For baseline steady-state compute (not peak), Reserved Instances (or Savings Plans on AWS) offer significant discounts with commitment:
- 1-year No Upfront: ~25-35% discount vs on-demand
- 1-year All Upfront: ~35-40% discount
- 3-year Partial Upfront: ~50-60% discount
Strategy: reserve capacity for your floor utilization (the compute you are certain you will use), and use spot for your variable/peak load. This is the "hybrid stack" approach:
Total compute = Reserved baseline + Spot overflow + On-demand emergency
Example: a team that needs 400-800 GPU-hours/day should reserve ~300 GPU-hours/day (the minimum they always need) and use spot for the remaining 100-500 GPU-hours/day.
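A sketch of how that split prices out per day. The reserved rate is the AWS 3-year figure assumed earlier and the ~$3.70 spot rate follows from the spot calculation above; none of these are quotes.
```python
def daily_cost(demand_gpu_hours: float, reserved_baseline: float,
               reserved_rate: float = 5.25, spot_rate: float = 3.70) -> float:
    """One day's cost under a reserved-baseline + spot-overflow strategy.
    Reserved capacity is paid for whether or not it is used."""
    overflow = max(0.0, demand_gpu_hours - reserved_baseline)
    return reserved_baseline * reserved_rate + overflow * spot_rate

for demand in (400, 600, 800):   # GPU-hours/day, matching the example range
    print(f"{demand} GPU-hours/day -> ${daily_cost(demand, reserved_baseline=300):,.0f}/day")
```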
Multi-Cloud Strategies
Why Multi-Cloud for GPUs
GPU availability varies dramatically across cloud providers, regions, and time. During the 2023-2024 GPU shortage, H100 instances were simply unavailable on AWS for weeks in some regions. Teams that had multi-cloud capability could shift to GCP or Azure when AWS capacity was constrained.
Multi-cloud also provides negotiating leverage. A credible ability to migrate workloads to a competing provider strengthens your position in enterprise pricing negotiations.
The practical challenges of multi-cloud:
- Network egress costs: moving large model artifacts or datasets between clouds is expensive (a 1TB transfer runs roughly $80-90 in egress fees).
- Different APIs: AWS EFA, GCP RDMA, Azure InfiniBand all have different NCCL configurations.
- Different storage: S3 vs GCS vs Azure Blob - object storage APIs are similar but not identical.
- Identity and access management: different IAM systems, different credential formats.
The solution: abstract cloud-specific details behind a container and Kubernetes layer. If your training jobs run in containers with cloud-agnostic storage access (for example, an S3-compatible interface such as MinIO, or a sync tool like rclone), you can move workloads between clouds with configuration changes, not code changes.
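One way to make that concrete is to route all object-storage access through a single S3-compatible client whose endpoint comes from configuration. A sketch using boto3 - the environment variable names, bucket, and key are placeholders, not a prescribed convention:
```python
import os
import boto3

def get_object_store():
    """Return an S3-compatible client. Point it at AWS S3, an on-prem MinIO
    deployment, or another S3-compatible endpoint purely via environment config."""
    return boto3.client(
        "s3",
        endpoint_url=os.environ.get("OBJECT_STORE_ENDPOINT"),   # None -> default AWS S3
        aws_access_key_id=os.environ["OBJECT_STORE_KEY"],
        aws_secret_access_key=os.environ["OBJECT_STORE_SECRET"],
    )

# The training job makes the same call regardless of which cloud it is scheduled on
s3 = get_object_store()
s3.upload_file("checkpoint.pt", "training-artifacts", "run-001/checkpoint.pt")
```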
Kubernetes for GPU Scheduling
NVIDIA Device Plugin
Kubernetes does not natively understand GPUs. The NVIDIA device plugin extends Kubernetes to expose GPUs as schedulable resources. Pods can then request GPUs just like CPU/memory.
```yaml
# Install NVIDIA device plugin via Helm:
#   helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
#   helm install --generate-name nvdp/nvidia-device-plugin

# Training pod requesting 8 GPUs
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3
    resources:
      requests:
        nvidia.com/gpu: 8
        memory: "500Gi"
        cpu: "64"
      limits:
        nvidia.com/gpu: 8
    volumeMounts:
    - name: shared-storage
      mountPath: /shared
    env:
    - name: NCCL_DEBUG
      value: "INFO"
  volumes:
  - name: shared-storage
    persistentVolumeClaim:
      claimName: training-pvc
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```
GPU Sharing with MIG (Multi-Instance GPU)
NVIDIA's Multi-Instance GPU (MIG) partitions a single A100 or H100 into up to 7 isolated GPU instances, each with dedicated HBM memory, L2 cache, and compute engines. This enables multiple smaller workloads to share a physical GPU with full isolation.
MIG profiles for H100 80GB SXM5:
- 1g.10gb - 1/7th GPU, 10GB HBM (7 per GPU)
- 2g.20gb - 2/7th GPU, 20GB HBM (3 per GPU + 1g.10gb)
- 3g.40gb - 3/7th GPU, 40GB HBM (2 per GPU)
- 4g.40gb - 4/7th GPU, 40GB HBM (1 per GPU + 3g.40gb)
- 7g.80gb - Full GPU (1 per GPU, defeats the purpose of MIG)
MIG is valuable for inference workloads and fine-tuning where full GPU memory is not needed. A single H100 with 7x 1g.10gb MIG instances can run seven concurrent LoRA fine-tuning jobs on ~7GB models.
```bash
# Enable MIG mode on an H100 (requires root)
sudo nvidia-smi -i 0 -mig 1

# Create MIG instances
sudo nvidia-smi mig -cgi 3g.40gb,3g.40gb -C

# Verify MIG instances
nvidia-smi -L
# GPU 0: NVIDIA H100 SXM5 (UUID: GPU-xxxxxxxx)
#   MIG 3g.40gb Device 0: (UUID: MIG-xxxxxxxx)
#   MIG 3g.40gb Device 1: (UUID: MIG-xxxxxxxx)
```
In Kubernetes, request MIG instances directly:
```yaml
resources:
  requests:
    nvidia.com/mig-3g.40gb: 1   # Request one 3g.40gb MIG instance
  limits:
    nvidia.com/mig-3g.40gb: 1
```
Autoscaling with Karpenter
Karpenter is a Kubernetes node autoscaler (originally AWS-native, now multi-cloud) that provisions nodes just-in-time based on pending pod requirements. For GPU workloads, it can launch p5.48xlarge or p4d.24xlarge instances automatically when training jobs are submitted.
```yaml
# NodePool for GPU training workloads
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-training
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]   # Prefer spot
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["p5.48xlarge", "p4d.24xlarge"]
      - key: topology.kubernetes.io/zone
        operator: In
        values: ["us-east-1a", "us-east-1b", "us-east-1c"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: gpu-nodeclass
  limits:
    nvidia.com/gpu: 1024   # Max 128 p5 instances (1024 GPUs total)
  disruption:
    consolidationPolicy: WhenEmpty   # Only scale down empty nodes
    consolidateAfter: 30s
```
FinOps for GPU Clusters
Cost Attribution
Without cost attribution, GPU costs are a black box. You know you spent $4M last month on GPUs but cannot say which team, model, or project drove that spend. This makes optimization impossible.
Kubernetes labels enable cost attribution:
```yaml
# Every training pod should have these labels
metadata:
  labels:
    team: "research"
    project: "llama-finetune-v3"
    cost-center: "ml-research"
    experiment-id: "exp-20241105-001"
    model-size: "70b"
```
Use a FinOps tool (OpenCost, Kubecost, or the cloud-native cost explorer) to aggregate spend by these labels. Report weekly to team leads: "Team A used 40,000 GPU-hours this month, costing $Y at the blended rate."
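The FinOps tool normally does this aggregation for you, but the core computation is just a group-by over labeled usage records. A minimal sketch - the records and the blended $/GPU-hour rate are hypothetical:
```python
from collections import defaultdict

# Hypothetical usage records, e.g. exported from OpenCost/Kubecost or a Prometheus query
usage_records = [
    {"team": "research", "project": "llama-finetune-v3", "gpu_hours": 12_400},
    {"team": "research", "project": "eval-harness",      "gpu_hours": 1_900},
    {"team": "product",  "project": "search-ranker",     "gpu_hours": 6_200},
]
BLENDED_RATE = 3.56   # assumed $/GPU-hour from the on-prem model earlier

spend_by_team = defaultdict(float)
for record in usage_records:
    spend_by_team[record["team"]] += record["gpu_hours"] * BLENDED_RATE

for team, spend in sorted(spend_by_team.items(), key=lambda kv: -kv[1]):
    print(f"{team}: ${spend:,.0f} this period")
```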
Idle Detection
Idle GPUs are wasted money. In a shared Kubernetes cluster, jobs sometimes hold GPU reservations without using them (waiting for data, post-processing outputs, or simply bugs). Detect idle GPUs with:
```python
import subprocess
from datetime import datetime

def check_gpu_utilization(threshold_pct: int = 5, duration_minutes: int = 30):
    """
    Detect GPUs that have been below threshold_pct utilization
    for more than duration_minutes. These are idle and should be
    reported to the job owner for reclamation.
    """
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True
    )
    idle_gpus = []
    for line in result.stdout.strip().split("\n"):
        parts = [p.strip() for p in line.split(",")]
        gpu_index = int(parts[0])
        utilization = float(parts[1])
        mem_used_mb = float(parts[2])
        mem_total_mb = float(parts[3])
        mem_utilization = mem_used_mb / mem_total_mb * 100
        if utilization < threshold_pct and mem_utilization > 10:
            # GPU is idle but has memory allocated (held by a process)
            idle_gpus.append({
                "gpu_index": gpu_index,
                "utilization": utilization,
                "memory_used_gb": mem_used_mb / 1024,
                "detected_at": datetime.utcnow().isoformat()
            })
    return idle_gpus

# Run continuously and alert after duration_minutes of sustained idleness.
# In production: ship this data to your monitoring system (Prometheus/Grafana)
# and trigger PagerDuty alerts to job owners after a configurable idle threshold.
```
Right-Sizing
Right-sizing means matching the GPU count and type to actual job requirements, not to what was convenient to request.
Common over-provisioning patterns:
- Requesting 8 GPUs for a fine-tuning job that only needs 2 (the developer copied a training job config)
- Using A100 instances for inference workloads that would run fine on A10G (3x cheaper)
- Using multi-node jobs for models that fit on a single node (network overhead wastes GPU cycles)
Right-sizing checklist:
- Is GPU memory utilization > 70% at peak? If not, consider a smaller GPU or more aggressive gradient checkpointing.
- Is GPU compute utilization > 60% on average during training? If not, the data pipeline may be the bottleneck (CPU/IO bound).
- Is multi-node all-reduce time > 30% of total step time? If yes, consider whether the model actually needs multiple nodes or fits on one.
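These checks are easy to automate once the metrics are exported (from a profiler or the DCGM exporter). A sketch using the checklist's thresholds; the three inputs are assumed to be measured fractions between 0 and 1:
```python
def right_sizing_report(peak_mem_util: float, avg_compute_util: float,
                        allreduce_fraction_of_step: float) -> list:
    """Flag common over-provisioning patterns from measured metrics."""
    findings = []
    if peak_mem_util < 0.70:
        findings.append("GPU memory <70% at peak: consider a smaller GPU.")
    if avg_compute_util < 0.60:
        findings.append("Compute utilization <60%: data pipeline is likely the bottleneck (CPU/IO bound).")
    if allreduce_fraction_of_step > 0.30:
        findings.append("All-reduce >30% of step time: check whether the model really needs multiple nodes.")
    return findings or ["No right-sizing issues flagged."]

for finding in right_sizing_report(peak_mem_util=0.55, avg_compute_util=0.45,
                                   allreduce_fraction_of_step=0.10):
    print(finding)
```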
Mermaid Diagrams
(This lesson's diagrams: the cloud vs on-prem decision framework, cost per GPU-hour by utilization rate, and the Kubernetes GPU scheduling architecture.)
Production Engineering Notes
Capacity Planning for On-Prem Clusters
Purchasing on-prem hardware requires forecasting 12-18 months ahead (procurement lead times for DGX systems). The planning process:
- Baseline current utilization: measure GPU-hours consumed per week over the last 6-12 months. Identify the trend.
- Model future demand: project training run frequency, model size growth, inference serving requirements.
- Size to 75-80% average utilization: size for 100% utilization (no headroom) and training runs queue up waiting for capacity; size for 50% and you pay for idle hardware. A sizing sketch follows this list.
- Account for maintenance windows: on-prem clusters need scheduled downtime for firmware updates, cooling maintenance, and hardware replacements. Budget 5-10% downtime.
- Include spare nodes: on-prem clusters should have 5-10% spare capacity for node replacement during hardware failures without reducing usable capacity.
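A back-of-the-envelope sizing helper for step 3, with the downtime and spare-capacity allowances from steps 4 and 5 folded in (all inputs illustrative):
```python
import math

def gpus_needed(avg_demand_gpu_hours_per_week: float,
                target_utilization: float = 0.78,
                downtime_fraction: float = 0.07,
                spare_fraction: float = 0.07) -> int:
    """Cluster size that serves average demand at the target utilization,
    leaving headroom for maintenance windows and spare nodes."""
    usable_hours_per_gpu = 24 * 7 * (1 - downtime_fraction) * target_utilization
    base_gpus = avg_demand_gpu_hours_per_week / usable_hours_per_gpu
    return math.ceil(base_gpus * (1 + spare_fraction))

# e.g. a forecast of ~33,600 GPU-hours/week of average demand
print(gpus_needed(33_600))   # roughly 300 GPUs
```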
Networking for On-Prem: InfiniBand vs RoCE
The network choice for on-prem GPU clusters is InfiniBand (IB) vs RoCE (RDMA over Converged Ethernet). Both deliver RDMA (Remote Direct Memory Access) which is required for NCCL performance at scale.
InfiniBand HDR (200 Gbps per port):
- Purpose-built for HPC
- Lower latency than RoCE (~600 nanoseconds vs ~1-2 microseconds)
- Requires IB-specific switches (Mellanox/NVIDIA Quantum)
- Higher cost: IB HDR switch with 40 ports costs ~$50,000-80,000
- NCCL has native IB support, well-tested
RoCE v2 (25/100/200/400 Gbps via standard Ethernet):
- Uses standard Ethernet switches (Arista, Cisco, Juniper) with RDMA capability
- Requires Priority Flow Control (PFC) and ECN for lossless fabric
- Lower switch cost per port vs IB
- More operational complexity (configuring lossless Ethernet is non-trivial)
- AWS EFA is Ethernet-based (it uses AWS's own SRD transport rather than InfiniBand) - so workloads developed on AWS may have NCCL tuned for EFA
For most teams building new on-prem clusters in 2024-2025, InfiniBand NDR (400 Gbps) is the standard choice for high-end training. RoCE is viable and increasingly common as Ethernet speeds have caught up.
Data Egress Costs - The Hidden Cloud Cost
Data egress from cloud is expensive and often underestimated:
- AWS to internet: $0.09/GB (first 10TB/month)
- GCP to internet: $0.08/GB
- Azure to internet: $0.087/GB
- Cloud to cloud (cross-region): $0.02-0.08/GB
For a team training a 70B model:
- Model checkpoints: 140GB per checkpoint, 5 checkpoints kept = 700GB in cloud storage
- If you download checkpoints daily for evaluation on a separate system: 140GB * 30 days = 4.2TB/month = $378/month in egress from AWS alone
For a frontier model at 1TB+ with frequent checkpoint downloads: egress costs reach thousands of dollars per month. This is a real line item that should appear in your TCO analysis.
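The checkpoint example generalizes to a tiny estimator worth keeping next to the TCO model - the $0.09/GB rate is the AWS figure listed above; the artifact sizes and download cadences are examples:
```python
def monthly_egress_cost(artifact_gb: float, downloads_per_month: int,
                        rate_per_gb: float = 0.09) -> float:
    """Monthly cost of repeatedly pulling an artifact out of the cloud."""
    return artifact_gb * downloads_per_month * rate_per_gb

print(f"70B checkpoint (140GB), daily downloads: ${monthly_egress_cost(140, 30):,.0f}/month")    # ~$378
print(f"1TB checkpoint, daily downloads:         ${monthly_egress_cost(1_000, 30):,.0f}/month")  # ~$2,700
```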
Mitigation: do all processing within the same cloud region/provider. Only download final model artifacts. Use VPC endpoints for S3 access from EC2 (no egress charges for same-region S3 from EC2).
Build the Platform Team Before the Cluster
A common mistake: companies buy on-prem hardware before building the team to run it. The hardware arrives, sits at partial utilization because nobody has set up the job scheduler, monitoring, or storage system properly, and the economics immediately look bad.
The correct order: hire the platform engineering team 3-6 months before hardware arrives. Use that time to build and test the software stack on cloud instances. When the hardware arrives, you have a production-ready platform to deploy on day one.
Platform engineering requirements for a 256+ GPU on-prem cluster:
- 2-3 engineers for day-to-day operations, monitoring, and user support
- 1 senior engineer for platform architecture and Kubernetes/Slurm management
- On-call rotation for hardware failures and job stuck incidents
- Vendor support contracts for hardware (NVIDIA Enterprise, Dell EMC, etc.)
Common Mistakes
:::danger Do Not Compare Sticker Price - Compare Total Cost
The most common mistake in the build-vs-buy analysis: comparing AWS on-demand pricing against only the GPU hardware cost. On-prem infrastructure includes networking ($50k+/node), storage, power infrastructure, staff, and operational overhead. Ignoring these makes on-prem look unrealistically cheap. Build a complete model that includes ALL costs on both sides, and validate it against actual invoices from peer organizations before committing to a multi-million dollar hardware purchase.
:::
:::danger GPU Utilization Must Be Measured, Not Assumed
Organizations routinely overestimate their GPU utilization when making the on-prem case. "We run training jobs 18 hours a day" does not mean 75% GPU utilization - it means 75% of the time there is a job running, but that job might be at 40% GPU efficiency due to data pipeline bottlenecks. Measure actual GPU utilization with nvidia-smi metrics before extrapolating to TCO models.
:::
:::warning Spot Instances Are Not Free Reliability Risk
Spot instances can be interrupted. Without proper fault tolerance (TorchElastic + async checkpointing), a spot interruption wastes all progress since the last checkpoint. Before committing to a spot-heavy cost model, verify that your training code is actually fault-tolerant. Run a test: start a training job on spot instances and manually terminate one instance. Does it recover gracefully? Only base your cost model on spot pricing if it does.
:::
:::warning Hardware Generation Lock-In Is Real
When you buy an on-prem cluster, you are committing to that GPU generation for 3-5 years. If NVIDIA releases a GPU that is 3x more efficient (like H100 was vs A100), your competitors on cloud can migrate immediately. You are still running your A100 cluster. Model this hardware obsolescence risk in your break-even analysis. Cloud optionality has real value - it is not just a marketing claim. For organizations at the frontier of model development, this optionality may be worth the price premium.
:::
:::warning Reserved Instance Commitments Can Backfire
1-year and 3-year Reserved Instance (RI) commitments offer significant discounts but are non-cancellable (or carry steep early termination fees). If your GPU demand drops (project cancelled, team restructured, efficiency improvements reduce compute needs), you are still paying for the reserved capacity. Only reserve capacity that you are highly confident you will use. Under-reserve and use spot for uncertainty rather than over-reserve and pay for idle reservations.
:::
Interview Q&A
Q1: Walk me through how you would decide between on-premises GPU infrastructure and cloud for a team training a 7B parameter language model 3-4 times per year, with each run taking about 2 weeks.
This is a relatively low-utilization scenario. Four 2-week training runs per year on a 7B model (which needs perhaps 32-64 GPUs at reasonable throughput) works out to roughly 43,000-86,000 GPU-hours/year (64 GPUs * 14 days * 24 hours * 4 runs = 86,016 at the high end). On AWS p4d on-demand at $4.10/GPU-hour that is roughly $176k-353k/year; on spot or a specialty provider at ~$2.00/GPU-hour, roughly $86k-172k/year. An owned 64-GPU A100 cluster, by contrast, would run at roughly 15% utilization (8 weeks of work out of 52) - far below the ~40% break-even - and the hardware alone costs several hundred thousand dollars before power, facilities, and staff. For this workload, on-prem is clearly uneconomical. Use cloud (specifically CoreWeave or spot AWS) without hesitation.
Q2: What is the break-even utilization rate for on-prem vs cloud, and why does it matter for cluster sizing decisions?
Break-even utilization is the minimum fraction of time GPUs must be productively running for on-prem to cost less per GPU-hour than cloud. For H100 hardware (as modeled in this lesson), break-even with AWS 3-year reserved pricing is approximately 40-45% utilization. The implication for cluster sizing: if you size your cluster to your peak demand, average utilization will typically be 50-70% of peak. If your peak is 1000 GPUs but average is 400 GPU-hours/day, a 1000-GPU cluster runs at 40% average utilization - right at the break-even point. This is why shared clusters (multiple teams, multiple workloads) dramatically improve the economics: pooling demand smooths utilization curves and pushes average utilization well above break-even.
Q3: Explain the NVIDIA MIG feature and when you would use it for GPU scheduling in Kubernetes.
MIG (Multi-Instance GPU) partitions an A100 or H100 into multiple isolated GPU instances, each with dedicated HBM memory slices, L2 cache partitions, and compute engine partitions. It enables a single physical GPU to run multiple workloads with full hardware isolation (no memory sharing, no cross-process interference). In Kubernetes, MIG instances are exposed as separate resource types (nvidia.com/mig-3g.40gb, nvidia.com/mig-1g.10gb) that pods can request independently. MIG is valuable for inference serving (running multiple small models concurrently on one GPU), for fine-tuning with LoRA adapters (7B model + LoRA fits in 10-14GB), and for multi-tenant clusters where multiple teams need GPU access simultaneously without contention. MIG is not useful for large training runs where you need the full GPU (or multiple GPUs) - the partitioning reduces peak throughput per workload.
Q4: What is FinOps for GPU clusters and what are the most impactful FinOps practices for an AI team?
FinOps for GPU clusters is the practice of making cloud (or on-prem) GPU spend visible, accountable, and optimized. The most impactful practices in order of leverage: (1) Cost attribution via Kubernetes labels - without knowing which team or project is spending what, you cannot optimize. Tag every pod with team, project, and experiment labels. (2) Idle GPU detection - GPUs holding reservations without running useful compute are pure waste. Detect and reclaim them. (3) Right-sizing - match GPU type and count to actual requirements. Use A10G for inference instead of A100. (4) Spot instance adoption for fault-tolerant training - 60-90% savings with minimal risk if your code handles interruptions. (5) Reserved instance matching - reserve capacity for your predictable baseline, use spot for burst. These five practices together can reduce GPU spend by 40-60% with no degradation in training throughput.
Q5: Compare the networking options for on-prem GPU clusters: InfiniBand HDR vs RoCE v2. When would you choose each?
InfiniBand HDR provides 200 Gbps per port (NDR provides 400 Gbps) with ~600ns latency, purpose-built for HPC, with mature NCCL support. It requires IB-specific Mellanox/NVIDIA Quantum switches, which are expensive but reliable and operationally simpler (IB handles congestion natively). RoCE v2 delivers RDMA over standard Ethernet (25/100/200/400 Gbps depending on NIC/switch) with ~1-2 microsecond latency, using standard Ethernet switches but requiring careful lossless configuration (Priority Flow Control, Explicit Congestion Notification). For most new on-prem clusters focused on model training, InfiniBand NDR (400 Gbps) is the current standard. It has lower latency, better NCCL performance at scale, and simpler operations. RoCE makes sense when you already have substantial Ethernet infrastructure investment, when your workloads are less sensitive to collective communication latency (e.g., pipeline-parallel with coarse-grained communication), or when cost per port is the primary constraint.
Q6: How does Karpenter improve GPU scheduling economics on Kubernetes, and what are its limitations?
Karpenter autoscales Kubernetes nodes just-in-time based on pending pod requirements. For GPU workloads, it watches for pods requesting GPU resources that cannot be scheduled (no nodes with free GPUs), then automatically launches the right instance type in the right availability zone. This eliminates the need to pre-provision GPU nodes that sit idle waiting for jobs. Key benefits: (1) Spot instance integration - Karpenter can launch spot instances and fail over to on-demand if spot capacity is unavailable, optimizing cost automatically. (2) Bin packing - Karpenter tries to consolidate workloads to minimize the number of running nodes, terminating empty nodes. (3) Instance type flexibility - you can specify multiple instance types and Karpenter picks the cheapest available. Limitations: Karpenter has launch latency (~1-2 minutes to provision a new node), which is acceptable for training jobs but not for real-time inference. It cannot provision nodes faster than the cloud provider's instance launch time. For latency-sensitive workloads, pre-provisioned node pools with cluster autoscaler are still needed.
Q7: A company is spending $8M/year on AWS on-demand GPU instances. Their CFO is asking whether to build an on-prem cluster. What questions do you ask first?
The questions, in order of importance: (1) What is the current GPU utilization? (measure with nvidia-smi metrics, not job scheduling statistics - actual GPU compute utilization). Below 50% utilization changes the calculus significantly. (2) How predictable is the GPU demand? If it swings from 200 to 2000 GPU-hours/day, a fixed cluster is inefficient. (3) Is there a data sovereignty or regulatory requirement that mandates on-prem? If yes, the decision may be made for you. (4) What is the team's current operational capacity? Do they have platform engineers who can run a cluster? (5) What is the GPU generation outlook? Are they currently bottlenecked on H100 and expecting to need H200 or Blackwell in 18 months? (6) Have they tried CoreWeave or other specialty providers? An $8M/year on-demand AWS spend might drop to roughly $1.5-2M at CoreWeave with reservations, changing the break-even analysis entirely. Most companies spending $8M/year on AWS on-demand GPUs have not optimized cloud spend at all - the right first step is usually optimization (spot instances, reserved capacity, specialty providers), not an immediate jump to on-prem.
Summary
The cloud vs on-prem GPU decision has no universal answer - it depends on utilization rate, run duration, team operational capacity, data sovereignty requirements, and hardware generation risk. The frameworks in this lesson give you the tools to make the decision quantitatively rather than following intuition or vendor pitch decks.
The key variables: break-even utilization is around 40-45% for on-prem H100 vs AWS 3-year reserved pricing. Above that threshold, on-prem wins. Below it, cloud wins. Most teams making this decision do not know their actual GPU utilization - measure first, decide second.
The specialty cloud providers (CoreWeave, Lambda Labs) have materially changed the calculus. For teams that want cloud flexibility but not cloud pricing, these providers offer a middle path that makes the on-prem advantage much narrower.
Whatever infrastructure you choose: invest in FinOps practices from day one. Cost attribution, idle detection, and right-sizing consistently deliver 40-60% cost reductions with zero impact on research productivity. The most expensive GPU is an idle one.
