Choosing Custom Silicon vs GPUs

Reading time: ~40 min · Interview relevance: High · Target roles: ML Infrastructure Engineer, Engineering Manager, Principal Engineer

The right hardware is not always the fastest hardware. A chip that runs 2x faster but requires 6 months to port your codebase is slower in practice. Hardware selection is a systems problem that requires analyzing workload fit, ecosystem maturity, total cost of ownership, and team capability simultaneously - not just reading benchmark numbers.


Opening Scenario: The $50M Decision

A VP of Engineering at a mid-sized AI company receives a mandate: build the infrastructure to train a new 70B-parameter LLM and serve it at scale. The board has approved $50M for hardware over 3 years. She must decide: NVIDIA H100 cluster, Google TPU v4 Pods via GCP, AWS Trainium 2, Intel Gaudi 2, or some combination.

Her first instinct is to look at benchmark numbers. NVIDIA publishes 3.9 PFLOPS for H100 FP8. Google claims TPU v4 outperforms H100 on transformer training workloads. AWS claims 50% cost savings versus comparable GPU instances for Trainium 2.

She forwards the benchmarks to her head of infrastructure, who returns with a two-word response: "It's complicated."

The benchmarks are best-case numbers on vendor-selected workloads with vendor-optimized code. The H100 FLOPS number assumes Tensor Core utilization close to 100%, which requires careful batch sizing and sequence length tuning. The TPU claim uses Google's JAX framework on a configuration her team has no experience with. The Trainium cost savings assume the team can port their PyTorch training code to Neuron SDK, which has incomplete operator coverage and a steep learning curve.

Her real question is not "which chip is fastest?" It is "which chip will let our team ship the model on time and on budget?" These are different questions with different answers. This lesson gives you the framework to answer both.


Why Benchmark Numbers Lie

Before the framework, understand the problem with vendor benchmarks.

Peak vs. achieved performance - Peak FLOPS numbers are achievable only under ideal conditions: perfectly sized tensors, 100% hardware utilization, no memory bottlenecks, no communication overhead. Real training workloads typically achieve 30-60% of theoretical peak on GPUs, and less on less-mature hardware. A chip with 2x the FLOPS on paper may deliver 1.2x in practice.

The operator coverage problem - Custom silicon vendors (Trainium, Gaudi, Groq) support the most common neural network operations efficiently, but have gaps for less common operators. If your model uses a custom attention variant, a non-standard normalization layer, or a novel MoE routing algorithm, those operators may fall back to a software emulation path that is 10-100x slower than the hardware path. One unsupported operator in the critical path can negate all the hardware advantages.

Communication is always the bottleneck at scale - Single-chip benchmarks measure compute throughput. Multi-chip training benchmarks measure compute plus communication. The chip-to-chip interconnect (NVLink, Google ICI, AWS EFA, Gaudi's integrated RoCE Ethernet) determines how well the hardware scales. A chip that is 2x faster on a single device but 50% slower at 512-chip scale has negative value for large model training.

Software maturity multiplies hardware capability - NVIDIA has 15 years of CUDA ecosystem development. The libraries (cuDNN, cuBLAS, Megatron-LM, DeepSpeed, FlashAttention) are battle-tested, highly optimized, and constantly improving. Alternative hardware vendors have software that is 1-5 years old, often with rough edges, limited documentation, and smaller debugging communities. The hardware may be 20% faster on a benchmark but 40% slower in production because the software is less optimized.


The Five-Dimension Decision Framework

Evaluating AI hardware requires examining five dimensions simultaneously. No dimension is sufficient alone.

Dimension 1: Workload Fit

The first question is whether the hardware architecture matches your computation pattern.

Training vs. Inference - These have different hardware requirements. Training requires: high FP16/BF16 throughput for forward/backward passes, large memory for activations and optimizer states (Adam needs 12 bytes per parameter: 4 params + 4 momentum + 4 variance), fast all-reduce communication for gradient synchronization. Inference requires: low latency for single-query serving, high throughput for batch serving, memory bandwidth for loading KV cache and model weights.

Model architecture and size - Transformer models are dominated by GEMM (matrix multiplication) operations, which map well to all modern AI accelerators. Convolutional models have more structured memory access patterns. Recurrent models (LSTM, S4) have sequential dependencies that limit parallelism. Mixture-of-Experts (MoE) models have dynamic routing with load imbalance that challenges static-allocation hardware.

Model size determines memory requirements. Rule of thumb: a model with $P$ parameters requires:

  • Inference: $P \times 2$ bytes in FP16 or $P \times 1$ byte in INT8
  • Training with AdamW: $P \times 16$ bytes (FP16 params + FP32 master copy + Adam states)

A 70B-parameter model therefore requires 140GB for FP16 inference and roughly 1.1TB for AdamW training. An 8x H100 node (640GB aggregate HBM3) holds the inference weights comfortably, but the full training state exceeds even that node and forces model parallelism or sharded optimizer states.
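These rules of thumb are worth encoding once; a minimal sketch using the byte counts above:

```python
def model_memory_gb(params_billions: float, mode: str = "inference_fp16") -> float:
    """Rule-of-thumb memory footprint using the byte-per-parameter counts above."""
    bytes_per_param = {
        "inference_fp16": 2,    # weights only, 2 bytes each
        "inference_int8": 1,
        "training_adamw": 16,   # FP16 params + FP32 master copy + Adam m/v states
    }[mode]
    # params_billions * 1e9 params * bytes / 1e9 bytes-per-GB = params_billions * bytes
    return params_billions * bytes_per_param

for mode in ("inference_fp16", "inference_int8", "training_adamw"):
    print(f"70B {mode}: {model_memory_gb(70, mode):,.0f} GB")
# 140 GB, 70 GB, 1,120 GB
```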

Batch size sensitivity - Some hardware (TPUs, Groq) performs best at specific batch sizes. TPU matrix units are designed for large, regular tensor operations. Groq's LPU architecture is optimized for batch size 1. Running Groq at batch 64 does not improve throughput linearly. Know the batch size your production workload requires before selecting hardware.

Dimension 2: Performance

The performance metrics that actually matter, in order of importance:

Memory bandwidth - For transformer inference, the bottleneck is almost always loading model weights from HBM, not compute. A model with 70B parameters at FP16 requires loading 140GB of weights per forward pass. H100 HBM3 bandwidth is 3.35 TB/s. Loading 140GB takes 42 milliseconds - that is your minimum per-token latency at batch 1, regardless of compute speed. Chips with higher memory bandwidth win here.

HBM capacity - For serving large models, whether the model fits in a single device's HBM determines whether you need tensor parallelism (expensive in communication cost). H100 SXM5: 80GB HBM3. H100 NVL (2-chip module): 188GB. A100 80GB: 80GB HBM2e. TPU v4: 32GB HBM. TPU v5e: 16GB HBM. For a 70B FP16 model at 140GB, you need at least 2x H100s, and no single TPU chip comes close to holding it.

Compute throughput - For training, BF16 TFLOPS on the matrix compute units. H100 SXM5: 989 TFLOPS BF16. A100 SXM4: 312 TFLOPS BF16 (dense). Intel Gaudi 2: 432 TFLOPS BF16. AWS Trainium 2: 3,800 TFLOPS BF16 per chip (with 2 NeuronCores). TPU v4: 275 TFLOPS BF16 per chip.

All-reduce bandwidth - For distributed training, the inter-chip communication throughput. H100 NVLink 4.0: 900 GB/s bidirectional per GPU. TPU v4 ICI: 1,200 GB/s per chip. Intel Gaudi 2: 24x 100GbE RoCE ports (2.4 Tb/s, i.e., 300 GB/s total). Communication bandwidth determines strong scaling efficiency.

Dimension 3: Total Cost of Ownership

TCO over 3 years is the correct financial metric. Single-point comparisons (hourly rate, purchase price) miss the full picture.

The TCO formula:

$$\text{TCO} = C_{\text{hardware}} + (P_{\text{watts}} \times H_{\text{hours}} \times R_{\text{power}}) + C_{\text{cooling}} + C_{\text{staff}} + C_{\text{software}}$$

Where:

  • $C_{\text{hardware}}$ = purchase price or 3-year cloud reservation cost
  • $P_{\text{watts}}$ = system power draw in watts
  • $H_{\text{hours}}$ = operating hours over 3 years (26,280 hours at 100% utilization)
  • $R_{\text{power}}$ = power cost per kWh (typically $0.08-0.15 in US datacenters)
  • $C_{\text{cooling}}$ = typically 40-50% of power cost (PUE of 1.4-1.5)
  • $C_{\text{staff}}$ = engineer time for porting, maintenance, debugging
  • $C_{\text{software}}$ = licensing for enterprise frameworks, support contracts

H100 SXM5 on-premise 3-year TCO (per GPU):

| Component | Cost |
|---|---|
| Hardware purchase (H100 SXM5) | $30,000 |
| Power (700W, 3 years, $0.10/kWh) | $1,843 |
| Cooling (PUE 1.45 overhead) | $830 |
| NVLink switch infrastructure (amortized) | $2,500 |
| Staff time (minimal: team knows CUDA) | $0 incremental |
| Total per GPU, 3 years | ~$35,000 |
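As a check on the power rows: 0.7 kW × 26,280 h × $0.10/kWh ≈ $1,840, and cooling at PUE 1.45 adds (1.45 - 1) × $1,840 ≈ $830 on top.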

TPU v4 on GCP (per chip-year, on-demand):

| Configuration | Cost |
|---|---|
| On-demand price | $3.22/chip-hour |
| 1 year at 60% utilization | ~$16,900 per chip |
| 3 years at 60% utilization | ~$50,700 per chip |
| 3-year committed-use rate | ~$1.35/chip-hour |
| 3 years reserved | ~$35,500 per chip |
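(Arithmetic: $3.22/chip-hour × 8,760 h × 0.60 ≈ $16,900 per chip-year. The reserved figure assumes the committed rate is paid for every hour: $1.35 × 26,280 h ≈ $35,500.)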

TPU v4 on reserved pricing becomes cost-competitive with H100 on-premise, but the TPU chip provides less HBM (32GB vs 80GB) and less single-chip throughput. You need more TPU chips to match H100 capacity. The comparison is not chip-to-chip but system-to-system for your target throughput.

AWS Trainium 2 vs. equivalent p4d.24xlarge (8x A100) on-demand:

  • p4d.24xlarge (8x A100 40GB): $32.77/hour ≈ $287,000/year at continuous use
  • trn2.48xlarge (16x Trainium 2 chips): $55.44/hour
  • Trainium 2 claims roughly 2x A100 throughput per chip, so 16 Trn2 chips ≈ 32 A100s of compute
  • Matching that capacity with p4de.24xlarge (8x A100 80GB): $40.96/hour × 4 nodes ≈ $1,435,000/year
  • One trn2.48xlarge at $55.44/hour ≈ $486,000/year against ~$1,435,000/year for the matching A100 fleet - roughly a 65% saving at list price, if the 2x throughput claim holds

The savings are real - but only materialize if porting costs and reduced engineer productivity are not greater than the savings.

Dimension 4: Ecosystem Maturity

This is the most underrated dimension and the most common source of surprises.

PyTorch operator coverage - PyTorch has approximately 1,800 built-in operators. Hardware backends implement subsets of these. NVIDIA CUDA through PyTorch: ~100% coverage via native CUDA kernels or ATen fallback. AMD ROCm: ~95% coverage (a few ops lack optimized kernels). Intel Gaudi 2 (via Habana SynapseAI/PyTorch bridge): ~85% coverage, with some complex dynamic operations falling back to CPU. AWS Trainium 2 (via Neuron SDK): ~80% for standard transformer ops, with significant gaps for non-standard architectures.

The practical implication: the 15-20% gap in operator coverage may include exactly the operator your model uses. Before committing to any non-NVIDIA hardware, run your model in eager mode on the target hardware and check every operator. If even one operator in the forward pass falls back to CPU emulation, profile how much it costs.
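Before committing, you can take that operator inventory mechanically: log every ATen operator the model dispatches during one step on hardware you already have, then diff the list against the vendor's supported-operator documentation. A minimal sketch using PyTorch's TorchDispatchMode (a developer-facing API; the example layer stands in for your real model):

```python
from collections import Counter

import torch
from torch.utils._python_dispatch import TorchDispatchMode


class OpLogger(TorchDispatchMode):
    """Records every ATen operator dispatched while the mode is active."""
    def __init__(self):
        super().__init__()
        self.ops = Counter()

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        self.ops[str(func)] += 1                 # e.g. "aten.addmm.default"
        return func(*args, **(kwargs or {}))


model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
with OpLogger() as logger:
    model(torch.randn(2, 16, 512))

for op, count in logger.ops.most_common():
    print(f"{count:5d}  {op}")
# Cross-check this list against the backend's coverage docs (Neuron,
# SynapseAI, ROCm release notes) before trusting any "we support PyTorch" claim.
```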

Debugging toolchain maturity - When training diverges or inference returns wrong values, how do you debug it?

NVIDIA: Nsight Compute and Nsight Systems for kernel-level profiling, TensorBoard GPU plugin, extensive error messages in CUDA runtime, years of Stack Overflow answers. Intel Gaudi: SynapseAI Profiler with decent support. AWS Trainium: Neuron Profile with reasonable support for standard models. AMD ROCm: ROCm Profiler, improving rapidly. Groq: no distributed training, limited profiling tools (inference-only).

Community size - Community size predicts the quality of help you get when blocked. Search for "[chip name] training error" and count Stack Overflow answers, GitHub issues, and blog posts. CUDA: hundreds of thousands. JAX TPU: tens of thousands. Habana Gaudi: thousands. Neuron (Trainium): hundreds.

Framework-level support summary:

| Hardware | PyTorch native | TensorFlow | JAX | Megatron-LM | DeepSpeed |
|---|---|---|---|---|---|
| NVIDIA GPU | Full | Full | Partial | Full | Full |
| AMD ROCm | ~95% | ~90% | Partial | Partial | Partial |
| Google TPU | Via XLA | Full | Full | Google-internal | Limited |
| AWS Trainium | Via Neuron | Via Neuron | No | Partial port | Limited |
| Intel Gaudi | Via SynapseAI | Via SynapseAI | No | Gaudi port | Partial |
| Groq | Inference only | Inference only | No | No | No |

Dimension 5: Team Capability and Switching Cost

The final dimension is human: what does it cost your team to switch?

The CUDA expertise tax - Most ML engineers have years of experience with CUDA: understanding memory layouts, debugging CUDA errors, knowing which operations to fuse, understanding when to use flash attention. This expertise does not transfer to other hardware. Switching to TPUs requires learning XLA compilation, JIT tracing constraints, and JAX's functional programming model. Switching to Trainium requires learning the Neuron compiler's model partitioning, debugging Neuron compilation errors, and managing the gaps in operator support.

A conservative estimate: expect a 6-month productivity penalty when transitioning a team of 5 ML engineers to new hardware. At $200k average all-in engineer cost, that is roughly $500k in reduced productivity per transition. This cost is real and rarely appears in vendor cost comparison documents.

The model porting checklist:

  1. Run the model in PyTorch eager mode on the new hardware; identify all operator fallbacks
  2. Profile end-to-end throughput vs. baseline; verify the claimed speedup materializes
  3. Verify numerical accuracy: does training loss converge identically? Does inference output match? (A minimal parity-check sketch follows this list.)
  4. Test distributed training at scale: does multi-chip training converge correctly with your gradient synchronization method?
  5. Test failure handling: what happens when one chip fails? Does the training framework handle it gracefully?
  6. Verify monitoring and observability: can you get the metrics your on-call team needs?
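For step 3, a forward-pass parity check between a reference device and the candidate backend might look like the sketch below; the "cuda" string is a placeholder for whatever device the vendor's PyTorch bridge registers ("hpu", "xla", and so on):

```python
import torch


@torch.no_grad()
def max_abs_diff(model: torch.nn.Module, x: torch.Tensor,
                 ref_device: str = "cpu", test_device: str = "cuda") -> float:
    """Max elementwise gap between two backends for one forward pass."""
    ref = model.to(ref_device)(x.to(ref_device))
    test = model.to(test_device)(x.to(test_device))
    return (ref - test.to(ref_device)).abs().max().item()


layer = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True).eval()
print(f"max |ref - test| = {max_abs_diff(layer, torch.randn(2, 16, 512)):.3e}")
# Gaps far beyond dtype epsilon (say, >1e-2 against an FP32 reference) warrant
# investigation before committing to any multi-week training run.
```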

Comprehensive Hardware Comparison Matrix

| Hardware | Peak BF16 TFLOPS | Memory (GB) | Mem BW (TB/s) | TDP (W) | Best for |
|---|---|---|---|---|---|
| NVIDIA H100 SXM5 | 989 | 80 | 3.35 | 700 | Large model training, inference |
| NVIDIA H100 NVL | 835 | 94 | 3.9 | 400 | Large model inference (2-chip) |
| NVIDIA A100 SXM4 | 312 | 80 | 2.0 | 400 | General training/inference |
| NVIDIA RTX 4090 | 330 | 24 | 1.0 | 450 | Research, budget training |
| Google TPU v4 | 275 | 32 | 1.2 | 170 | JAX/XLA training, Google ecosystem |
| Google TPU v5e | 393 | 16 | 0.82 | 200 | High-throughput inference |
| AWS Trainium 2 | 3,800 (per chip) | 96 | 6.4 | 700 | AWS-native LLM training |
| Intel Gaudi 2 | 432 | 96 | 2.46 | 600 | Cost-sensitive training |
| Intel Gaudi 3 | 1,835 | 128 | 3.7 | 900 | H100 alternative training |
| Groq LPU | ~750 | 0.23 (SRAM) | 80 | 300 | Lowest-latency inference |
| Cerebras CS-3 | 125,000 (sparse) | 44 (SRAM) | 21,000 | 23,000 | Sparse model training |
| Apple M3 Ultra | 32 (ANE) | 192 (unified) | 0.8 | 60 | On-device, small model, macOS |

Numbers are from vendor specifications or published benchmarks. Achieved performance varies.


Detailed Recommendation by Use Case

LLM Pretraining at > 70B Parameters

Winner: NVIDIA H100 cluster or Google TPU v4 Pod (JAX shops)

The reasoning: at this scale, software maturity is critical. Pretraining jobs at 70B+ run for weeks or months. Any instability from immature software - gradient divergence from floating-point differences, optimizer state corruption, checkpoint save/load bugs - wastes enormous compute and engineer time. NVIDIA's ecosystem is the most battle-tested at this scale. Megatron-LM, GPT-NeoX, and NeMo have been used at 175B+ parameter scale and the bugs are largely worked out.

TPU v4 Pods are the correct choice if your team already uses JAX and has experience with the XLA compilation model. Google DeepMind runs its largest training jobs on TPUs, and several external labs train at pod scale through GCP. If you are starting from PyTorch and the team has no JAX experience, the switching cost is too high.

H100 NVLink specifications for multi-node training: NVLink 4.0 provides 900 GB/s bidirectional per GPU for intra-node communication. For inter-node, you need HDR InfiniBand (200 Gb/s per port) or NDR (400 Gb/s per port). The choice between NVLink and InfiniBand configurations has significant cost implications.

LLM Fine-Tuning at 7-70B Parameters

Winner: A100 or H100 (performance-first), Gaudi 2 (cost-conscious), RTX 4090 (tight budget)

Fine-tuning is shorter than pretraining, so the risk of software bugs causing multi-week wasted runs is lower. This opens the door to alternative hardware.

Intel Gaudi 2 is worth serious consideration here. At roughly $6-8/hour for an 8x Gaudi 2 system (pricing varies by cloud provider), it is approximately 30-40% cheaper than p4de instances (8x A100 80GB) at comparable throughput for transformer fine-tuning. Intel has invested heavily in Gaudi's PyTorch integration and the operator coverage for standard transformer architectures is solid. The risk: non-standard architectures, custom operators, and cutting-edge techniques (FlashAttention-3, ring attention) may not be supported.

RTX 4090 for researchers: the RTX 4090 has 24GB VRAM (limiting but workable with 4-bit quantization for 7B models), 330 TFLOPS BF16, and costs $1,600. For researchers fine-tuning 7B models locally, it is the best value-per-dollar in the market. Use QLoRA + bitsandbytes 4-bit quantization and a 7B model fits in 12GB.

High-Throughput LLM Inference

Winner: H100 (maximum scale), AWS Inferentia 2 (cost-efficient), A10G (balanced)

High-throughput inference (serving thousands of requests per minute) is dominated by memory bandwidth (loading KV cache and weights) and compute (attention and FFN). The KV cache problem grows with sequence length and batch size: at 2k context length with batch 64, a 7B model's KV cache alone is 14GB.
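KV-cache size follows directly from the attention layout. A back-of-envelope sizing helper, assuming a grouped-query 7B configuration (32 layers, 8 KV heads of dimension 128 - a Mistral-style assumption, not a figure from this section):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, dtype_bytes: int = 2) -> float:
    # Two tensors (K and V) per layer, each [batch, n_kv_heads, seq_len, head_dim]
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes / 1e9

print(f"{kv_cache_gb(32, 8, 128, 2048, 64):.1f} GB")  # ~17 GB with this layout
# Full multi-head attention (32 KV heads) would be 4x larger, which is why
# grouped-query models dominate high-throughput serving.
```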

AWS Inferentia 2 is compelling for AWS-native deployments. At roughly $1-2/hour for a single-chip inf2 instance, it is 3-4x cheaper than equivalent GPU instances for supported model architectures (Llama, Mistral, standard transformers). The Neuron compiler generates highly optimized inference graphs for these models. The caveat: if your model uses non-standard operators or you need to modify the model architecture frequently, Neuron compilation time (minutes to tens of minutes) and operator gaps will cause pain.

A10G (NVIDIA) at roughly $1/hour on AWS g5.xlarge provides a good balance: CUDA ecosystem for flexibility, decent memory bandwidth (600 GB/s), 24GB VRAM for 7B models in FP16.

Low-Latency LLM Inference (< 10ms per token)

Winner: Groq (lowest latency), H100 with vLLM (practical)

Groq's Language Processing Unit (LPU) is purpose-built for sequential token generation. At batch size 1, Groq generates 500+ tokens/second for Llama 3 70B - far exceeding H100's ~80 tokens/second at batch 1. The architectural reason: Groq uses on-chip SRAM (230MB per chip) instead of HBM, with 80 TB/s internal bandwidth. Weights are replicated across chips so each forward pass reads from on-chip memory with no DRAM latency.

The limitation: Groq is inference-only, supports a fixed model list (you cannot upload arbitrary models), and the per-token throughput does not scale with batching the way GPU throughput does. For use cases requiring low latency for individual users (chatbots, coding assistants, interactive applications), Groq is the correct choice if your model is on their supported list.

For teams that need H100-level flexibility with competitive latency: vLLM with continuous batching on H100 achieves 100-150 tokens/second per stream at batch sizes of 10-20, with 5-10ms inter-token latency. This is the production default for most LLM serving companies.

On-Device / Edge Inference

Winner: Apple M-series (macOS/iOS), Snapdragon NPU (Android), Hailo-8 (embedded)

Apple Silicon (M3 series and M4) runs Core ML models on the Apple Neural Engine (ANE) at up to 38 TOPS on the M4's 16-core ANE. The ANE is optimized for INT8 and FP16 quantized models. Combined with 192GB of unified memory on the M3 Ultra (shared between CPU, GPU, and ANE), Apple Silicon enables running 70B models at acceptable speeds on desktop hardware - something no competing mobile/desktop platform currently matches.

For Android deployment: Qualcomm Snapdragon 8 Gen 3 with the Hexagon NPU at ~45 TOPS. Use Qualcomm's AI Engine SDK or ONNX Runtime with the QNN execution provider.

For embedded/industrial: Hailo-8 at 26 TOPS and ~2.5W in a standard M.2 module. Designed for camera-based applications (object detection, segmentation) at the edge with no cloud dependency.


TCO Calculator: Python Implementation

"""
AI Hardware TCO Calculator
Computes 3-year total cost of ownership for AI accelerator deployments.
"""

from dataclasses import dataclass, field
from typing import Optional
import math


@dataclass
class HardwareSpec:
"""Hardware specification for a single accelerator."""
name: str
bf16_tflops: float # BF16 tensor core TFLOPS
hbm_gb: float # HBM capacity in GB
hbm_bandwidth_tbs: float # HBM bandwidth in TB/s
tdp_watts: float # Thermal design power in watts
purchase_price_usd: float # On-premise purchase price per unit
cloud_hourly_usd: float # Cloud on-demand hourly rate per unit (0 if on-prem only)
available_cloud: bool = True


@dataclass
class WorkloadSpec:
"""Workload characterization for TCO analysis."""
name: str
model_params_billions: float
training: bool # True = training, False = inference
utilization_fraction: float # Target hardware utilization (0.0-1.0)
batch_size: int
required_memory_gb: float # Minimum HBM required per unit
hours_per_year: float = 8760.0 # Default: continuous operation


@dataclass
class DeploymentConfig:
"""Deployment configuration parameters."""
power_cost_per_kwh: float = 0.10
pue: float = 1.45 # Power Usage Effectiveness (cooling overhead)
analysis_years: int = 3
on_premise: bool = True
team_porting_months: float = 0.0 # Months of engineer time to port codebase
engineer_monthly_cost_usd: float = 18_000 # All-in monthly cost per engineer
engineers_on_project: int = 5


@dataclass
class TCOResult:
"""Result of a TCO analysis."""
hardware: HardwareSpec
units_required: int
hardware_cost: float
power_cost_3yr: float
cooling_cost_3yr: float
porting_cost: float
total_tco: float
cost_per_tflop_achieved: float
notes: list = field(default_factory=list)

def summary(self) -> str:
lines = [
f"\n{'='*60}",
f"Hardware: {self.hardware.name}",
f"Units required: {self.units_required}",
f"{'='*60}",
f"Hardware cost: ${self.hardware_cost:>12,.0f}",
f"Power cost (3yr): ${self.power_cost_3yr:>12,.0f}",
f"Cooling cost (3yr): ${self.cooling_cost_3yr:>12,.0f}",
f"Porting cost: ${self.porting_cost:>12,.0f}",
f"{'─'*40}",
f"TOTAL TCO (3yr): ${self.total_tco:>12,.0f}",
f"Cost/TFLOP achieved: ${self.cost_per_tflop_achieved:>12,.4f}",
]
if self.notes:
lines.append("\nNotes:")
for note in self.notes:
lines.append(f" - {note}")
return "\n".join(lines)


class TCOCalculator:
def __init__(self, config: DeploymentConfig):
self.config = config

def calculate(
self,
hardware: HardwareSpec,
workload: WorkloadSpec,
hardware_utilization: float = 0.45, # typical achieved utilization
) -> TCOResult:
"""
Calculate 3-year TCO for a hardware/workload combination.

hardware_utilization: fraction of peak TFLOPS actually achieved
(0.45 is typical for well-tuned GPU training; use 0.3 for
immature software stacks)
"""
cfg = self.config
notes = []

# --- Units required ---
# How many chips needed to fit the model in memory?
units_for_memory = math.ceil(
workload.required_memory_gb / hardware.hbm_gb
)
# Round up to power of 2 for clean tensor parallelism
units_required = max(1, 2 ** math.ceil(math.log2(units_for_memory)))
if units_required > units_for_memory:
notes.append(
f"Rounded up from {units_for_memory} to {units_required} "
f"units for clean tensor parallelism degree"
)

# --- Hardware cost ---
if cfg.on_premise:
hardware_cost = hardware.purchase_price_usd * units_required
else:
# Cloud: compute cost over 3 years at target utilization
cloud_hours = (
workload.hours_per_year
* cfg.analysis_years
* workload.utilization_fraction
)
hardware_cost = (
hardware.cloud_hourly_usd
* units_required
* cloud_hours
)
notes.append(
f"Cloud cost at {workload.utilization_fraction:.0%} utilization, "
f"{workload.hours_per_year:.0f} hr/year"
)

# --- Power cost (on-premise only) ---
if cfg.on_premise:
total_watts = hardware.tdp_watts * units_required
kwh_per_year = total_watts * workload.hours_per_year / 1000
power_cost_per_year = kwh_per_year * cfg.power_cost_per_kwh
power_cost_3yr = power_cost_per_year * cfg.analysis_years
# Cooling adds (PUE - 1) fraction on top of IT power
cooling_cost_3yr = power_cost_3yr * (cfg.pue - 1.0)
else:
# Cloud: power included in hourly rate
power_cost_3yr = 0.0
cooling_cost_3yr = 0.0
notes.append("Power and cooling included in cloud hourly rate")

# --- Porting / switching cost ---
porting_cost = (
cfg.team_porting_months
* cfg.engineer_monthly_cost_usd
* cfg.engineers_on_project
)
if porting_cost > 0:
notes.append(
f"Porting cost assumes {cfg.team_porting_months:.1f} months "
f"x {cfg.engineers_on_project} engineers"
)

# --- Total TCO ---
total_tco = (
hardware_cost
+ power_cost_3yr
+ cooling_cost_3yr
+ porting_cost
)

# --- Cost efficiency: $ per TFLOP-year achieved ---
achieved_tflops = (
hardware.bf16_tflops
* hardware_utilization
* units_required
)
cost_per_tflop = total_tco / (achieved_tflops * cfg.analysis_years)

return TCOResult(
hardware=hardware,
units_required=units_required,
hardware_cost=hardware_cost,
power_cost_3yr=power_cost_3yr,
cooling_cost_3yr=cooling_cost_3yr,
porting_cost=porting_cost,
total_tco=total_tco,
cost_per_tflop_achieved=cost_per_tflop,
notes=notes,
)


# --- Hardware definitions ---
HARDWARE_CATALOG = {
"h100_sxm5": HardwareSpec(
name="NVIDIA H100 SXM5",
bf16_tflops=989,
hbm_gb=80,
hbm_bandwidth_tbs=3.35,
tdp_watts=700,
purchase_price_usd=30_000,
cloud_hourly_usd=4.10, # AWS p5.48xlarge / 8 GPUs
),
"a100_80gb": HardwareSpec(
name="NVIDIA A100 80GB SXM4",
bf16_tflops=312,
hbm_gb=80,
hbm_bandwidth_tbs=2.0,
tdp_watts=400,
purchase_price_usd=12_000,
cloud_hourly_usd=3.26, # AWS p4de.24xlarge / 8 GPUs
),
"gaudi2": HardwareSpec(
name="Intel Gaudi 2",
bf16_tflops=432,
hbm_gb=96,
hbm_bandwidth_tbs=2.46,
tdp_watts=600,
purchase_price_usd=8_500,
cloud_hourly_usd=2.15, # AWS dl2q.24xlarge estimated / 8 cards
),
"trainium2": HardwareSpec(
name="AWS Trainium 2",
bf16_tflops=3_800,
hbm_gb=96,
hbm_bandwidth_tbs=6.4,
tdp_watts=700,
purchase_price_usd=0, # Cloud-only
cloud_hourly_usd=3.47, # trn2.48xlarge / 16 chips
available_cloud=True,
),
"rtx_4090": HardwareSpec(
name="NVIDIA RTX 4090",
bf16_tflops=330,
hbm_gb=24,
hbm_bandwidth_tbs=1.0,
tdp_watts=450,
purchase_price_usd=1_600,
cloud_hourly_usd=2.10,
),
}


def run_comparison(workload_name: str, model_params_b: float, memory_per_unit_gb: float):
"""Run a TCO comparison for a given workload across hardware options."""

workload = WorkloadSpec(
name=workload_name,
model_params_billions=model_params_b,
training=True,
utilization_fraction=0.7,
batch_size=64,
required_memory_gb=memory_per_unit_gb,
hours_per_year=8_760,
)

# Baseline: NVIDIA team, no porting cost
baseline_config = DeploymentConfig(
on_premise=True,
team_porting_months=0,
engineers_on_project=5,
)

# Alternative hardware: 6 months of porting time
alt_config = DeploymentConfig(
on_premise=True,
team_porting_months=6,
engineers_on_project=5,
engineer_monthly_cost_usd=18_000,
)

calc_baseline = TCOCalculator(baseline_config)
calc_alt = TCOCalculator(alt_config)

print(f"\n{'#'*60}")
print(f"Workload: {workload_name} ({model_params_b}B params)")
print(f"Memory per unit required: {memory_per_unit_gb}GB")
print(f"{'#'*60}")

for hw_name, hardware in HARDWARE_CATALOG.items():
if hardware.purchase_price_usd == 0:
continue # Skip cloud-only for on-premise comparison
cfg = baseline_config if hw_name in ("h100_sxm5", "a100_80gb") else alt_config
calc = TCOCalculator(cfg)
result = calc.calculate(hardware, workload)
print(result.summary())


# Example: 70B model fine-tuning comparison
# 70B FP16 = 140GB, need 2x 80GB GPUs or more for training with optimizer states
run_comparison(
workload_name="70B LLM Fine-tuning",
model_params_b=70,
memory_per_unit_gb=140, # FP16 inference; training needs 4-6x more
)


Case Studies: How the Giants Choose

Meta (Custom MTIA) - Meta deployed their Meta Training and Inference Accelerator (MTIA) in 2023 for inference on recommendation models. The key insight: Meta's recommendation models are highly irregular (sparse embedding lookups, dynamic shapes) and do not map well to GPU tensor cores. MTIA was designed specifically for this access pattern. Meta still uses NVIDIA GPUs for LLM training. The lesson: custom silicon makes sense when your workload is significantly different from what GPUs are optimized for, and your scale justifies the ASIC investment.

Microsoft (Maia 100) - Microsoft announced its Maia 100 AI accelerator in 2023 for Azure cloud workloads. Built on TSMC 5nm, targeting training and inference for large language models. Microsoft's motivation: reduce dependence on NVIDIA supply chain, improve margins on Azure AI services. The Maia 100 will run alongside NVIDIA hardware, not replace it - a hedge rather than a full bet.

Google (Internal TPU) - Google has used TPUs for all major internal training since TPU v2 in 2017. Gemini 1.5 and Gemini 2.0 were trained on TPU v5. The key enabler: Google built JAX as a first-class TPU framework and all research code is written in JAX. The XLA compiler optimizes aggressively for TPU hardware. External teams cannot easily replicate this because they would have to absorb a multi-year JAX migration first.

Amazon (Trainium/Inferentia) - Amazon operates at scale where AWS margins on GPU instances matter. Trainium 2 runs internally for Amazon Alexa and Titan model training. Inferentia 2 runs production inference for Amazon's recommendation and search models. Amazon's advantage: captive workloads where they can invest 12+ months in optimization without worrying about external customer compatibility. The public Neuron SDK is a secondary product - the primary customer is Amazon's own AI teams.


Common Mistakes

:::danger Making the Decision Based on Benchmark Numbers Alone

Vendor benchmarks are constructed to show the best possible result for the vendor's hardware. A benchmark might show 3x faster training vs H100 - running a specific model architecture with vendor-optimized code, at an ideal batch size, on a task the hardware was explicitly designed for. Your model, at your batch size, with your team's code quality, on a randomly selected Monday morning when you are debugging a gradient divergence issue, will not reproduce that benchmark. Always benchmark your exact workload with your team's actual code before committing to a hardware platform.

:::

:::danger Ignoring the Porting Cost

The most common mistake in hardware selection: comparing cloud hourly rates without accounting for engineer time to port the codebase, the months of reduced productivity while the team learns the new toolchain, and the ongoing maintenance cost when new model architectures need to be ported again. A hardware platform that saves $2M/year in cloud costs but costs $1.5M/year in additional engineering overhead saves only $500k/year. The math often looks much worse once porting costs are included. Calculate porting cost as months × engineers × monthly_cost and add it to the 3-year TCO.

:::

:::warning Underestimating Operator Coverage Gaps

"We support PyTorch" does not mean "we support your PyTorch model." Most alternative hardware vendors support the 80% of PyTorch operators that cover 95% of standard model architectures. If your model uses FlashAttention-3, ring attention for long contexts, a custom mixture-of-experts router, or any non-standard layer, assume it is not supported until proven otherwise. The test: run your exact model in eager mode on the target hardware and check for operator fallbacks. Any operator that falls back to CPU emulation in the critical training path will dominate your runtime.

:::

:::warning Conflating Training and Inference Hardware Requirements

The optimal hardware for training a model is often not optimal for serving it. H100s are excellent for training but expensive for inference if you are memory-bandwidth-bound serving a 7B model to low-traffic endpoints. Conversely, Inferentia 2 is excellent for high-throughput inference but has no training capability. Evaluate training and inference hardware separately and do not assume the same chip is optimal for both.

:::


Presenting the Hardware Decision to Leadership

When a VP or CTO needs to approve a hardware selection:

Lead with outcomes, not specs - "We can ship the model 3 months earlier" lands better than "the H100 has 3.35 TB/s of HBM3 bandwidth." Connect hardware capability to business outcomes.

Show the 3-year TCO, not the sticker price - Hardware decisions look different over 3 years. A $30k H100 with mature software and zero porting cost can be cheaper than a $20k alternative that requires 6 months of porting and ongoing maintenance.

Present the risk matrix explicitly - Leadership deserves to know: "Option A costs 30% less but requires us to learn a new toolchain, and we have found operator coverage gaps for 3 of our model components. Option B costs more but our team can ship immediately." The risk discussion should be explicit, not buried.

Include a pilot plan for alternatives - Never recommend switching entirely from H100 to any alternative hardware without a pilot. Propose: "Run the alternative hardware for 2 months on workload X with team Y. If achieved throughput matches benchmarks within 20% and porting cost stays under $Z, expand to production." This de-risks the decision.


Interview Questions and Answers

Q1: A company wants to move LLM training from A100s to Trainium 2 to reduce cost. What analysis would you perform before making the recommendation?

A1: The analysis has four phases. Phase one: operator coverage audit. Run the training model in PyTorch eager mode on a Trainium 2 instance and identify every operator that falls back to CPU or shows a compilation warning. Any fallback in the forward or backward pass is a red flag requiring investigation. Phase two: benchmark on your actual workload. Not a synthetic benchmark - your model, your batch size, your sequence length, your optimizer, with your training loop. Measure achieved tokens/second or loss/compute, not peak FLOPS. Phase three: TCO calculation including porting cost. Trainium 2 requires Neuron SDK and its custom compiler. Estimate the engineering time to port, validate, and maintain the training pipeline. At $18k/month per engineer, 4 months of 3 engineers is $216k in porting cost that must be recovered by the cloud savings. Phase four: risk assessment. What happens if Trainium 2 support falls behind for your model architecture? What is the rollback plan?

Q2: Explain the difference between peak FLOPS and achieved FLOPS, and how this affects hardware comparisons.

A2: Peak FLOPS is the theoretical maximum throughput assuming 100% of compute units are busy doing useful work with perfect memory access patterns and no overhead. It requires the tensor core utilization to be near 1.0, which means matrices must be large enough to fill the hardware, no time spent on non-compute operations, and memory bandwidth does not limit compute. Achieved FLOPS in practice is 30-65% of peak for well-tuned GPU training workloads. For less mature hardware with less optimized software, it can be 20-40% of peak.

The comparison trap: if Hardware A peaks at 1,000 TFLOPS and achieves 50% utilization, and Hardware B peaks at 600 TFLOPS and achieves 65% utilization, they achieve 500 and 390 TFLOPS respectively - Hardware A is 28% faster despite being labeled "1.67x higher peak FLOPS." Always measure achieved throughput on your workload. The ratio of achieved to peak FLOPS is called MFU (Model FLOP Utilization) and well-optimized LLM training runs at 35-55% MFU on H100s with Megatron-LM.

Q3: How do you decide between on-premise hardware and cloud hardware for a new AI project?

A3: The decision hinges on utilization predictability and time horizon. Cloud is favorable when: the workload has variable utilization and you cannot commit to running hardware 24/7; you need hardware for a short-term project (under 18 months); you need to experiment with different hardware types before committing; or you need to scale rapidly and cannot wait for hardware procurement lead times. On-premise is favorable when: utilization will be above 60% continuously for 2+ years (the break-even point for most hardware); you have strict data residency requirements; you need the lowest possible latency to adjacent services; or you need custom network configurations (InfiniBand fabric, custom RoCE setups). The break-even calculation: on-premise TCO over 3 years divided by 26,280 hours gives a per-hour cost. If that is lower than the cloud reserved instance price, on-premise wins - assuming utilization is high.
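A sketch of that break-even arithmetic, using the ~$35k H100 TCO and ~$4/hour reserved rate that appear earlier in this lesson; with these particular numbers the crossover sits near 33% utilization, and it shifts with negotiated rates:

```python
def on_prem_hourly_cost(tco_3yr_usd: float, utilization: float = 1.0,
                        hours_3yr: float = 26_280) -> float:
    """Effective $/hour of owned hardware; idle hours inflate the rate."""
    return tco_3yr_usd / (hours_3yr * utilization)

CLOUD_RESERVED_RATE = 4.00  # $/hour, this lesson's working assumption
for util in (1.0, 0.6, 0.3):
    rate = on_prem_hourly_cost(35_000, util)
    verdict = "on-prem wins" if rate < CLOUD_RESERVED_RATE else "cloud wins"
    print(f"utilization {util:.0%}: ${rate:.2f}/hr effective -> {verdict}")
```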

Q4: What are the three factors that most determine whether a workload fits a non-GPU AI accelerator?

A4: First, operator coverage: does the non-GPU hardware support all operators in the model's computational graph with hardware-accelerated implementations? One fallback to CPU emulation in the critical path can negate all hardware advantages. Second, memory access pattern regularity: custom accelerators (TPUs, Trainium, Gaudi) are optimized for regular, predictable tensor operations. Workloads with dynamic shapes, variable sequence lengths, sparse operations, or complex control flow are harder to optimize on custom silicon. Third, scale and commitment: custom silicon requires significant engineering investment to achieve good utilization. This investment is worthwhile at scale (training runs of weeks on thousands of chips) but wasteful for small-scale, experimental workloads. A workload is a good fit for non-GPU accelerators when it uses standard transformer operations, has regular shapes, and will run for months without architectural changes.

Q5: How would you evaluate whether Google TPU v4 is a better choice than H100 for training a 70B parameter model?

A5: The evaluation requires assessing five factors. Framework alignment: does the team use JAX? If yes, TPUs are first-class citizens and the XLA compiler is heavily optimized for transformer training. If the codebase is PyTorch, the switching cost to JAX is substantial - budget 3-6 months of migration time. Memory configuration: TPU v4 has 32GB HBM per chip, versus H100's 80GB. Training a 70B model requires at least 4-8 TPU v4 chips just for memory, versus 2-4 H100s. This changes the communication topology and cost structure. Communication interconnect: TPU v4's ICI fabric provides 1,200 GB/s per chip in a 3D torus topology, which excels at the all-reduce operations needed for data parallel training at large scale. H100 NVLink is 900 GB/s but only within a single node; inter-node uses InfiniBand at ~400 Gb/s per port. For very large pod-scale training, TPU interconnect may win. TCO: at ~$1.35/chip-hour reserved versus H100 at ~$4/GPU-hour reserved, TPU v4 can be 3x cheaper per chip - but needs 2-4x more chips for the same memory capacity. Net cost comparison requires full system-level analysis. Ecosystem risk: if JAX goes out of favor or TPU v4 is deprecated before training completes, what is the contingency?

Q6: Cerebras CS-3 is advertised at 125,000 TFLOPS. Why would anyone use H100 at 989 TFLOPS instead?

A6: The Cerebras CS-3 peak TFLOPS are achieved on highly sparse computations. The chip contains a wafer-scale 900,000-core processor with 44GB of on-chip SRAM. For dense, standard transformer training at typical sparsity levels (under 50%), achieved throughput is much lower than 125,000 TFLOPS. More importantly, Cerebras CS-3 has extremely limited model support. It requires running on Cerebras hardware with the Cerebras software stack, which supports specific model architectures. Custom models require significant porting effort with Cerebras engineering involvement. The chip costs roughly $2-3M per unit and requires custom liquid cooling infrastructure. The typical customer is a national laboratory or hyperscaler running specific sparse-model workloads at extreme scale. For standard LLM training, the H100's mature ecosystem, standard rack deployment, and proven toolchain make it dramatically more practical despite the nominal TFLOPS disadvantage. The lesson: peak FLOPS specifications that are 100x higher than competition should trigger skepticism, not excitement.


Memory Bandwidth: Why It Dominates Inference More Than Compute

For large language model inference specifically, the hardware metric that matters most is often memory bandwidth, not compute TFLOPS. Understanding why changes how you evaluate hardware options.

During LLM inference, the compute pattern is severely memory-bound for small batch sizes. For a single token generation step, every weight in every layer must be loaded from HBM exactly once. For a 70B parameter model at BF16, that is 140GB of data to load from memory per forward pass.

The time to load this data at various HBM bandwidths:

  • H100 SXM5 at 3.35 TB/s: 140GB / 3,350 GB/s = 42 milliseconds
  • A100 SXM4 at 2.0 TB/s: 140GB / 2,000 GB/s = 70 milliseconds
  • Groq LPU at 80 TB/s (SRAM): 140GB / 80,000 GB/s = 1.75 milliseconds

This is why Groq is so fast for single-user LLM inference: the weights live in on-chip SRAM at roughly 24x the bandwidth of H100's HBM3. The compute units are not working harder - they are simply never waiting for data.

At larger batch sizes (batch 32+), the compute cost becomes dominant and the bandwidth advantage of Groq diminishes. For high-throughput serving where you batch 32-64 requests simultaneously, H100 becomes competitive or superior because you amortize the weight load cost across many outputs.

The formula for when a model is memory-bandwidth-bound versus compute-bound uses arithmetic intensity $I$:

$$I = \frac{\text{FLOPs per forward pass}}{\text{bytes loaded from memory per forward pass}}$$

$$I_{\text{threshold}} = \frac{\text{peak FLOPS}}{\text{peak memory bandwidth (bytes/s)}}$$

For H100: $I_{\text{threshold}} = 989 / 3.35 \approx 295$ FLOP/byte.

For a 70B model at batch 1: $I = \frac{2 \times 70 \times 10^9}{140 \times 10^9} = 1$ FLOP/byte - far below the threshold, so memory bound.

For a 70B model at batch 128: $I = 128$ FLOP/byte - still memory bound on H100, but with 128x more output per weight load than batch 1.

For batch sizes where $I > 295$: compute bound. That requires batch sizes around 300+ for 70B models - unusual in practice for interactive serving.
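The same roofline arithmetic as a short sketch; the peak numbers are the H100 figures above, and the tokens/second values are single-chip upper bounds that ignore parallelism, KV-cache traffic, and kernel overheads:

```python
def decode_bound(params_b: float, batch: int,
                 peak_tflops: float, mem_bw_tbs: float,
                 dtype_bytes: int = 2):
    """Roofline estimate for one decode step of a dense transformer."""
    t_mem = params_b * 1e9 * dtype_bytes / (mem_bw_tbs * 1e12)     # stream all weights once
    t_compute = 2 * params_b * 1e9 * batch / (peak_tflops * 1e12)  # ~2P FLOPs per token
    regime = "memory-bound" if t_mem > t_compute else "compute-bound"
    return regime, batch / max(t_mem, t_compute)

for b in (1, 64, 128, 512):
    regime, tps = decode_bound(70, b, peak_tflops=989, mem_bw_tbs=3.35)
    print(f"batch {b:4d}: {regime:13s}  <= {tps:9,.0f} tok/s")
# The regime flips near batch ~295, matching the arithmetic-intensity threshold.
```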

The implication: for LLM inference serving latency-sensitive users, buy memory bandwidth, not TFLOPS.


The Groq Architecture: Why It Is Different

Groq's LPU deserves a deeper look because it represents a genuinely different hardware philosophy that is relevant to understanding when custom silicon wins.

Groq's key observation: transformer inference has a predictable, regular computation graph. Every token generation step executes the exact same operations in the exact same order. This determinism is something GPUs do not exploit - GPUs use dynamic scheduling to handle irregular workloads, but that scheduling adds overhead and reduces memory access predictability.

Groq's response: build a chip with no caches, no dynamic scheduling, no speculative execution. Instead, the compiler statically schedules every memory access and every computation for every clock cycle. The instruction stream is precomputed at compile time - the chip executes a sequence like a play being performed rather than improvising each scene.

The result: 80 TB/s of internal SRAM bandwidth (weights are distributed across 230MB of SRAM on each chip), deterministic latency, and no memory access jitter. At batch size 1, Groq achieves ~500 tokens/second for Llama 3 70B, versus ~80 tokens/second on an H100 at batch 1.

The trade-off: Groq's compiler requires minutes to compile a model and does not support dynamic shapes, dynamic control flow, or arbitrary custom operators. If your model uses non-standard attention patterns or has variable-length routing, Groq cannot support it. The chip is excellent for stable, production models and poor for research and experimentation.

This maps to the general principle for custom silicon: determinism enables optimization. The more unpredictable your workload, the less custom silicon can exploit its advantages.


Multi-Chip Scaling: Where the Architecture Wars Are Won

At scale (training > 70B parameters on hundreds of chips), the inter-chip communication architecture determines training efficiency as much as single-chip compute.

The problem: for data-parallel training with gradient synchronization, every chip must share its gradient updates with all other chips after each backward pass. For a 70B model with BF16 gradients, that is 140GB of gradients per step; in a 512-chip job, each chip's gradients must be combined with those of its 511 peers.

An all-reduce of 140GB across 512 chips using the ring-allreduce algorithm moves $2 \times \frac{N-1}{N} \times D \approx 2 \times 140\,\text{GB} = 280\,\text{GB}$ through each chip. At 900 GB/s within a node (NVLink 4.0) but only ~400 Gb/s per link between nodes (InfiniBand), the inter-node portion of the all-reduce dominates.
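Plugging the section's numbers into that expression gives a feel for the gap (nominal per-chip bandwidths, ignoring latency, congestion, and compute/communication overlap):

```python
def ring_allreduce_seconds(grad_bytes: float, n_chips: int, bw_bytes_per_s: float) -> float:
    # Each chip sends and receives 2 * (N-1)/N * D bytes in a ring all-reduce
    return 2 * (n_chips - 1) / n_chips * grad_bytes / bw_bytes_per_s

grads = 140e9  # 70B parameters, BF16 gradients
print(f"NVLink-class (900 GB/s): {ring_allreduce_seconds(grads, 512, 900e9):.2f} s/step")
print(f"Inter-node IB (400 GB/s): {ring_allreduce_seconds(grads, 512, 400e9):.2f} s/step")
```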

This is where TPU v4's ICI fabric shows its advantage for very large runs. The ICI is a 3D torus with 1,200 GB/s per chip in all-to-all configuration. At 4,096-chip scale, TPU v4 ICI sustains near-linear scaling efficiency because every chip can communicate with every other chip with equal bandwidth - no fat-tree InfiniBand hierarchy with oversubscription.

For NVIDIA multi-node training:

  • Intra-node: 8x H100 connected by NVLink at 900 GB/s - excellent
  • Inter-node: NDR InfiniBand at 400 Gb/s per port, 8 ports per node = 3.2 Tb/s = 400 GB/s - much lower than NVLink

At 64-node (512 H100) scale, the inter-node communication becomes the primary bottleneck. Teams solve this with:

  1. Tensor parallelism within nodes (use NVLink), pipeline parallelism between nodes (use InfiniBand more sparingly)
  2. Gradient accumulation to amortize communication cost over multiple microbatches
  3. Async pipeline flushing to overlap compute and communication (Megatron-LM 1F1B schedule)

The hardware selection implication: if your training run requires 1,024+ chips, evaluate the inter-chip fabric as carefully as single-chip performance. TPU's ICI or NVIDIA's NVLink fabric scale differently and the crossover point depends on your specific model parallelism strategy.


Model FLOP Utilization (MFU): The Real Performance Metric

When evaluating hardware alternatives, do not compare vendor TFLOPS numbers directly. Compare achievable MFU (Model FLOP Utilization) - the fraction of peak TFLOPS that a real training job achieves.

$$\text{MFU} = \frac{\text{measured tokens/second} \times \text{FLOPs per token}}{N_{\text{chips}} \times \text{peak FLOPS per chip}}$$

For a 70B model, the forward pass costs $2P = 2 \times 70 \times 10^9 = 140$ GFLOPs per token. Including the backward pass: $\approx 3 \times 140 = 420$ GFLOPs per token.

Typical MFU values in production:

| Hardware | Framework | Typical training MFU | Notes |
|---|---|---|---|
| H100 SXM5 | Megatron-LM | 38-55% | FlashAttention 2 required |
| H100 SXM5 | PyTorch FSDP | 28-40% | Higher communication overhead |
| A100 SXM4 | Megatron-LM | 35-50% | Mature, well-optimized |
| TPU v4 | JAX + MaxText | 45-60% | Excellent ICI efficiency |
| Intel Gaudi 2 | SynapseAI | 25-40% | Less optimized for Megatron |
| AWS Trainium 2 | Neuron SDK | 30-50% | Standard transformers only |
A reference implementation:

```python
def calculate_mfu(
    tokens_per_second: float,
    model_params_billions: float,
    n_chips: int,
    chip_peak_tflops: float,
    include_backward: bool = True,
) -> float:
    """
    Calculate Model FLOP Utilization (MFU).

    Args:
        tokens_per_second: Measured system-wide training throughput
        model_params_billions: Model parameter count in billions
        n_chips: Number of accelerator chips in the system
        chip_peak_tflops: Peak BF16 TFLOPS per chip
        include_backward: Whether to include the backward pass in the FLOPs estimate

    Returns:
        MFU as a fraction (0.0 to 1.0)

    Example:
        # 512 H100s training LLaMA 70B at ~410,000 tokens/second system-wide
        mfu = calculate_mfu(410_000, 70, 512, 989)
        # Expected: ~0.34 (34% MFU)
    """
    # FLOPs per token for a transformer: ~2P forward, ~4P backward,
    # so ~6P per token for training; ~2P for forward/inference only
    flops_per_token = (6e9 if include_backward else 2e9) * model_params_billions

    # Total system peak TFLOPS
    system_peak_tflops = n_chips * chip_peak_tflops

    # MFU = achieved TFLOPS / peak TFLOPS
    achieved_tflops = (tokens_per_second * flops_per_token) / 1e12
    return achieved_tflops / system_peak_tflops


# Illustrative configurations: the system-wide throughput numbers are chosen
# to land in the MFU ranges from the table above, not measurements
configs = [
    # (description, tokens/sec, params_B, n_chips, peak_tflops)
    ("512x H100, LLaMA 70B, Megatron-LM", 410_000, 70, 512, 989),
    ("256x A100, LLaMA 70B, FSDP", 55_000, 70, 256, 312),
    ("1024x TPU v4, LLaMA 70B, JAX", 340_000, 70, 1024, 275),
    ("256x Gaudi 2, LLaMA 7B, SynapseAI", 840_000, 7, 256, 432),
]

for desc, tps, params, chips, peak in configs:
    mfu = calculate_mfu(tps, params, chips, peak)
    print(f"{desc}")
    print(f"  MFU: {mfu:.1%} | Throughput: {tps:,} tok/s")
    print()
```

MFU is the metric to demand from any vendor claiming their hardware outperforms H100. "Our chip is 2x faster than H100" only means something if they can demonstrate 2x higher MFU on your actual model architecture with your actual software stack.


The Build vs Buy vs Cloud Decision

One dimension often missing from hardware discussions: whether to own hardware at all, versus using cloud, versus partnering with a hardware vendor for co-design.

Cloud (pay-as-you-go) is optimal when:

  • Utilization is unpredictable or bursty (training experiments, research)
  • You need access to the latest hardware before it is available for purchase
  • You want zero capital expenditure (startup in seed stage)
  • You need geo-distributed inference with dynamic scaling
  • Overhead: cloud premium over on-premise is typically 2-3x over 3 years at high utilization, but zero capex risk

On-premise purchase is optimal when:

  • Utilization will exceed 60% continuously for 3+ years
  • You have data sovereignty requirements
  • You want the lowest possible networking latency to adjacent storage or services
  • You can negotiate GPU allocation from NVIDIA or AMD (requires significant purchase volume)
  • Risk: hardware procurement takes 3-12 months for large orders; utilization may be lower than projected

Reserved instances (cloud) bridge both worlds:

  • 1-year or 3-year cloud reservations can reduce hourly rate by 40-60% vs on-demand
  • Requires capacity planning but not capital expenditure
  • Returns hardware at contract end - no residual value risk
  • AWS Savings Plans, GCP Committed Use Discounts, Azure Reserved Instances all offer this

Co-design partnerships are emerging for hyperscale users:

  • Meta partnered with NVIDIA and AMD for custom chip configurations
  • Google designed TPU architecture jointly with DeepMind researchers
  • Microsoft developed Maia 100 in-house, fabricated on TSMC's 5nm process
  • Requires $50M+ annual hardware spend to access vendor engineering attention

For most ML organizations, the realistic choices are cloud (flexibility) versus cloud reserved instances (cost efficiency at scale). On-premise becomes relevant at $5M+/year hardware spend. Custom silicon co-design is only realistic for 10+ billion dollar revenue companies.


Summary

Hardware selection for AI is a multi-dimensional problem that cannot be reduced to TFLOPS comparisons. The framework:

  1. Workload fit first - match hardware architecture to your computation pattern before any other analysis. Training vs inference, model size, batch size, and operator profile all constrain the viable options. Memory-bandwidth-bound inference workloads need different hardware than compute-bound training runs.

  2. Benchmark your actual workload - never rely on vendor benchmarks. Run your model, your batch size, your training loop on the candidate hardware and measure real throughput. Report MFU (Model FLOP Utilization), not TFLOPS.

  3. Total cost over 3 years - include hardware (purchase or cloud reservation), power, cooling, porting cost, and ongoing maintenance. Porting cost alone can eliminate the financial advantage of cheaper hardware.

  4. Ecosystem maturity is the silent killer - operator coverage gaps, immature debugging tools, and small communities multiply the cost of every problem you encounter. The more novel the hardware, the larger this risk.

  5. Team capability is fixed in the short term - switching hardware platforms costs months of reduced productivity. Factor this explicitly into every comparison.

  6. Scale changes the optimal architecture - a chip that is optimal at 8-GPU scale may not be optimal at 1,024-chip scale. Evaluate inter-chip communication bandwidth and topology, not just single-chip specs, for large training runs.

The default answer remains NVIDIA GPUs for most workloads - not because they are always technically optimal, but because the ecosystem advantage compounds over time and MFU on mature CUDA software is higher than on newer stacks. Alternative hardware becomes compelling when cost savings are large enough to absorb porting costs, when the workload is stable enough that operator coverage gaps are manageable, and when the team has the capacity to invest in the transition.

The exceptions are real: Google TPU for JAX-native shops building at pod scale, Inferentia 2 for high-throughput AWS inference on standard transformer architectures, Apple Silicon for on-device deployment, and Groq for the rare use case requiring sub-10ms LLM time-to-first-token. Each has a specific niche where it genuinely wins. Outside that niche, H100 remains the safe choice that delivers predictable results and lets your team focus on the model rather than the infrastructure.

© 2026 EngineersOfAI. All rights reserved.