Intel Gaudi and Habana Labs
Reading time: ~35 min · Interview relevance: Medium · Target roles: ML Engineer, AI Infrastructure, Cost-Conscious Platform Teams
The $20,000 Question
Your team just got budget approval for a 64-card GPU cluster to fine-tune a 70B parameter LLM. The procurement spreadsheet has two rows. Row one: 64 NVIDIA H100 SXM5 cards at roughly $1.92 million total. Row two: 64 Intel Gaudi 2 cards at roughly $640,000 total.
That's $1.28 million sitting on the table. The CTO wants to know why you would not use the cheaper option. You open your browser, search for "Gaudi 2 training performance," and find a mix of Intel marketing benchmarks and sparse third-party results. The H100 is 3x more expensive. Is it also 3x faster? If not, you have a procurement problem.
This is the real question Gaudi forces you to answer. Not "is it as fast as H100?" - it is not. The question is whether Gaudi's price-to-performance ratio, combined with its on-chip networking, justifies the ecosystem migration cost. For some workloads, the math works out clearly. For others, the CUDA ecosystem lock-in is so deep that switching would cost more in engineering time than the hardware savings.
By the end of this lesson you will be able to run that analysis properly. You will understand the Gaudi 2 and Gaudi 3 architectures - the Tensor Processor Cores, the Matrix Multiplication Engine, the 96GB HBM2e memory, and crucially the 24 on-chip 100GbE RoCE ports that let you build a 512-card cluster without a single InfiniBand switch. You will know how to write and run training jobs using habana_frameworks.torch, what PyTorch ops have gaps on HPU, and how to size clusters for LLM training using Gaudi's networking model. The $1.28 million question will have a real answer.
Why Intel Bought Habana: A Strategic Acquisition in Desperation
To understand Gaudi you need to understand why Intel spent $2 billion to buy a three-year-old Israeli startup in December 2019.
Intel's business model had always been built on general-purpose CPUs. The data center was their kingdom - Xeon processors were in virtually every server worldwide. Then, between 2012 and 2019, something broke that model permanently. Deep learning happened. And deep learning did not run well on CPUs.
NVIDIA, whose GPUs were originally designed for graphics rendering, turned out to have exactly the right architecture for neural network training: thousands of parallel compute units, high memory bandwidth, and a programming model (CUDA, released 2007) that let researchers express parallel operations naturally. By 2019, NVIDIA had over 90% market share in data center AI accelerators. Not because their hardware was perfect for AI - it was not, GPUs carry a lot of graphics-specific overhead - but because CUDA had twelve years of ecosystem investment behind it. Libraries, tooling, tutorials, forum posts, and a generation of ML engineers who knew nothing else.
Intel tried to compete with their own efforts. The Nervana Neural Network Processor (NNP), acquired when Intel bought Nervana Systems in 2016 for $408 million, was supposed to be their answer. It was cancelled in 2020 before reaching meaningful production deployment. Intel's Xe GPU line had AI compute features but no serious deep learning ecosystem.
Habana Labs was different. Founded in 2016 by David Dahan and Ran Halutz, with backing from chairman Avigdor Willenz (who had previously sold Galileo Technologies to Marvell), the company had two products: Goya (2018), an inference chip, and Gaudi (2019), a training chip. Gaudi was specifically designed for distributed deep learning training from the ground up. It had on-chip networking. It had a programmable compute architecture. It had memory bandwidth competitive with NVIDIA's V100.
Intel paid $2 billion for Habana in December 2019 - roughly eighteen months after Google introduced the TPU v3. The acquisition was not about Habana's current products. It was about stopping the bleeding. Intel needed an AI training story before their data center business collapsed entirely under NVIDIA's dominance.
The resulting product line - Gaudi 2 (2022) and Gaudi 3 (2024) - represents that investment maturing. These are serious chips built for one purpose: large-scale deep learning training.
Historical Chip Timeline
| Chip | Year | Peak BF16 | Memory | Bandwidth | Notes |
|---|---|---|---|---|---|
| Habana Goya | 2018 | INT8 inference only | 32GB DRAM | 600 GB/s | Inference-only, pre-Intel |
| Habana Gaudi | 2019 | 125 TFLOPS | 32GB HBM | 1 TB/s | First training chip, pre-Gaudi 2 |
| Intel Gaudi 2 | 2022 | 865 TFLOPS BF16 | 96GB HBM2e | 2.45 TB/s | First major Intel-era chip |
| Intel Gaudi 3 | 2024 | 1835 TFLOPS BF16 | 128GB HBM2e | 3.7 TB/s | Direct H100 competitor |
For context: H100 SXM5 delivers roughly 1979 TFLOPS FP8 and 989 TFLOPS BF16 (dense, without sparsity). Gaudi 3 at 1835 TFLOPS BF16 is nominally about 1.85x the H100's BF16 figure on paper. Real-world training throughput is much closer than the paper numbers suggest - memory access patterns, operator coverage, and software maturity all matter - but Gaudi 3 is a genuine competitor in raw arithmetic, not a compromise chip.
Gaudi 2 Architecture: Inside the Chip
The Two Core Compute Engines
Gaudi 2 has two distinct types of compute hardware. Understanding this distinction is essential for understanding what Gaudi is fast at and where it struggles.
Tensor Processor Cores (TPC) - Gaudi 2 has 24 TPC cores. Each TPC is a programmable VLIW (Very Long Instruction Word) SIMD processor. "Programmable" is the key word. Unlike fixed-function accelerators, TPC cores can run arbitrary custom kernels written in a C-like language called TPC-C. This is Gaudi's answer to CUDA custom ops: if a PyTorch operation does not have a native HPU implementation, you can write a TPC kernel for it instead.
TPC cores handle element-wise operations, activation functions, normalizations, and any computation that does not fit the matrix multiply pattern. They are VLIW, meaning the compiler explicitly schedules multiple operations into a single instruction - there is no out-of-order execution hardware. The programmer (or compiler) is responsible for instruction-level parallelism. This makes TPCs highly efficient when the compiler can pack instructions well, but more sensitive to irregular compute patterns.
Matrix Multiplication Engine (MME) - The MME is fixed-function hardware optimized specifically for GEMM (general matrix multiply). GEMM is the dominant operation in transformer training: Q, K, V projections, attention score computation, feed-forward layers, output projections - all reduce to matrix multiplies. The MME on Gaudi 2 delivers the bulk of the 865 TFLOPS BF16 number.
The MME operates on BF16 (bfloat16) and FP32 natively. BF16 is the preferred training format: same exponent range as FP32 (8-bit exponent), lower mantissa precision (7 bits vs 23 bits for FP32), but empirically sufficient for most training jobs - usually without even needing loss scaling. The wide exponent range means BF16 handles the dynamic range of gradients without the stability issues that FP16 causes.
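A quick way to see why BF16 is preferred over FP16 for training is to compare their numeric ranges directly in PyTorch. This snippet is purely illustrative and runs on any device:
import torch
# BF16 keeps FP32's 8-bit exponent, so its representable range matches FP32.
# FP16 has a 5-bit exponent and overflows near 6.5e4, which is why FP16 training
# needs loss scaling while BF16 generally does not.
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>15}: max={info.max:.3e}, smallest normal={info.tiny:.3e}, eps={info.eps:.3e}")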
Gaudi 2 Compute Units
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
24x TPC Cores (Programmable)
┌─────────────────────────┐
│ VLIW SIMD Processor │
│ Custom kernel support │
│ Activation functions │
│ Element-wise ops │
│ Normalization layers │
└─────────────────────────┘
Matrix Multiplication Engine (MME)
┌─────────────────────────┐
│ Fixed-function GEMM │
│ BF16 / FP32 │
│ 865 TFLOPS peak BF16 │
│ Linear, Attention, │
│ Embedding operations │
└─────────────────────────┘
96GB HBM2e @ 2.45 TB/s
Memory: 96GB HBM2e at 2.45 TB/s
Gaudi 2 ships with 96GB of HBM2e (High Bandwidth Memory 2e). For comparison, H100 SXM5 has 80GB HBM3 at 3.35 TB/s. Gaudi 2 has more total memory but lower bandwidth.
The implications for LLM training:
- A LLaMA 3 70B model in BF16 requires approximately 140GB for weights alone. At 96GB per Gaudi 2 card, you need at least 2 cards using tensor parallelism, which puts the full model in 192GB. The H100 at 80GB also requires 2 cards for the same reason.
- Gradient and optimizer state memory (AdamW keeps fp32 master weights plus two fp32 moment estimates - roughly 12 extra bytes per parameter on top of the BF16 weights and gradients) means you need pipeline or ZeRO-3 parallelism regardless of hardware.
- Gaudi 2's 2.45 TB/s vs H100's 3.35 TB/s means memory-bandwidth-bound phases run at roughly 73% of H100 speed. For compute-bound operations (large batch sizes), the TFLOPS number matters more than bandwidth.
For Gaudi 2 running a 7B BF16 model in single-stream decoding, every generated token has to stream all ~14GB of weights from HBM, so the bandwidth-bound ceiling is roughly 2.45 TB/s / 14 GB ≈ 175 tokens/sec.
This is the theoretical ceiling. Real throughput with batch size 1 inference runs at 60-80% of this due to overhead; the sketch below works through the arithmetic.
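A minimal sketch of that calculation, using the bandwidth and weight-size figures from the text above and ignoring KV-cache traffic:
# Bandwidth-bound decode ceiling for a 7B-parameter BF16 model on one Gaudi 2.
# Assumes every generated token streams all weights from HBM exactly once.
params = 7e9
bytes_per_param = 2                      # BF16
hbm_bandwidth = 2.45e12                  # bytes/sec (2.45 TB/s)
weight_bytes = params * bytes_per_param  # ~14 GB
ceiling = hbm_bandwidth / weight_bytes
print(f"Theoretical ceiling: {ceiling:.0f} tokens/sec")                     # ~175
print(f"At 60-80% efficiency: {0.6 * ceiling:.0f}-{0.8 * ceiling:.0f} tokens/sec")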
The Networking Advantage: 24 On-Chip 100GbE RoCE Ports
This is where Gaudi 2 genuinely differentiates from NVIDIA. Read this carefully because it fundamentally changes how you design a training cluster.
NVIDIA's approach to multi-GPU communication has two layers:
- Within a server: NVLink connects GPUs on the same node at 900 GB/s bidirectional (H100 NVLink 4.0)
- Between servers: InfiniBand HDR (200 Gb/s) or NDR (400 Gb/s) switches connect nodes
The InfiniBand switches are expensive. A single InfiniBand NDR switch costs on the order of $100,000-200,000. Building a 512-GPU cluster with fat-tree topology requires multiple tiers of switches, and the networking can cost as much as the GPUs themselves.
Gaudi 2's answer: put the network directly on the chip. Each Gaudi 2 has 24 ports of 100GbE RoCE v2. RoCE is "RDMA over Converged Ethernet" - it gives you InfiniBand-class latency and RDMA semantics over standard Ethernet infrastructure. The 24 ports give Gaudi 2 a total of 2.4 Tbps of inter-chip bandwidth per card.
In a standard 8-card Gaudi 2 server (HLS-2 system), the 8 cards connect to each other directly through these ports. Scale-out between servers uses standard Ethernet switches - not InfiniBand. A 100GbE Ethernet switch at the same port count costs roughly 10-20x less than InfiniBand.
NVIDIA Scale-Out (512 GPUs) Gaudi 2 Scale-Out (512 cards)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Server [8x H100] Server [8x Gaudi 2]
NVSwitch (intra-server) On-chip RoCE (intra/inter)
| |
InfiniBand NDR switch Ethernet 100GbE switch
$100k-200k each $5k-20k each
| |
Fat-tree fabric Flat or fat-tree Ethernet
Multiple switch tiers Standard L2/L3 fabric
The practical consequence: a 64-card Gaudi 2 cluster might save $500,000-800,000 in networking hardware compared to the equivalent InfiniBand fabric for NVIDIA GPUs. This significantly improves the total cost of ownership calculation.
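To make "might save" concrete, here is a back-of-envelope comparison. Every price below is an illustrative assumption, not a quote - substitute real vendor numbers before using this in a procurement argument:
# Back-of-envelope fabric cost for a 64-card (8-server) cluster.
# All prices are placeholder assumptions for illustration only.
servers = 8
ib_switch_cost = 200_000        # assumed InfiniBand NDR switch price
eth_switch_cost = 15_000        # assumed 100GbE Ethernet switch price
switches_needed = 2             # assumed for either fabric at this scale
ib_nics_per_server = 8
ib_nic_cost = 2_000             # assumed per-NIC cost; Gaudi needs no NICs (RoCE is on-chip)
ib_fabric = switches_needed * ib_switch_cost + servers * ib_nics_per_server * ib_nic_cost
eth_fabric = switches_needed * eth_switch_cost
print(f"InfiniBand fabric: ${ib_fabric:,}")     # $528,000 under these assumptions
print(f"Ethernet fabric:   ${eth_fabric:,}")    # $30,000
print(f"Savings:           ${ib_fabric - eth_fabric:,}")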
Gaudi 2 System Configuration: HLS-2
Intel's standard Gaudi 2 server platform is called HLS-2 (Habana Large Scale system, generation 2). Key specs:
- 8x Intel Gaudi 2 cards per server
- 2x Intel Xeon processors (host CPU)
- 512GB DDR5 system memory
- PCIe Gen5 host interface
- 24x 100GbE RoCE ports per Gaudi 2, of which 21 form an all-to-all mesh inside the server (three 100GbE links to each of the other seven cards) and 3 per card are exposed externally for scale-out
The 21 intra-server ports on each card link it to every peer at 300 Gbps, so collectives inside a server never touch an external switch. The 3 external ports per card add up to 24x 100GbE per server - 2.4 Tbps of scale-out bandwidth.
Gaudi 3 Architecture: 2024's Answer to H100
Gaudi 3, released in 2024, is a substantial jump over Gaudi 2. Intel describes it as roughly 2x Gaudi 2 in training throughput.
Compute Scaling
| Parameter | Gaudi 2 | Gaudi 3 |
|---|---|---|
| TPC cores | 24 | 64 |
| MME engines | 2 | 8 |
| BF16 TFLOPS | 865 | 1835 |
| FP8 TFLOPS | N/A | ~3670 (estimated) |
| HBM capacity | 96GB HBM2e | 128GB HBM2e |
| Memory bandwidth | 2.45 TB/s | 3.7 TB/s |
| Network ports | 24x 100GbE | 24x 200GbE |
| Total network bandwidth | 2.4 Tbps | 4.8 Tbps |
The 64 TPC cores on Gaudi 3 represent roughly 2.7x the programmable compute of Gaudi 2. This matters because the TPC cores run everything that is not a matrix multiply - and modern transformer architectures have increasingly complex operations (RoPE positional embeddings, FlashAttention variants, MoE routing) that live in TPC territory.
The upgrade to 24x 200GbE ports doubles the scale-out bandwidth. For a 4096-card Gaudi 3 cluster (which Intel has demoed), this means the interconnect fabric scales without InfiniBand at bandwidths that approach NVLink performance at rack scale.
FP8 Training on Gaudi 3
Gaudi 3 introduces FP8 training support, matching H100. FP8 is important because it roughly doubles throughput over BF16 for the same hardware while maintaining training stability with appropriate loss scaling and amax history tracking.
Intel reports Gaudi 3 achieves comparable throughput to H100 for LLaMA 2 70B training with FP8. Independent benchmarks from cloud providers running Gaudi 3 show 85-95% of H100 throughput at 60-70% of the cost.
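The "amax history tracking" mentioned above can be illustrated in plain PyTorch: keep a rolling window of recent absolute maxima per tensor and derive the FP8 scale from it. This is a conceptual sketch of delayed scaling, not Intel's or NVIDIA's actual FP8 recipe; DelayedScaler and FP8_E4M3_MAX are names invented for this example:
import torch
FP8_E4M3_MAX = 448.0   # largest representable magnitude in the e4m3 format
class DelayedScaler:
    """Toy delayed-scaling helper: the FP8 scale is derived from a rolling amax history."""
    def __init__(self, history_len=16):
        self.history = []
        self.history_len = history_len
    def update(self, tensor):
        amax = tensor.abs().max().item()
        self.history = (self.history + [amax])[-self.history_len:]
        # Use the max over the window so one unusually small batch does not shrink the scale.
        return FP8_E4M3_MAX / max(self.history)
scaler = DelayedScaler()
for _ in range(5):
    activations = torch.randn(1024, 1024) * 3.0
    scale = scaler.update(activations)
    # On FP8 hardware the scaled tensor would now be cast to an FP8 dtype.
    scaled = torch.clamp(activations * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
print(f"Current scale factor: {scale:.2f}")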
SynapseAI Software Stack
The software story is where Gaudi's biggest challenges and biggest investments collide. CUDA has 17+ years of ecosystem behind it; SynapseAI is trying to compress that development into a few years.
Device Model: HPU (Habana Processing Unit)
Gaudi exposes itself to PyTorch as a device type called hpu. This parallels how CUDA exposes as cuda. The device model is familiar:
import torch
import habana_frameworks.torch.core as htcore
# Check availability
print(torch.hpu.is_available()) # True if Gaudi drivers present
print(torch.hpu.device_count()) # Number of Gaudi cards in system
# Move tensors to HPU
device = torch.device("hpu")
x = torch.randn(1024, 1024).to(device)
y = torch.randn(1024, 1024).to(device)
z = x @ y # Matrix multiply on MME
# Move back to CPU
result = z.to("cpu")
print(result.shape)
Lazy Execution Mode
One critical difference from CUDA: Gaudi uses lazy execution by default. This is similar to JAX's deferred computation or PyTorch XLA's approach.
In eager execution (CUDA default): each PyTorch op runs immediately on the device.
In lazy execution (Gaudi default): PyTorch ops are recorded into a computation graph. The graph is compiled and dispatched to the chip only when you explicitly call htcore.mark_step() or when a tensor value is needed on the CPU.
Why lazy? Because the SynapseAI compiler can fuse operations, optimize memory layout, and schedule execution across TPC and MME more effectively when it sees the full graph rather than op-by-op.
import torch
import habana_frameworks.torch.core as htcore
device = torch.device("hpu")
model = MyModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for batch in dataloader:
inputs = batch["input_ids"].to(device)
labels = batch["labels"].to(device)
outputs = model(inputs)
loss = loss_fn(outputs, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# CRITICAL: mark_step() tells SynapseAI compiler to dispatch
# the accumulated graph. Without this, nothing executes.
htcore.mark_step()
# Reading loss forces another mark_step implicitly,
# but explicit is better for performance tuning
print(f"Loss: {loss.item()}")
:::warning Forgetting mark_step() Stalls Training
If you forget htcore.mark_step(), operations accumulate indefinitely in the lazy graph. Eventually something triggers a forced dispatch (like loss.item() which reads a value to CPU), but by then the graph is enormous and compilation is slow. Call mark_step() at the end of every training step.
:::
HPU-Fused Optimizers
Standard PyTorch optimizers (Adam, SGD) have many small element-wise operations per parameter update. On a CPU or GPU in eager mode, these run as individual kernels. On HPU, you want them fused into single TPC kernel dispatches.
Intel provides HPU-fused versions of common optimizers:
from habana_frameworks.torch.hpex.optimizers import FusedAdamW, FusedSGD
# Drop-in replacement for AdamW - fused on HPU
optimizer = FusedAdamW(
model.parameters(),
lr=1e-4,
betas=(0.9, 0.999),
eps=1e-8,
weight_decay=0.01
)
# Or SGD
optimizer = FusedSGD(
model.parameters(),
lr=1e-2,
momentum=0.9,
weight_decay=1e-4
)
The fused optimizers can be 3-5x faster than the unfused equivalents on HPU because they eliminate the overhead of dispatching hundreds of small kernels per optimizer step.
Complete Training Script Example
Here is a full training script for a transformer model on Gaudi, demonstrating the key patterns:
import os
import torch
import habana_frameworks.torch.core as htcore
import habana_frameworks.torch.distributed.hccl as hccl
from habana_frameworks.torch.hpex.optimizers import FusedAdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.utils.data import DataLoader, Dataset
# Initialize distributed training if multi-card
def setup_distributed():
if "RANK" in os.environ:
import torch.distributed as dist
dist.init_process_group(backend="hccl") # Gaudi uses hccl, not nccl
local_rank = int(os.environ["LOCAL_RANK"])
torch.hpu.set_device(local_rank)
return local_rank
return 0
def train():
local_rank = setup_distributed()
device = torch.device(f"hpu:{local_rank}")
# Load model - same as any PyTorch model
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3-8B",
torch_dtype=torch.bfloat16, # BF16 for Gaudi efficiency
)
model = model.to(device)
# Wrap in DDP if distributed
if torch.distributed.is_initialized():
from torch.nn.parallel import DistributedDataParallel as DDP
model = DDP(model, device_ids=[local_rank])
# HPU-fused optimizer
optimizer = FusedAdamW(
model.parameters(),
lr=2e-5,
betas=(0.9, 0.95),
weight_decay=0.1
)
    # No GradScaler needed: BF16 has FP32's exponent range, so loss scaling is
    # generally unnecessary (unlike FP16 training on CUDA).
dataset = MyDataset(...)
dataloader = DataLoader(dataset, batch_size=8, pin_memory=True)
model.train()
for step, batch in enumerate(dataloader):
input_ids = batch["input_ids"].to(device)
attention_mask = batch["attention_mask"].to(device)
labels = batch["labels"].to(device)
# Forward pass
with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
outputs = model(
input_ids=input_ids,
attention_mask=attention_mask,
labels=labels
)
loss = outputs.loss
# Backward
optimizer.zero_grad(set_to_none=True)
loss.backward()
# Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
# ALWAYS call mark_step after the optimizer step
htcore.mark_step()
if step % 10 == 0:
# loss.item() triggers a sync - use sparingly
print(f"Step {step}: loss = {loss.item():.4f}")
htcore.mark_step() # After item() call
if __name__ == "__main__":
train()
DeepSpeed Integration for Multi-Gaudi Training
For large model training across multiple Gaudi cards or servers, Intel maintains a fork of DeepSpeed with HPU support:
# deepspeed_config.json
{
"train_batch_size": 512,
"gradient_accumulation_steps": 4,
"bf16": {
"enabled": true
},
"zero_optimization": {
"stage": 3,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"steps_per_print": 100
}
import deepspeed
import habana_frameworks.torch.core as htcore
# DeepSpeed with HPU backend - same API as CUDA
model_engine, optimizer, _, _ = deepspeed.initialize(
model=model,
config="deepspeed_config.json"
)
for batch in dataloader:
inputs = batch.to(model_engine.local_rank)
loss = model_engine(inputs)
model_engine.backward(loss)
model_engine.step()
htcore.mark_step() # Still required
Launch command:
# Single server, 8x Gaudi 2
deepspeed --num_gpus 8 train.py --deepspeed deepspeed_config.json
# Multi-server via mpirun
mpirun -n 64 --hostfile hostfile \
-x MASTER_ADDR=10.0.0.1 \
-x MASTER_PORT=29500 \
python train.py --deepspeed deepspeed_config.json
[Diagram: scale-out architecture for building a multi-server Gaudi 2 cluster]
Performance vs H100: Honest Numbers
The honest performance comparison table is harder to build than Intel's marketing benchmarks suggest. Real performance depends on batch size, model architecture, sequence length, and whether your workload exercises primarily matrix multiply (MME-bound) or custom ops (TPC-bound).
| Workload | Gaudi 2 Throughput | H100 SXM5 Throughput | Gaudi 2 / H100 Ratio | Price Ratio |
|---|---|---|---|---|
| LLaMA 2 7B, BF16 training | ~620 tokens/sec/card | ~820 tokens/sec/card | 0.76x | 0.33x |
| LLaMA 2 70B, BF16 ZeRO-3 | ~85 tokens/sec/8-card | ~140 tokens/sec/8-card | 0.61x | 0.33x |
| BERT large fine-tuning | ~1200 samples/sec | ~1600 samples/sec | 0.75x | 0.33x |
| Stable Diffusion training | ~95 images/sec | ~140 images/sec | 0.68x | 0.33x |
The pattern: Gaudi 2 delivers roughly 60-80% of H100 throughput at approximately 33% of the price. On a pure dollar-per-TFLOP basis, Gaudi 2 wins clearly. On dollar-per-trained-model, the math depends on the specific workload and how close to the 60% or 80% end you land.
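One way to turn that pattern into a decision is to compute cost per training token. The throughput figures below come from the LLaMA 2 7B row of the table above; the hourly card costs are placeholder assumptions to replace with your own pricing:
# Cost per training token, using the LLaMA 2 7B row from the table above.
# The hourly card costs are illustrative assumptions, not published prices.
configs = {
    "Gaudi 2": {"tokens_per_sec": 620, "cost_per_card_hour": 1.50},
    "H100":    {"tokens_per_sec": 820, "cost_per_card_hour": 4.50},
}
for name, c in configs.items():
    tokens_per_hour = c["tokens_per_sec"] * 3600
    dollars_per_million_tokens = c["cost_per_card_hour"] / tokens_per_hour * 1e6
    print(f"{name}: ${dollars_per_million_tokens:.2f} per million training tokens")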
Gaudi 3 narrows the gap significantly:
| Workload | Gaudi 3 Throughput | H100 SXM5 Throughput | Gaudi 3 / H100 Ratio |
|---|---|---|---|
| LLaMA 3 70B, BF16 | comparable | baseline | ~0.9x |
| LLaMA 3 405B, FP8 | comparable | baseline | ~0.85-0.95x |
At 85-95% of H100 throughput at 60-70% of H100 price, Gaudi 3 makes a compelling TCO argument for large-scale training.
The Ecosystem Gap: What CUDA Has That SynapseAI Lacks
This is the most important section for anyone seriously evaluating Gaudi for production training.
What Works Well
- Standard PyTorch operators (linear, conv, attention, normalization layers)
- Transformers library models (BERT, GPT-2, LLaMA, T5, ViT via Hugging Face)
- DeepSpeed ZeRO stages 1, 2, 3
- FSDP (Fully Sharded Data Parallel)
- BF16 mixed precision training
- Gradient checkpointing
- torch.compile (limited support)
What Has Gaps or Does Not Work
- CUDA custom ops: if your model uses torch.ops.my_extension.my_cuda_kernel(), this does not run on HPU. You need a TPC kernel equivalent or a workaround.
- FlashAttention 2 and 3: the Tri Dao CUDA implementation does not run on HPU. Intel has its own optimized attention kernel in SynapseAI, but it is not a drop-in import.
- Triton kernels: Triton compiles to CUDA PTX and is not supported on HPU (as of 2024).
- vLLM, TensorRT-LLM: these are NVIDIA-specific inference engines.
- bitsandbytes: the quantization library uses CUDA kernels and is not available on HPU.
- Some experimental PyTorch features: dynamo tracing and torch.export have limited support.
Migration Effort Estimation
For a standard Hugging Face-based training pipeline with no custom ops: 1-3 days migration effort. The main changes are device targeting and optimizer swaps.
For a research codebase with custom CUDA kernels, Triton ops, or CUDA-specific libraries: weeks to months. Each custom op needs a TPC kernel replacement or elimination. This cost can easily exceed the hardware savings.
:::danger No CUDA Support - Assess Before Committing
Gaudi does not run CUDA code. There is no compatibility layer that translates CUDA to TPC. Before purchasing Gaudi hardware, audit your training code for:
- Custom CUDA extensions (load_inline, .cu files)
- Triton kernels
- Libraries with CUDA backends (bitsandbytes, flash-attn, apex)
- vLLM or TensorRT inference paths

Replacing each of these is a real engineering project. Calculate the migration cost before calculating the hardware savings.
:::
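A rough starting point for that audit is a script that scans the installed environment and the repository for the usual CUDA-only suspects. This is a heuristic sketch, not an exhaustive compatibility checker - the package and file-pattern lists are assumptions you should extend for your own stack:
import importlib.util
import pathlib
# Packages that ship CUDA kernels and have no drop-in HPU equivalent.
CUDA_ONLY_PACKAGES = ["flash_attn", "bitsandbytes", "triton", "apex", "vllm"]
# Source patterns that usually indicate custom CUDA extensions.
CUDA_SOURCE_PATTERNS = ["*.cu", "*.cuh"]
def audit(repo_root="."):
    findings = []
    for pkg in CUDA_ONLY_PACKAGES:
        if importlib.util.find_spec(pkg) is not None:
            findings.append(f"installed CUDA-only package: {pkg}")
    root = pathlib.Path(repo_root)
    for pattern in CUDA_SOURCE_PATTERNS:
        for path in root.rglob(pattern):
            findings.append(f"custom CUDA source file: {path}")
    for path in root.rglob("*.py"):
        text = path.read_text(errors="ignore")
        if "load_inline" in text or "cpp_extension" in text:
            findings.append(f"inline CUDA/C++ extension usage: {path}")
    return findings
if __name__ == "__main__":
    for finding in audit():
        print(finding)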
When Gaudi Makes Sense
The decision framework:
Gaudi is a strong choice when:
- You are running standard Transformer training with Hugging Face models
- You use DeepSpeed or FSDP for distributed training (both well-supported)
- You are cost-sensitive and need to maximize dollar-per-trained-model
- You are building a cluster from scratch and can avoid InfiniBand costs
- You have internal infrastructure engineers who can manage the ecosystem gaps
- You are running in a cloud that offers Gaudi instances (Intel Tiber Developer Cloud, AWS, OCI) and want the cost savings
Gaudi is a poor choice when:
- Your code has significant custom CUDA ops, Triton kernels, or CUDA libraries
- You need vLLM or TensorRT-LLM for production inference
- Your team has deep CUDA expertise and no time to learn HPU patterns
- You need the absolute lowest latency for inference serving
- You are using cutting-edge techniques published as "CUDA only" code
Running Gaudi on AWS and Intel Tiber Cloud
You do not need to buy Gaudi hardware to evaluate it. AWS offers first-generation Gaudi instances, and Intel's own cloud offers Gaudi 2 and Gaudi 3.
AWS DL1 instances - 8x first-generation Gaudi cards per instance. Older generation: available, but not recommended for new work.
Intel Tiber Developer Cloud - Intel's own cloud with Gaudi 2 and Gaudi 3 instances. Free tier available for evaluation. This is the best way to prototype before committing.
# On an AWS DL1 or Intel Tiber Gaudi instance
# Install Habana drivers and SynapseAI (pre-installed on managed images)
# Verify Gaudi cards visible
hl-smi # Gaudi equivalent of nvidia-smi
# Output:
# +-----------------------------------------------------------------------------+
# | HL-SMI Version: hlml/1.10.0 |
# |-------------------------------+----------------------+----------------------|
# | AIP Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC|
# | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | AIP-Util Compute M.|
# |===============================+======================+======================|
# | 0 Gaudi 2 *| 0000:19:00 | N/A|
# | N/A 38C N/A 120W / 600W | 96160MiB / 98304MiB | 0% Default |
# +-------------------------------+----------------------+----------------------|
Monitoring Gaudi Training
Habana provides monitoring tools similar to nvidia-smi and nvitop:
# hl-smi: shows all Gaudi cards, utilization, memory, power
hl-smi
# Watch mode (refresh every 1s)
watch -n 1 hl-smi
# CSV output for programmatic monitoring
hl-smi --query-aip=index,name,memory.used,memory.total,utilization.aip \
--format=csv,noheader
# For multi-node monitoring, Intel provides Grafana dashboards
# via habana-container-toolkit
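For programmatic monitoring (alerting, dashboards, autoscalers), a thin Python wrapper over that CSV query is usually enough. This sketch assumes the hl-smi flags shown above and strips whatever unit suffixes the tool prints:
import subprocess
def gaudi_card_stats():
    """Return (index, aip_util_pct, mem_used_mib, mem_total_mib) per card from hl-smi CSV output."""
    cmd = [
        "hl-smi",
        "--query-aip=index,utilization.aip,memory.used,memory.total",
        "--format=csv,noheader",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    cards = []
    for line in out.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        index = int(fields[0])
        util = float(fields[1].replace("%", "").strip())
        mem_used = float(fields[2].split()[0])    # drop the "MiB" suffix if present
        mem_total = float(fields[3].split()[0])
        cards.append((index, util, mem_used, mem_total))
    return cards
if __name__ == "__main__":
    for index, util, mem_used, mem_total in gaudi_card_stats():
        print(f"card {index}: {util:.0f}% busy, {mem_used:.0f}/{mem_total:.0f} MiB")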
Key metrics to watch during training:
- AIP utilization: should be 85-95% during steady training. Lower means data loading bottleneck.
- Memory usage: leave 5-10% headroom for workspace allocations
- Power draw: Gaudi 2 TDP is 600W. Thermal throttling starts when sustained at TDP.
Common Mistakes and Pitfalls
:::danger Assuming CUDA Libraries Transfer
pip install flash-attn installs CUDA-compiled C++ extensions. They will not load on HPU. The import will fail. Check every pip install for C++ extensions against Intel's compatibility list before assuming it works.
:::
:::danger Using NCCL for Multi-Card Communication
NVIDIA's NCCL is CUDA-only. For distributed training on Gaudi, the collective communication library is HCCL (Habana Collective Communications Library). Specify backend="hccl" in torch.distributed.init_process_group. Using backend="nccl" will fail.
:::
:::warning Calling .item() Too Frequently
tensor.item() forces a sync from HPU to CPU, which flushes the lazy execution graph and adds overhead. In a tight training loop, calling .item() every step can easily double training time. Log only every N steps:
if step % 100 == 0:
loss_val = loss.item() # Only sync every 100 steps
print(f"Loss: {loss_val:.4f}")
:::
:::warning Benchmark on Your Actual Workload
Intel's benchmark numbers are for specific LLaMA configurations with full optimization passes. Your custom model may run significantly slower if it has unusual layer types or non-standard attention patterns that land on TPC rather than MME. Always benchmark before committing to hardware procurement.
:::
Interview Questions and Answers
Q1: How does Gaudi's built-in RoCE networking differ from NVIDIA's NVLink + InfiniBand approach, and what are the TCO implications?
NVIDIA uses two separate networking fabrics: NVLink for intra-server GPU communication (900 GB/s bidirectional on H100) and InfiniBand (HDR 200 Gb/s or NDR 400 Gb/s per port) for inter-server communication. InfiniBand requires specialized switches - Mellanox/NVIDIA QM9700 or similar - costing roughly $500-2,000 per port. A fat-tree topology for 512 GPUs needs multiple switch tiers.
Gaudi 2 puts 24 x 100GbE RoCE v2 ports directly on the chip. RoCE gives RDMA semantics (kernel bypass, zero-copy) over standard Ethernet infrastructure. Scale-out between servers uses commodity 100GbE Ethernet switches at 10-20x lower cost per port than InfiniBand. For a 512-card cluster, the networking savings can reach $500,000-1,000,000. This is a significant TCO factor when comparing Gaudi to H100 at scale.
Q2: What software changes are required to migrate a standard PyTorch training job from NVIDIA GPUs to Gaudi?
Minimal migration (Hugging Face Transformers, no custom ops):
- Add import habana_frameworks.torch.core as htcore at the top
- Change device = torch.device("cuda") to device = torch.device("hpu")
- Change dist.init_process_group(backend="nccl") to backend="hccl"
- Add htcore.mark_step() at the end of each training step
- Replace torch.optim.AdamW with FusedAdamW from habana_frameworks
- Change torch.autocast(device_type="cuda") to device_type="hpu"
Non-trivial migration (custom ops, Triton kernels, CUDA libraries):
- Custom CUDA extensions need TPC kernel equivalents (C-like TPC-C language)
- Triton kernels have no direct equivalent - must rewrite as TPC kernels or eliminate
- FlashAttention-2 CUDA import does not work - use Intel's optimized attention via habana_frameworks
- bitsandbytes quantization is CUDA-only - not all of its ops have HPU equivalents
Q3: Explain Gaudi's lazy execution model and why it exists. What bugs does it introduce that do not exist in CUDA eager mode?
In lazy execution, PyTorch operations are recorded into a computation graph rather than immediately dispatched. The graph is compiled by the SynapseAI compiler and dispatched to the chip when htcore.mark_step() is called or when a CPU-side value is needed. This exists because the SynapseAI compiler can perform cross-operation fusions and memory layout optimizations that are impossible in op-by-op eager execution.
Bugs it introduces:
- Missing mark_step(): graph never flushes, training appears to hang or runs extremely slowly when eventually forced to flush an enormous graph
- Side effects inside graph: if you modify a Python variable inside a lazy region and read it before mark_step(), you get the pre-modification value. This creates subtle correctness bugs in custom training loops.
- OOM at dispatch time, not allocation time: memory errors appear when the graph executes, not when tensors are allocated, making the stack trace harder to interpret.
- Non-deterministic dispatch boundaries: anything that reads a tensor to CPU (logging, assertions) implicitly calls mark_step(). This makes profiling harder.
Q4: What is the TPC core and how would you write a custom operation for it? When would you need to do this?
A TPC (Tensor Processor Core) is a programmable VLIW SIMD processor optimized for element-wise and reduction operations. Each TPC core has SIMD lanes, local memory, and a VLIW instruction set. You write TPC kernels in TPC-C, a C-like language that compiles to TPC instruction sets.
You would write a custom TPC kernel when:
- A PyTorch op you need does not have an HPU implementation (check habana_frameworks.torch.utils.internal.is_op_implemented)
- You have a custom fused operation (e.g., fused LayerNorm + GeLU) and want to avoid the overhead of two separate dispatches
- You need INT8 quantized ops that the standard stack does not cover
The alternative to writing a TPC kernel is to fall back to CPU: move the tensor to CPU, run the operation, move back to HPU. This works but has PCIe transfer overhead and loses the performance benefit.
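The CPU-fallback path looks like this in practice. cpu_fallback is a helper written for this example, and torch.special.i0 merely stands in for whatever op lacks an HPU kernel in your stack:
import torch
import habana_frameworks.torch.core as htcore  # registers the hpu device
def cpu_fallback(fn, *tensors, device="hpu"):
    """Run an op that has no HPU kernel on the CPU, then move the result back.
    Correct but slow: each call pays two PCIe transfers and forces a graph flush."""
    cpu_args = [t.to("cpu") for t in tensors]
    result = fn(*cpu_args)
    return result.to(device)
# torch.special.i0 stands in for an op that is missing an HPU kernel.
x = torch.randn(1024, device="hpu")
y = cpu_fallback(torch.special.i0, x)
print(y.shape)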
Q5: How would you design a cost analysis to decide between Gaudi 2 and H100 for a 70B LLM fine-tuning project?
The analysis has four components:
- Hardware cost: Gaudi 2 runs roughly $10,000/card versus roughly $30,000/card for H100. If you need N cards, Gaudi saves on the order of 2 * N * $10,000.
- Networking cost: Calculate the InfiniBand fabric cost for the H100 cluster (switches + cables) and the 100GbE Ethernet fabric cost for the Gaudi cluster. The difference is the Gaudi networking savings.
- Training time cost: Gaudi 2 delivers roughly 65-75% of H100 throughput for 70B LLM training. If H100 finishes in T hours, Gaudi takes roughly T / 0.70 hours. Additional cloud/power cost = (1/0.70 - 1) * T * cost_per_hour.
- Migration and maintenance cost: Audit the codebase for CUDA dependencies, estimate engineering hours to migrate, and add ongoing maintenance cost for HPU-specific issues. This is often the decisive factor and is frequently underestimated.
Total: if (hardware savings + networking savings) > (extra training time cost + migration cost), Gaudi wins. For large-scale training with clean standard code, Gaudi 2 typically wins on TCO. For research codebases with heavy CUDA customization, the migration cost dominates.
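The decision rule is easy to encode. Every number in the example call below is a placeholder to replace with your own quotes, benchmarks, and audit estimates:
def gaudi_vs_h100_tco(n_cards, h100_card_cost, gaudi_card_cost,
                      h100_fabric_cost, gaudi_fabric_cost,
                      h100_train_hours, gaudi_relative_throughput,
                      cluster_cost_per_hour, migration_engineer_hours,
                      engineer_cost_per_hour=150.0):
    """Net savings from choosing Gaudi (positive favors Gaudi) under the four-component model above."""
    hardware_savings = n_cards * (h100_card_cost - gaudi_card_cost)
    networking_savings = h100_fabric_cost - gaudi_fabric_cost
    extra_hours = h100_train_hours * (1.0 / gaudi_relative_throughput - 1.0)
    extra_time_cost = extra_hours * cluster_cost_per_hour
    migration_cost = migration_engineer_hours * engineer_cost_per_hour
    return hardware_savings + networking_savings - extra_time_cost - migration_cost
# Placeholder inputs roughly consistent with the figures used earlier in this lesson:
net = gaudi_vs_h100_tco(n_cards=64, h100_card_cost=30_000, gaudi_card_cost=10_000,
                        h100_fabric_cost=600_000, gaudi_fabric_cost=60_000,
                        h100_train_hours=400, gaudi_relative_throughput=0.70,
                        cluster_cost_per_hour=500, migration_engineer_hours=200)
print(f"Net savings from choosing Gaudi: ${net:,.0f}")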
Q6: What is HCCL and how does collective communication work across a multi-server Gaudi cluster?
HCCL (Habana Collective Communications Library) is Gaudi's equivalent of NCCL. It implements collective operations (all-reduce, all-gather, reduce-scatter, broadcast) over the RoCE fabric.
At the intra-server level, HCCL uses the internal RoCE ports that wire the 8 Gaudi cards into an all-to-all mesh (21 of each card's 24 ports, three links per peer), so intra-server collectives never leave the box.
At the inter-server level, HCCL uses the externally exposed 100GbE ports (3 per card, 24 per server) through standard Ethernet switches. HCCL implements ring all-reduce across servers, using RoCE RDMA semantics for low-latency kernel-bypass communication.
For ZeRO-3 with 64 Gaudi 2 cards (8 servers), HCCL handles the all-gather for parameters and reduce-scatter for gradients transparently. DeepSpeed's HCCL integration calls the same API surface as NCCL, so ZeRO-3 configuration is essentially identical.
Profiling Gaudi Training Jobs
Understanding where time is spent in a Gaudi training job requires tools that differ from the NVIDIA ecosystem. Intel provides a profiler integrated with SynapseAI.
SynapseAI Profiler
import habana_frameworks.torch.core as htcore
from habana_frameworks.torch.utils.profiler import ProfilerActivity
# Enable profiling for a specific step range
with torch.profiler.profile(
activities=[
ProfilerActivity.HPU, # HPU kernel execution
ProfilerActivity.CPU, # CPU host ops
],
schedule=torch.profiler.schedule(
wait=5, # Skip first 5 steps (compilation overhead)
warmup=2, # Warmup 2 steps
active=3, # Profile 3 steps
),
on_trace_ready=torch.profiler.tensorboard_trace_handler("./profiler_logs"),
record_shapes=True,
with_stack=True,
) as profiler:
for step, batch in enumerate(dataloader):
inputs = batch.to("hpu")
outputs = model(inputs)
loss = loss_fn(outputs)
loss.backward()
optimizer.step()
htcore.mark_step()
profiler.step() # Advance profiler schedule
Key metrics to look for in profiler output:
- MME utilization: should be high (70%+) during GEMM-heavy steps. Low MME utilization means either TPC operations are dominating or the memory bandwidth is the bottleneck.
- TPC utilization: high TPC utilization for attention, normalization, and activation functions is expected.
- Sync time: time spent in host-device synchronization (mark_step() calls). If this is large, you are syncing too frequently.
- Compilation time: first few steps are slow due to JIT compilation of the lazy graph. Profile after warmup to avoid measuring compilation.
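Once the profiled steps finish, the standard torch.profiler summary is the fastest way to see the hottest ops; the trace written to ./profiler_logs can also be opened in TensorBoard. This snippet continues the profiling block above:
# After the profiled steps complete (e.g., right after the `with` block above),
# print a summary of the hottest operations from the recorded steps.
print(profiler.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))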
Identifying Bottlenecks with hl-smi
# Real-time GPU utilization monitoring
watch -n 0.5 'hl-smi --query-aip=index,utilization.aip,utilization.memory,\
memory.used,memory.total,power.draw --format=csv,noheader'
# Sample steady-state output during training:
# 0, 94%, 78%, 91200 MiB, 98304 MiB, 598 W
# 1, 92%, 81%, 91200 MiB, 98304 MiB, 602 W
# ...
#
# 94% utilization: good. Means AIP is active 94% of the time.
# 78% memory bandwidth utilization: moderate. Suggests compute-bound phases.
# 598W / 600W TDP: near-max power draw during GEMM phases.
Gaudi in Production: Infrastructure Considerations
Moving from NVIDIA to Gaudi in production is not just a software migration - it has infrastructure consequences.
Container Support
Intel maintains official Docker images for Gaudi with all drivers and frameworks pre-installed:
# Pull the official Intel Gaudi PyTorch image
docker pull vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.0:latest
# Run a training container
docker run \
--runtime=habana \
--env HABANA_VISIBLE_DEVICES=all \
--env OMPI_MCA_btl_vader_single_copy_mechanism=none \
--cap-add=sys_nice \
--net=host \
--ipc=host \
-v /path/to/training:/workspace \
vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.0:latest \
python /workspace/train.py
Key Docker flags:
- --runtime=habana: loads the Habana container runtime
- HABANA_VISIBLE_DEVICES=all: expose all Gaudi cards to the container
- --cap-add=sys_nice: required for HCCL's real-time scheduling
- --net=host --ipc=host: required for multi-card RoCE communication
Kubernetes Deployment
Intel provides a Gaudi device plugin for Kubernetes:
# Pod spec requesting 8 Gaudi 2 cards
apiVersion: v1
kind: Pod
metadata:
name: gaudi-training-job
spec:
containers:
- name: trainer
image: vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.0:latest
resources:
limits:
habana.ai/gaudi: 8 # Request 8 Gaudi cards
memory: "256Gi"
cpu: "64"
command: ["python", "/workspace/train.py"]
env:
- name: HABANA_VISIBLE_DEVICES
value: "all"
volumeMounts:
- name: workspace
mountPath: /workspace
volumes:
- name: workspace
persistentVolumeClaim:
claimName: training-workspace
Model Checkpointing
Checkpointing on Gaudi works identically to CUDA - PyTorch's state_dict mechanism is hardware-agnostic:
import torch
import habana_frameworks.torch.core as htcore
def save_checkpoint(model, optimizer, step, loss, path):
"""Save training checkpoint - identical to CUDA approach."""
# Move state dict to CPU before saving (avoids HPU-specific format)
cpu_state = {k: v.cpu() for k, v in model.state_dict().items()}
torch.save({
"step": step,
"model_state_dict": cpu_state,
"optimizer_state_dict": optimizer.state_dict(),
"loss": loss,
}, path)
print(f"Checkpoint saved at step {step}")
def load_checkpoint(model, optimizer, path, device):
"""Load checkpoint - works for both HPU and CUDA."""
checkpoint = torch.load(path, map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
model.to(device)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
return checkpoint["step"], checkpoint["loss"]
Summary
Intel Gaudi emerged from the 2019 Habana Labs acquisition as Intel's response to NVIDIA's dominance in AI training. The architecture is built around two compute engines - the programmable TPC cores for element-wise ops and the fixed-function MME for matrix multiply - backed by 96GB HBM2e and 24 on-chip 100GbE RoCE ports that eliminate the need for expensive InfiniBand networking.
Gaudi 2 delivers roughly 65-75% of H100 training throughput at 33% of the price. Gaudi 3 narrows the gap to 85-95% at 60-70% of H100 pricing. The networking savings for large clusters (no InfiniBand required) make the TCO case stronger at scale.
The blocker for many teams is the CUDA ecosystem gap. Custom CUDA ops, Triton kernels, FlashAttention 2, bitsandbytes, vLLM - none of these run on HPU. For teams running standard Hugging Face training pipelines, migration is measured in days. For research codebases with custom CUDA extensions, migration is measured in weeks or months. That migration cost is the decisive factor in most Gaudi adoption decisions.
The software stack (SynapseAI, lazy execution, HPU-fused optimizers, HCCL) is mature enough for production use in 2024. The ecosystem will continue to close the gap with CUDA, but CUDA's 17-year head start means genuine parity is still years away.
