Ampere, Hopper, and Ada Architectures
Reading time: ~35 min · Interview relevance: High · Target roles: ML Engineer, AI Infrastructure Engineer, ML Platform Engineer
It is 3:47 AM. Your training run for a 70B parameter model just hit hour 18 of what was supposed to be a 40-hour job. The loss curve looks fine. The GPU utilization dashboard, however, tells a different story: 34% MFU (Model FLOP Utilization) on a cluster of A100s, with 47 hours remaining. Your manager already forwarded the cloud bill estimate for this month.
You switch to a Slack thread where a colleague at another company ran the same architecture on H100s two weeks ago. Their MFU was 51%, their job finished in 22 hours, and they paid less per useful FLOP because the H100 nodes, while more expensive per hour, finished so much faster. The gap was not random. It came down to one feature: the Transformer Engine in Hopper, which dynamically selects FP8 precision for matrix multiplications inside attention and FFN layers - the exact operations that dominate your training time.
That conversation is what GPU architecture knowledge is actually for. Not for memorizing spec sheets. Not for knowing that the H100 SXM5 has 80 billion transistors or that its die area is 814 mm². It is for knowing which hardware capability maps to which workload characteristic, so that when you are choosing between GPU generations - or justifying a hardware budget to leadership - you can make the argument from first principles rather than vendor marketing slides.
The three GPU generations covered in this lesson - Ampere (2020), Hopper (2022), and Ada Lovelace (2022) - represent different points in NVIDIA's strategy. Ampere was the foundational generation that established the patterns: high-bandwidth memory, third-generation Tensor Cores, structured sparsity support, and the Multi-Instance GPU feature for multi-tenancy. Hopper took those patterns and rebuilt them specifically around the transformer architecture that was taking over AI workloads. Ada was the consumer branch, optimizing for gaming and prosumer workloads while bringing many compute improvements to the RTX 40 series.
Understanding all three - and their specific numerical differences - means you can reason from architecture to workload to cost. That reasoning is what separates the engineers who pick the right hardware from the ones who discover the mismatch six months into a contract.
Why This Exists
Before Ampere, the standard GPU precision for training was FP32. It was safe. It was numerically stable. It was also wasteful: FP32 carries 23 bits of mantissa precision, and for most neural network training scenarios, you do not need that much. The activations, gradients, and weights during a training step are well-conditioned enough that a smaller format works without loss.
Mixed precision training with FP16 was established by 2017-2018 through work by Micikevicius et al. at NVIDIA and Baidu, later formalized in the torch.cuda.amp automatic mixed precision API. The V100 Tensor Cores in Volta accelerated FP16 matrix multiplications to 125 TFLOPS vs 15.7 TFLOPS for FP32. But FP16 had a stability problem: its exponent field is only 5 bits, giving limited dynamic range, so large gradient values during early training or unstable loss landscapes would overflow to inf or NaN. Loss scaling was required, and getting it right was still an engineering burden.
Ampere's answer was TF32: a format with the exponent range of FP32 (8 bits) and the mantissa precision of FP16 (10 bits), running at 8x the speed of FP32 on the new Tensor Cores. TF32 requires no changes to model code - you flip the allow_tf32 backend flags and PyTorch routes eligible matrix multiplications through it. This single change made the A100 dramatically faster than the V100 for training workloads without asking engineers to change their training loops.
Then models got bigger. GPT-3 in 2020. PaLM in 2022. The compute requirements scaled quadratically with sequence length and linearly with parameter count. FP16 was fast but you were already at half precision. To go further, you needed to push to FP8 - but FP8 had even tighter dynamic range than FP16, making it fragile to apply uniformly across all operations. The Hopper Transformer Engine solved this by making precision selection dynamic: each matrix multiplication is analyzed per-layer, and the hardware chooses between FP8 E4M3 (high precision), FP8 E5M2 (high range), and BF16 depending on the tensor statistics. The result was another 2x in effective throughput for transformer workloads, with the automatic calibration hiding the complexity from engineers.
Ada Lovelace followed a different path. Rather than pushing the boundaries of data center compute, Ada was designed for the consumer and prosumer market. DLSS 3 frame generation, AV1 hardware encoding and decoding, second-generation RT cores for ray tracing - these are gaming features. But Ada also brought fourth-generation Tensor Cores and 24GB GDDR6X in the RTX 4090, making it genuinely interesting for local inference of quantized 7B-13B models, fine-tuning on small datasets, and research experiments that do not require HBM bandwidth.
Historical Context - From Volta to Hopper
The modern AI GPU lineage starts with Volta (2017). The V100 introduced the first Tensor Core - a specialized matrix unit executing D = A × B + C, where A and B are FP16 matrices and C and D are FP32 accumulators. One Tensor Core performed a 4x4x4 matrix multiply-accumulate per clock cycle, yielding 64 FP16 FMA operations. The V100 had 640 Tensor Cores, giving it 125 TFLOPS FP16 peak at 1.53 GHz. This was the first time matrix operations were elevated to first-class hardware instructions rather than built from individual floating-point units.
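To make those semantics concrete, here is a toy PyTorch emulation of a single Tensor Core multiply-accumulate - illustrative only, since a real Tensor Core executes this as one hardware instruction:
import torch
# Toy emulation of one Volta Tensor Core MMA: D = A x B + C,
# FP16 inputs with FP32 accumulation
A = torch.randn(4, 4, dtype=torch.float16)
B = torch.randn(4, 4, dtype=torch.float16)
C = torch.zeros(4, 4, dtype=torch.float32)
# Multiply from FP16 inputs, accumulate in FP32 - the wider accumulator
# is what keeps long dot products numerically stable
D = A.float() @ B.float() + C
print(D.dtype)  # torch.float32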
Turing (2018) introduced integer Tensor Cores (INT8, INT4) for inference workloads, along with the first RT cores for ray tracing. Its impact on AI was primarily in inference deployments where INT8 quantization was already mature.
Ampere (2020) was the step change. The A100 went to 3rd generation Tensor Cores with multiple precision modes, introduced TF32 as a new format, added 2:4 structured sparsity acceleration, doubled HBM to 80GB in the maximum configuration, and added NVLink 3.0 at 600 GB/s aggregate between GPUs. The A100 became the workhorse of ML infrastructure for two years.
Hopper (2022) was rebuilt around transformers. The H100 added the Transformer Engine, 4th generation Tensor Cores with FP8 support, NVSwitch 3.0 for NVLink 4.0 at 900 GB/s aggregate, HBM3 at 3.35 TB/s, and confidential computing capabilities. DPX instructions were also added for dynamic programming algorithms - an unusual addition that reflects NVIDIA's interest in genomics and drug discovery workloads.
Ada Lovelace (2022) runs parallel to Hopper as the consumer branch. Same generation, different target. The RTX 4090 uses GDDR6X instead of HBM, carries 24GB vs 80GB, lacks NVLink, and is designed around gaming and creative workflows. Its Tensor Core capabilities are real but its memory bandwidth (1 TB/s vs 3.35 TB/s for H100) limits it on memory-bound LLM inference.
Timeline:
Volta (2017) - V100 - 1st gen Tensor Cores, FP16/FP32, 16GB HBM2
Turing (2018) - T4/RTX - 2nd gen Tensor Cores, INT8/INT4, RT cores
Ampere (2020) - A100 - 3rd gen Tensor Cores, TF32, MIG, sparsity
Hopper (2022) - H100 - 4th gen Tensor Cores, FP8, Transformer Engine
Ada (2022) - RTX 40 - 4th gen Tensor Cores (consumer), DLSS3, AV1
Blackwell (2024) - B100 - 5th gen Tensor Cores, FP4, NVL72
Ampere Architecture - A100 Deep Dive
Physical Specs and What They Mean
The A100 is manufactured on TSMC 7nm and contains 54.2 billion transistors in an 826 mm² die. That die area matters because it determines how many compute and memory units fit, which determines peak throughput. The SXM4 variant (the data center version on a high-bandwidth interconnect module) runs at a 400W TDP. The PCIe variant runs at 300W with slightly lower clocks.
Key specifications for the A100 SXM4 80GB:
| Specification | Value |
|---|---|
| Architecture | Ampere (GA100) |
| Process Node | TSMC 7nm |
| Transistors | 54.2 billion |
| CUDA Cores | 6912 |
| Tensor Cores (3rd gen) | 432 |
| FP32 peak | 19.5 TFLOPS |
| TF32 peak | 156 TFLOPS |
| BF16 peak (Tensor Core) | 312 TFLOPS |
| FP16 peak (Tensor Core) | 312 TFLOPS |
| INT8 peak (Tensor Core) | 624 TOPS |
| HBM2e capacity | 80 GB |
| HBM2e bandwidth | 2.0 TB/s |
| NVLink 3.0 bandwidth | 600 GB/s (bidirectional) |
| TDP (SXM4) | 400W |
Notice the gap between FP32 (19.5 TFLOPS) and TF32 (156 TFLOPS): 8x. Between TF32 and BF16 Tensor Core: 2x. This is the hardware argument for using Tensor Cores over regular CUDA cores for every matrix multiplication you possibly can.
TF32: The Zero-Code Speedup
TF32 is not a standard IEEE format. NVIDIA invented it specifically for the A100 to bridge the gap between FP32 training stability and FP16 Tensor Core speed.
FP32 breakdown: 1 sign bit + 8 exponent bits + 23 mantissa bits = 32 bits total
FP16 breakdown: 1 sign bit + 5 exponent bits + 10 mantissa bits = 16 bits total
TF32 breakdown: 1 sign bit + 8 exponent bits + 10 mantissa bits = 19 bits total
TF32 takes the dynamic range of FP32 (exponent range covers approximately 1.2e-38 to 3.4e+38) and combines it with the mantissa precision of FP16. The result is a format that is numerically stable for training - it does not overflow during the gradient spikes that break pure FP16 training - but fits through Tensor Core pipelines at high speed.
The math: on the A100, a standard FP32 GEMM runs at ~19.5 TFLOPS. A TF32 Tensor Core GEMM runs at ~156 TFLOPS. That is 8x. For most well-conditioned neural network training, the reduction in mantissa bits from 23 to 10 does not change final model quality. NVIDIA's internal benchmarks showed less than 0.1% degradation in accuracy across ImageNet and language modeling tasks.
Enabling TF32 requires two lines:
import torch
# Enable TF32 for matrix multiplications
torch.backends.cuda.matmul.allow_tf32 = True
# Enable TF32 for cuDNN convolutions
torch.backends.cudnn.allow_tf32 = True
That is all. PyTorch then routes eligible matrix multiplications through TF32 Tensor Core paths automatically. No changes to model code, no changes to loss functions, no mixed precision loss scaling required.
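You can verify the speedup yourself with a rough timing sketch (the commented figures are roughly what an A100 measures; exact numbers vary with clocks and matrix shape):
import time
import torch

def bench_matmul(n: int = 8192, iters: int = 20) -> float:
    """Return measured TFLOPS for an n x n float32 matmul."""
    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")
    for _ in range(3):  # warmup
        _ = a @ b
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize()
    return 2 * n**3 * iters / (time.perf_counter() - t0) / 1e12

torch.backends.cuda.matmul.allow_tf32 = False
print(f"FP32 path: {bench_matmul():.1f} TFLOPS")  # ~15-19 on A100
torch.backends.cuda.matmul.allow_tf32 = True
print(f"TF32 path: {bench_matmul():.1f} TFLOPS")  # ~100-140 on A100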
:::tip When to Enable TF32 Enable TF32 by default on any A100, H100, or newer GPU for training. The accuracy impact is negligible for standard architectures (ResNets, transformers, MLPs). The speedup is consistent - typically 3-5x real-world improvement on transformer training compared to FP32, because not every operation is a matrix multiply. Disable it only if you observe unexplained divergence in a highly sensitive training setup. :::
Third-Generation Tensor Cores and Sparsity
The A100's Tensor Cores support a new capability absent from Volta and Turing: 2:4 structured sparsity. The idea is to exploit the empirical observation that trained neural networks often have weight matrices where roughly half the weights are near-zero and removable without significant accuracy loss.
2:4 structured sparsity means: for every 4 consecutive values in a weight matrix, at least 2 must be exactly zero. NVIDIA's sparse Tensor Cores consume a hardware-accelerated compressed representation in which only the non-zero values and their 2-bit position indices are stored, and process it at 2x throughput.
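A toy sketch of the compression idea (the real hardware metadata layout differs, but the information stored is the same - two values plus two 2-bit positions per group of four):
import torch

def compress_2_4(row: torch.Tensor):
    """Keep the 2 largest-magnitude values per group of 4, plus their
    2-bit positions. Illustrative only - not the hardware format."""
    vals, idxs = [], []
    for group in row.view(-1, 4):
        keep = group.abs().topk(2).indices.sort().values
        vals.append(group[keep])
        idxs.append(keep.to(torch.uint8))
    return torch.cat(vals), torch.cat(idxs)

row = torch.tensor([0.9, -0.1, 0.0, 1.2, 0.3, 0.02, -0.7, 0.0])
values, positions = compress_2_4(row)
print(values)     # tensor([ 0.9000,  1.2000,  0.3000, -0.7000])
print(positions)  # tensor([0, 3, 0, 2], dtype=torch.uint8)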
Sparsity support in PyTorch:
import torch
from torch.ao.pruning import WeightNormSparsifier
# Define model
model = torch.nn.Sequential(
torch.nn.Linear(4096, 4096),
torch.nn.ReLU(),
torch.nn.Linear(4096, 4096),
)
# Create sparsifier targeting all Linear layers
sparsifier = WeightNormSparsifier(
sparsity_level=0.5, # target 50% sparsity
sparse_block_shape=(1, 4), # 2:4 pattern: 1 row, 4 columns per block
zeros_per_block=2 # 2 zeros per block of 4 = 2:4 structured sparsity
)
# Attach sparsifier to model
sparsifier.prepare(model, config=[
{"tensor_fqn": "0.weight"},
{"tensor_fqn": "2.weight"},
])
# WeightNormSparsifier ranks weights by magnitude, so calibration data is
# not strictly required; a forward pass here just sanity-checks the model
with torch.no_grad():
    dummy_input = torch.randn(32, 4096)
    _ = model(dummy_input)
# Apply sparsity - zeroes out least-significant weights in 2:4 pattern
sparsifier.step()
# Zero out the masked weights and remove the mask parametrization
sparsifier.squash_mask()
# model[0].weight is now dense with a 2:4 zero pattern. For the hardware-
# accelerated path on A100/H100 (~2x GEMM throughput), convert to PyTorch's
# semi-structured sparse format (requires PyTorch >= 2.1, CUDA, FP16/BF16)
from torch.sparse import to_sparse_semi_structured
model = model.half().cuda()
model[0].weight = torch.nn.Parameter(to_sparse_semi_structured(model[0].weight))
Which layers benefit from sparsity:
- Linear layers in FFN blocks: yes, strong candidates
- Attention projection layers (Q, K, V, O): yes, moderate candidates
- Embedding layers: generally no - embeddings are already sparse by nature and irregular access patterns negate the speedup
- Normalization layers: not applicable (no large weight matrices)
- Small layers (< 1024 dimensions): overhead may exceed benefit
:::warning Sparsity Accuracy Degradation 2:4 structured sparsity typically requires magnitude pruning followed by a fine-tuning phase to recover accuracy. Applying sparsity to a trained model without fine-tuning causes 2-5% accuracy degradation on typical language modeling tasks. The correct workflow is: train dense to convergence, apply 2:4 sparsity, fine-tune for 10-20% of original training steps. NVIDIA's ASP (Automatic Sparsity) library provides tooling for this workflow. :::
Multi-Instance GPU (MIG)
Before MIG, you had one option when running multiple workloads on a single GPU: time-sharing. One job ran, then another. GPU resources were not partitioned - workloads competed for SMs and memory bandwidth through the scheduler.
MIG changes this for the A100 and H100. It physically partitions the GPU silicon into isolated instances, each with:
- Dedicated streaming multiprocessors (SMs)
- Dedicated L2 cache slices
- Dedicated HBM memory regions
- Hardware-enforced isolation: one instance cannot access another's memory
The A100 80GB supports up to 7 GPU instances (GIs) in the maximum partition configuration. Each 1g.10gb instance gets 1/7 of the SMs (14 SMs out of 108), 10GB of dedicated HBM2e, and isolated memory bandwidth.
Supported partitions on A100:
| Instance Type | SMs | Memory | Max Count |
|---|---|---|---|
| 1g.10gb | 14 | 10 GB | 7 |
| 2g.20gb | 28 | 20 GB | 3 |
| 3g.40gb | 42 | 40 GB | 2 |
| 4g.40gb | 56 | 40 GB | 1 |
| 7g.80gb | 108 | 80 GB | 1 (full GPU) |
MIG configuration via nvidia-smi:
# Enable MIG mode (requires root; persists across reboots, takes effect after a GPU reset)
sudo nvidia-smi -i 0 -mig 1
# Create 7 x 1g.10gb instances
sudo nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb
# Create compute instances within each GPU instance
sudo nvidia-smi mig -cci
# List all MIG instances
nvidia-smi mig -lgi
# Each instance appears as a separate GPU to CUDA applications.
# Target a specific instance via its MIG UUID (listed by nvidia-smi -L)
CUDA_VISIBLE_DEVICES=MIG-GPU-abc123/0/0 python inference.py
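Programmatic discovery of MIG instances is also possible. A minimal sketch using the nvidia-ml-py (pynvml) bindings, assuming a MIG-enabled GPU at index 0:
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
current, pending = pynvml.nvmlDeviceGetMigMode(gpu)
print(f"MIG mode: current={current}, pending={pending}")
# Enumerate the MIG devices exposed by this physical GPU
for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
    try:
        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, i)
    except pynvml.NVMLError:
        continue  # slot not populated
    mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
    print(f"MIG {i}: {pynvml.nvmlDeviceGetUUID(mig)}, {mem.total / 1e9:.0f} GB")
pynvml.nvmlShutdown()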
When to use MIG:
- Inference serving: multiple small models served simultaneously on one A100, each isolated
- Multi-tenant cloud environments: different customers get dedicated GPU instances with hardware isolation and no cross-tenant memory leakage
- Development clusters: multiple researchers share a single A100 without interference
- Batch inference with varying model sizes: partition appropriately per model
When not to use MIG:
- Single large training job: MIG partitions limit you to a fraction of the GPU's memory and SMs, slowing down large training
- Model parallelism: tensor parallel and pipeline parallel strategies require unrestricted intra-GPU bandwidth that MIG breaks
- Dynamic workloads with unpredictable memory needs: MIG memory regions are fixed at partition creation time
NVLink 3.0 and Multi-GPU Communication
Within an 8-GPU DGX A100 system, the GPUs are connected by NVLink 3.0. Each GPU has 12 NVLink 3.0 links, each providing 50 GB/s bidirectional bandwidth. Total aggregate NVLink bandwidth per GPU: 600 GB/s bidirectional.
For comparison:
- PCIe 4.0 x16: 64 GB/s bidirectional (32 GB/s per direction)
- NVLink 3.0 (A100): 600 GB/s bidirectional
- NVLink 4.0 (H100): 900 GB/s bidirectional
The roughly 9x bidirectional-bandwidth advantage of NVLink over PCIe is what makes GPU-to-GPU AllReduce fast enough to scale to 8 GPUs efficiently. For a 70B model's gradient synchronization, you are moving hundreds of GB of gradient data per step. At PCIe speeds, this would be the bottleneck. At NVLink speeds, communication can partially overlap with computation.
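A rough bandwidth-only model makes the gap tangible (this ignores latency, ring segmentation overhead, and compute overlap - a sketch, not a benchmark):
def allreduce_seconds(param_bytes: float, n_gpus: int, bw_gb_s: float) -> float:
    """Ring AllReduce lower bound: each GPU sends and receives
    2 * (N - 1) / N of the buffer."""
    traffic = 2 * (n_gpus - 1) / n_gpus * param_bytes
    return traffic / (bw_gb_s * 1e9)

grad_bytes = 70e9 * 2  # 70B parameters, BF16 gradients
print(f"PCIe 4.0 (~32 GB/s/dir): {allreduce_seconds(grad_bytes, 8, 32):.1f} s/step")
print(f"NVLink 3.0 (600 GB/s):   {allreduce_seconds(grad_bytes, 8, 600):.2f} s/step")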
Hopper Architecture - H100 Deep Dive
The Transformer Engine
The Transformer Engine is the central architectural innovation in Hopper. Understanding it requires understanding what problem it solves.
FP8 has two variants:
- E4M3: 4 exponent bits, 3 mantissa bits. Smaller dynamic range (roughly ±448), higher precision within that range. Best for forward pass activations and weights.
- E5M2: 5 exponent bits, 2 mantissa bits. Larger dynamic range, lower precision. Best for gradients, which can vary more widely in magnitude.
Using FP8 naively is fragile. An FP8 tensor with the wrong scaling factor will saturate (all values clamp to max) or underflow (all values round to zero), producing garbage results. Calibrating FP8 scaling factors manually for every tensor in every layer of a large model is impractical.
The Transformer Engine solves this with dynamic per-tensor scaling. Before each GEMM, the engine:
- Analyzes the magnitude statistics of the input tensors (specifically, the maximum absolute value)
- Selects a scale factor that maps the distribution into the FP8 representable range
- Executes the GEMM in FP8
- Rescales the output back to BF16 for accumulation
This happens at the hardware level, per-layer, per-step, automatically. The engineer sees BF16 tensors at the Python level. The hardware executes in FP8 where safe and falls back to BF16 where not.
H100 specification comparison for training:
| Format | H100 SXM5 Peak TFLOPS | A100 SXM4 Peak TFLOPS | Speedup |
|---|---|---|---|
| FP32 | 67 | 19.5 | 3.4x |
| TF32 | 989 | 156 | 6.3x |
| BF16 | 1979 | 312 | 6.3x |
| FP16 | 1979 | 312 | 6.3x |
| FP8 | 3958 | N/A | - |
| INT8 | 3958 | 624 | 6.3x |
The FP8 peak of 3958 TFLOPS on H100 is roughly 2x the BF16 peak. (Note that NVIDIA quotes H100 peaks with 2:4 structured sparsity; dense peaks are half these values, so the dense-to-dense H100/A100 ratio is closer to 3.2x than 6.3x.) In practice, LLM training sees 30-50% of peak TFLOPS for the reasons noted below: memory bandwidth limits on small attention heads, communication overhead during AllReduce, non-GEMM operations (activations, normalization, embedding lookup). But even at 35% utilization, H100 with FP8 running at 1385 effective TFLOPS outperforms A100 at 35% BF16 utilization (109 effective TFLOPS) by over 12x.
Using the Transformer Engine in PyTorch
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling
# Configure FP8 recipe
fp8_recipe = DelayedScaling(
margin=0, # log2 of margin for scaling factor update
interval=1, # update scaling factors every step
fp8_format=Format.HYBRID, # E4M3 for forward, E5M2 for backward gradients
amax_history_len=16, # steps of amax history for robust scaling
amax_compute_algo="max", # use max of amax history
)
# Use transformer_engine layers instead of torch.nn
class TransformerBlock(torch.nn.Module):
def __init__(self, hidden_size, num_heads, ffn_hidden_size):
super().__init__()
self.attention = te.MultiheadAttention(
hidden_size=hidden_size,
num_attention_heads=num_heads,
)
self.layernorm_mlp = te.LayerNormMLP(
hidden_size=hidden_size,
ffn_hidden_size=ffn_hidden_size,
activation="gelu",
)
    def forward(self, x, attn_mask=None):
        # TE's MultiheadAttention is self-attention over hidden states;
        # unlike torch.nn.MultiheadAttention, it takes a single input tensor
        out = self.attention(x, attention_mask=attn_mask)
        # Fused LayerNorm + FFN
        out = self.layernorm_mlp(out)
        return out
model = TransformerBlock(
hidden_size=4096,
num_heads=32,
ffn_hidden_size=16384,
).cuda()
# Training step with FP8 context
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for step, batch in enumerate(dataloader):
x = batch["input"].cuda()
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
output = model(x)
loss = criterion(output, batch["target"].cuda())
loss.backward()
optimizer.step()
optimizer.zero_grad()
The te.fp8_autocast context manager is the boundary. Inside it, eligible GEMMs in TransformerEngine layers execute in FP8. Outside it, everything is BF16. The scaling factor calibration is fully automatic inside the DelayedScaling recipe - you configure the history length and margin, and the runtime handles the rest.
:::tip TransformerEngine vs torch.amp
PyTorch's torch.cuda.amp.autocast(dtype=torch.bfloat16) is the standard mixed precision API. TransformerEngine's te.fp8_autocast is an additional layer on top. You can and should use both: BF16 autocast for all non-GEMM operations, FP8 autocast for GEMMs inside transformer layers. They compose cleanly because TransformerEngine handles the boundary logic.
:::
H100 Physical Specifications
| Specification | H100 SXM5 | A100 SXM4 |
|---|---|---|
| Architecture | Hopper (GH100) | Ampere (GA100) |
| Process Node | TSMC 4nm | TSMC 7nm |
| Transistors | 80 billion | 54.2 billion |
| CUDA Cores | 16896 | 6912 |
| Tensor Cores (gen) | 528 (4th) | 432 (3rd) |
| HBM Generation | HBM3 | HBM2e |
| HBM Capacity | 80 GB | 80 GB |
| HBM Bandwidth | 3.35 TB/s | 2.0 TB/s |
| NVLink Version | 4.0 | 3.0 |
| NVLink Bandwidth | 900 GB/s | 600 GB/s |
| TDP | 700W | 400W |
| Transformer Engine | Yes (FP8) | No |
| MIG Support | Yes (7 instances) | Yes (7 instances) |
| Confidential Compute | Yes | No |
The HBM3 bandwidth jump from 2.0 TB/s to 3.35 TB/s (67% increase) matters most for inference of large models in BF16 or FP16, where you are memory-bandwidth-bound rather than compute-bound. A 70B model in BF16 requires 140GB just for weights - more than a single 80GB H100 holds, so it must be split across two GPUs or quantized - and each token generation step loads the full parameter set from HBM. Higher bandwidth means more tokens per second.
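A bandwidth-ceiling sketch for decode throughput (assumes every generated token streams the full weight set from HBM once, and ignores KV-cache traffic and compute):
def decode_tokens_per_sec(params_billion: float, bytes_per_param: float, hbm_tb_s: float) -> float:
    """Upper bound on autoregressive decode rate when weight streaming
    from HBM is the only cost."""
    return hbm_tb_s * 1e12 / (params_billion * 1e9 * bytes_per_param)

for name, bw in [("A100 (2.0 TB/s)", 2.0), ("H100 (3.35 TB/s)", 3.35)]:
    # 70B parameters in BF16 (2 bytes); sharding across GPUs scales the ceiling
    print(f"{name}: ~{decode_tokens_per_sec(70, 2, bw):.0f} tokens/s ceiling")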
DPX Instructions
Hopper added DPX instructions - a set of specialized instructions that accelerate dynamic programming algorithms. These are relevant for:
- Smith-Waterman genome sequence alignment (bioinformatics)
- Viterbi decoding (HMMs, speech recognition)
- Shortest path algorithms on graphs (logistics, network routing)
DPX instructions provide hardware support for the min/max reduction operations that dominate DP recurrences. For NVIDIA's healthcare and drug discovery market segment, this is meaningful. For most LLM training and inference workloads, DPX is irrelevant.
NVSwitch 3.0 and NVLink 4.0
In multi-GPU configurations, Hopper uses NVSwitch 3.0 as the switch fabric connecting all GPUs via NVLink 4.0. An 8-GPU HGX H100 system provides:
- 18 NVLink 4.0 links per GPU
- 900 GB/s bidirectional bandwidth per GPU
- All-to-all GPU communication without crossing PCIe
The DGX H100 supports 8 GPUs with full NVSwitch connectivity. Beyond 8 GPUs, InfiniBand (HDR or NDR) is used for cross-node communication, and the NVLink network is per-node.
For very large-scale training (1000+ GPUs), the bottleneck is the cross-node bandwidth, not within-node NVLink. A 400 Gb/s NDR InfiniBand link is 50 GB/s - 18x slower than the 900 GB/s NVLink fabric. This is why tensor parallelism (which requires high-bandwidth within-node communication) is used within a node, and pipeline or sequence parallelism (which requires less frequent cross-node communication) is used across nodes.
Ada Lovelace Architecture - RTX 4090 Deep Dive
Positioning: Consumer vs Data Center
Ada Lovelace is NVIDIA's 2022 consumer GPU architecture. The RTX 4090 is the flagship. It shares the same process node as Hopper (TSMC 4nm) and the same generation of Tensor Core (4th generation), but the product decisions reflect an entirely different market.
| Specification | RTX 4090 (Ada) | H100 SXM5 (Hopper) |
|---|---|---|
| Architecture | Ada Lovelace | Hopper |
| CUDA Cores | 16384 | 16896 |
| Tensor Cores | 512 | 528 |
| Memory Type | GDDR6X | HBM3 |
| Memory Capacity | 24 GB | 80 GB |
| Memory Bandwidth | 1.0 TB/s | 3.35 TB/s |
| BF16 Tensor TFLOPS (with sparsity) | ~330 | 1979 |
| FP8 Tensor TFLOPS (with sparsity) | 1321 | 3958 |
| Transformer Engine | No | Yes (FP8 with automatic scaling) |
| NVLink | No | 900 GB/s |
| TDP | 450W | 700W |
| Retail/Cloud Price | ~$1600 retail | ~$3/hr cloud |
The critical differences for ML workloads:
- Memory bandwidth: 1.0 TB/s vs 3.35 TB/s. For memory-bandwidth-bound inference, the RTX 4090 delivers roughly a third of the throughput... except it is dramatically cheaper to buy outright.
- Memory capacity: 24 GB vs 80 GB. The RTX 4090 fits 7B models comfortably in FP16, or 13B models with 4-bit quantization. 70B models require quantization to fit.
- FP8 in hardware only: Ada's fourth-generation Tensor Cores can execute FP8 GEMMs, but without the Transformer Engine's automatic scaling, HBM bandwidth, or NVLink, practical large-model training on Ada remains BF16/FP16-bound.
- No NVLink: Multi-GPU RTX 4090 setups communicate over PCIe 4.0 at 32 GB/s per direction, making tensor parallelism across multiple RTX 4090s very slow.
Where Ada Makes Sense for ML
The RTX 4090 is the best consumer GPU for ML if:
- You are running local inference on quantized 7B-13B models (LLaMA 2/3, Mistral, Phi, etc.)
- You are doing rapid prototyping and experimentation on smaller models
- You want a dedicated research machine and cloud costs are too high for your budget
- You are fine-tuning small models using LoRA/QLoRA on modest datasets
- You want to run stable diffusion, video generation, or multimodal models at home
Using the RTX 4090 for local inference:
# llama.cpp / llama-cpp-python: most memory-efficient option for local inference
from llama_cpp import Llama
# Load an 8B model in 4-bit GGUF format (weights fit in ~5GB VRAM)
llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",
n_gpu_layers=-1, # offload all layers to GPU
n_ctx=4096, # context window
n_batch=512, # batch size for prompt processing
)
response = llm("Explain transformer attention in one paragraph:", max_tokens=200)
print(response["choices"][0]["text"])
# Hugging Face transformers with bitsandbytes 4-bit quantization
# Load 13B model on RTX 4090 24GB
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4", # NormalFloat4 quantization
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-13b-chat-hf",
quantization_config=quantization_config,
device_map="cuda:0",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")
inputs = tokenizer("What is a transformer?", return_tensors="pt").to("cuda")
with torch.no_grad():
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
DLSS 3 Frame Generation and AV1
DLSS 3 (Deep Learning Super Sampling 3) uses Ada's Optical Flow Accelerator to generate entire frames between rendered frames, doubling the apparent frame rate in supported games. This is a consumer gaming feature that has no direct relevance to ML training or inference, but it illustrates NVIDIA's strategy of building neural network inference hardware into consumer GPUs and amortizing the cost across the gaming market.
AV1 hardware encode and decode are similarly consumer-focused. The RTX 4090 can encode AV1 video at high quality and high speed, relevant for streaming, video creation, and potentially video ML pipelines where you need fast preprocessing. For ML model training and inference, AV1 hardware is background noise.
Architecture Comparison for LLM Training
FP8 Precision Formats in Depth
FP8 is not a single format. There are two variants, each with different precision/range tradeoffs:
E4M3 (4 exponent bits, 3 mantissa bits):
- Representable range: approximately -448 to +448
- More mantissa bits = better precision within the range
- Used for: weights and forward-pass activations, which are typically well-scaled
E5M2 (5 exponent bits, 2 mantissa bits):
- Representable range: approximately -57344 to +57344
- More exponent bits = larger range, less precision
- Used for: backward-pass gradients, which can have larger magnitude spikes
The Transformer Engine's HYBRID format uses E4M3 for the forward pass and E5M2 for gradients. This matches the natural distribution of each quantity.
import torch
# Inspect the actual FP8 quantization in TransformerEngine
# (diagnostic example - not standard user-facing API)
# E4M3 max representable value
e4m3_max = 448.0
# E5M2 max representable value
e5m2_max = 57344.0
# Illustrating manual quantization (TransformerEngine does this automatically)
def quantize_fp8_e4m3(tensor: torch.Tensor, scale: float) -> torch.Tensor:
"""
Manual E4M3 quantization for illustration.
TransformerEngine handles this automatically inside fp8_autocast.
"""
scaled = tensor * scale
# Clamp to E4M3 representable range
clamped = torch.clamp(scaled, -e4m3_max, e4m3_max)
# In practice, hardware maps to nearest E4M3 representable value
return clamped
def calibrate_scale(tensor: torch.Tensor, target_max: float = e4m3_max) -> float:
"""Compute scale factor to map tensor into FP8 range."""
amax = tensor.abs().max().item()
if amax == 0:
return 1.0
return target_max / amax
# Example: a weight matrix from a transformer FFN layer
weight = torch.randn(4096, 16384).cuda()
scale = calibrate_scale(weight)
quantized = quantize_fp8_e4m3(weight, scale)
print(f"Weight amax: {weight.abs().max():.4f}")
print(f"Scale factor: {scale:.6f}")
print(f"After scaling amax: {quantized.abs().max():.4f}")
print(f"Max E4M3 representable: {e4m3_max}")
(Diagram omitted: Ampere vs Hopper feature comparison.)
Production Engineering Notes
Choosing Between A100 and H100 for LLM Training
The A100 remains a competitive option in 2024-2025 because H100 availability and pricing are constrained. Here is how to reason about the choice:
Compute-bound workloads (large batch sizes, long sequences, big models): H100 wins clearly. The FP8 Transformer Engine provides 2x throughput over BF16. For a 70B model training job that would take 30 days on A100s, it might take 15 days on H100s.
Memory-bandwidth-bound workloads (small batch inference, autoregressive decoding): H100 wins on bandwidth (3.35 TB/s vs 2.0 TB/s). For single-token generation on a 70B model loaded in BF16 across 2 GPUs, H100 generates roughly 1.67x as many tokens per second.
Cost per FLOP comparison (rough, varies by cloud provider and spot pricing):
- A100 SXM4: ~$2.00/hr on major clouds
- H100 SXM5: ~$3.50/hr on major clouds
- H100 peak BF16 TFLOPS: 6.3x A100 on NVIDIA's sparsity-inclusive peaks (~3.2x comparing dense peaks)
- H100 cost multiplier: 1.75x
- Cost per TFLOP: H100 is roughly 1.8-3.6x cheaper per peak TFLOP at BF16, depending on which peaks you compare
At BF16, H100 is cheaper per computation. With FP8 enabled, it is even more favorable. The only reason to choose A100 today is availability or specialized use cases (older codebases, infrastructure that is already provisioned on A100, cost certainty on long-running contracts).
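A sketch of the cost-per-delivered-FLOP arithmetic, using the assumed prices above, dense BF16 peaks, and an assumed 35% MFU:
def dollars_per_exaflop(price_per_hr: float, peak_tflops: float, mfu: float) -> float:
    """Cost of 1e18 delivered FLOPs given hourly price, peak rating, and MFU."""
    flops_per_hour = peak_tflops * 1e12 * mfu * 3600
    return price_per_hr / (flops_per_hour / 1e18)

print(f"A100 BF16: ${dollars_per_exaflop(2.00, 312, 0.35):.2f} per EFLOP")  # ~$5.09
print(f"H100 BF16: ${dollars_per_exaflop(3.50, 989, 0.35):.2f} per EFLOP")  # ~$2.81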
Maximizing Throughput on A100
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler
# Full mixed precision training setup for A100
# Combines TF32 + BF16 AMP for maximum throughput
# Step 1: Enable TF32 globally
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
# Step 2: Keep parameters (master weights) in FP32 and let autocast run
# eligible ops in BF16. BF16 needs no loss scaling - its exponent range
# matches FP32, unlike FP16.
torch.set_default_dtype(torch.float32)  # explicit: the default is already FP32
model = nn.TransformerEncoder(
nn.TransformerEncoderLayer(
d_model=1024,
nhead=16,
dim_feedforward=4096,
batch_first=True,
),
num_layers=24,
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)
# For BF16: no loss scaler needed (use autocast with bfloat16)
# For FP16: use GradScaler to prevent underflow
use_bf16 = True # A100 and H100 support BF16 natively
for batch in dataloader:
src = batch["input"].cuda()
tgt = batch["target"].cuda()
with autocast(dtype=torch.bfloat16 if use_bf16 else torch.float16):
output = model(src)
loss = criterion(output, tgt)
loss.backward()
# Gradient clipping before optimizer step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
Memory Bandwidth Arithmetic
Understanding when you are memory-bound vs compute-bound is essential for production optimization.
A100 arithmetic intensity breakeven (TF32):
- Compute: 156 TFLOPS = 156e12 FLOPs/sec
- Memory bandwidth: 2.0 TB/s = 2.0e12 bytes/sec
- Breakeven arithmetic intensity: 156e12 / 2.0e12 = 78 FLOPs/byte
H100 arithmetic intensity breakeven (BF16):
- Compute: 1979 TFLOPS
- Memory bandwidth: 3.35 TB/s
- Breakeven: 1979 / 3.35 = 590 FLOPs/byte
Any operation with arithmetic intensity below the breakeven is memory-bandwidth-bound. Any operation above is compute-bound.
GEMM arithmetic intensity for a matrix multiply C = A × B, with A of shape M×K and B of shape K×N:
- FLOPs: 2 × M × K × N (a multiply and an add per inner-product term)
- Bytes read: (M×K + K×N) × bytes_per_element (plus M×N written for C)
For large matrices (M = K = N = 4096) in BF16 (2 bytes):
- FLOPs: 2 × 4096³ ≈ 137 billion
- Bytes read: 2 × 4096² × 2 ≈ 67 MB
- Arithmetic intensity: 137e9 / 67e6 ≈ 2000 FLOPs/byte (≈1365 if you also count the output write, as the helper below does)
This is far above the H100 breakeven (590), so large GEMMs are compute-bound on H100. Small GEMMs (e.g., batch=1, M=1, K=4096, N=16384 during autoregressive inference) have much lower arithmetic intensity and are memory-bandwidth-bound.
def compute_arithmetic_intensity(M: int, K: int, N: int, bytes_per_element: float = 2) -> float:
"""
Compute arithmetic intensity of a GEMM operation.
Returns FLOPs per byte - compare to GPU breakeven to determine bottleneck.
"""
flops = 2 * M * K * N
bytes_read = (M * K + K * N) * bytes_per_element
bytes_written = M * N * bytes_per_element
total_bytes = bytes_read + bytes_written
return flops / total_bytes
# Large GEMM (training, large batch)
ai_train = compute_arithmetic_intensity(4096, 4096, 4096)
print(f"Training GEMM AI: {ai_train:.0f} FLOPs/byte") # ~2000
# Inference GEMM (batch=1, single token generation)
ai_infer = compute_arithmetic_intensity(1, 4096, 16384)
print(f"Inference GEMM AI: {ai_infer:.2f} FLOPs/byte") # ~1.9 - memory bound!
# H100 BF16 breakeven
h100_breakeven = 1979e12 / 3.35e12
print(f"H100 BF16 breakeven: {h100_breakeven:.0f} FLOPs/byte") # ~590
print(f"Training: {'compute-bound' if ai_train > h100_breakeven else 'memory-bound'}")
print(f"Inference: {'compute-bound' if ai_infer > h100_breakeven else 'memory-bound'}")
(Diagram omitted: memory hierarchy and bandwidth.)
Common Mistakes
:::danger Treating Peak TFLOPS as Real Throughput H100 SXM5 is rated at 989 TFLOPS for TF32 and 3958 TFLOPS for FP8 (sparsity-inclusive peaks). Real LLM training throughput is typically 30-55% of peak TFLOPS on well-optimized workloads. Memory bandwidth limits, communication overhead in tensor parallel training, activations/normalization layers (not Tensor Core eligible), and kernel launch overhead all reduce effective utilization. Never use peak TFLOPS for capacity planning. Use measured MFU (Model FLOP Utilization) from benchmarks on your specific model architecture. An MFU of 45% on H100 for a 70B model is excellent. Plan conservatively at 35%. :::
:::danger Enabling FP8 Without TransformerEngine-Compatible Layers
FP8 quantization is only safe inside the TransformerEngine's managed GEMM operations. You cannot call torch.matmul and expect it to run in FP8 correctly - you will get silent numerical errors or NaN loss. The TransformerEngine wraps individual GEMMs with per-tensor scaling. Using te.fp8_autocast() context manager with standard torch.nn.Linear layers does nothing - only te.Linear, te.MultiheadAttention, te.LayerNormMLP and other TransformerEngine-native layers are FP8-capable.
:::
:::warning TF32 is Not FP32 - Regression Testing Required
TF32 reduces mantissa precision from 23 bits to 10 bits. For most neural network training, this is fine. But some workloads are sensitive: numerical integration code, certain RL algorithms with reward signals that require fine-grained precision, or any ML pipeline that compares exact floating-point values. Before enabling TF32 globally, run a comparison test on your specific task. If loss curves diverge or final accuracy changes by more than 0.1%, investigate. You can disable TF32 globally via the allow_tf32 flags (or torch.set_float32_matmul_precision("highest")), or keep sensitive computations off Tensor Core paths by casting them to FP64.
:::
:::warning MIG Breaks Multi-GPU Training MIG partitions are isolated. NVLink bandwidth does not cross MIG boundaries. If you partition an A100 into 7 x 1g.10gb instances and try to run tensor-parallel training across them, the GPUs will communicate over PCIe (because NVLink isolation is enforced), and your training will be severely bandwidth-constrained. Use MIG only for inference serving or multi-tenant scenarios where each workload fits entirely within a single MIG instance. :::
:::warning 2:4 Sparsity Requires a Fine-Tuning Phase Applying 2:4 structured sparsity to a fully trained dense model without subsequent fine-tuning causes 2-5% task accuracy degradation in language modeling. The correct workflow: (1) train dense to full convergence, (2) apply magnitude pruning with 2:4 pattern, (3) fine-tune for at least 10-20% of the original training budget. NVIDIA's ASP (Automatic Sparsity) library automates steps 2-3. Do not skip the fine-tuning phase and report "our sparse model lost 3% accuracy" - that is a workflow failure, not a fundamental limitation of sparsity. :::
:::warning GDDR6X vs HBM - RTX 4090 Bandwidth Ceiling The RTX 4090's 1.0 TB/s GDDR6X bandwidth feels fast until you compare it to H100's 3.35 TB/s HBM3. For memory-bandwidth-bound inference of 13B-70B models, the RTX 4090 will generate tokens roughly 3x slower. For local development and experimentation this is acceptable. For production inference serving with SLA requirements, this ceiling is the decisive factor. Quantization (4-bit or 8-bit) shifts the bottleneck from bandwidth to compute, which can partly neutralize the HBM advantage for inference, but training remains bandwidth-sensitive. :::
Interview Q&A
Q1: What is the Transformer Engine in the H100, and what problem does it solve that Ampere could not?
A: The Transformer Engine is a hardware-software system in Hopper that enables safe, automatic FP8 precision for GEMM operations inside transformer layers. Ampere's Tensor Cores supported TF32, BF16, FP16, and INT8. FP8 was not present in Ampere because it is numerically fragile without dynamic calibration: FP8 E4M3 has a max representable value of 448, meaning an uncalibrated weight or activation tensor could overflow entirely.
The Transformer Engine solves this with per-tensor dynamic scaling. Before each FP8 GEMM, the engine computes the amax (absolute maximum) of the input tensors, derives a scale factor that maps the distribution into FP8 range, executes the GEMM in FP8 hardware, and rescales the output back to BF16. The HYBRID recipe uses E4M3 (higher precision, smaller range) for forward-pass tensors and E5M2 (lower precision, larger range) for gradients in the backward pass.
The result is approximately 2x throughput over BF16 for transformer GEMM-dominated workloads. On H100, FP8 peaks at 3958 TFLOPS vs 1979 for BF16 (both sparsity-inclusive figures). Real end-to-end training speedup from enabling FP8 versus BF16 on the same H100 is typically 1.3-1.8x, accounting for non-GEMM operations and communication overhead.
Q2: Explain Multi-Instance GPU (MIG) on the A100. When would you use it, and what are its limitations?
A: MIG physically partitions the A100 (or H100) GPU into up to 7 isolated instances, each with dedicated streaming multiprocessors, L2 cache slices, and HBM memory regions. The isolation is hardware-enforced - one instance cannot access another's memory, which provides both performance isolation (no interference between workloads) and security isolation (no cross-tenant data leakage).
The A100 supports partition configurations from 1g.10gb (1/7 of the GPU, 10GB HBM, 14 SMs) up to the full 7g.80gb (all 108 SMs, 80GB). You can mix partition sizes: for example, one 3g.40gb instance and one 2g.20gb instance and two 1g.10gb instances.
Use MIG when:
- Running multiple small inference models simultaneously and you need guaranteed isolation (multi-tenant cloud, SaaS)
- You have burst inference traffic and want predictable latency per model
- Different teams share a cluster and need guaranteed, predictable GPU quota
- Models are small enough to fit in a partition (7B model in INT8 = ~7GB, fits in 1g.10gb)
Do not use MIG when:
- Running any multi-GPU training with tensor/pipeline parallelism (NVLink does not cross MIG boundaries)
- Running a single large model that needs the full 80GB
- Workloads have irregular memory access patterns that spike beyond the partition's HBM allocation
One subtle limitation: NVLink bandwidth is completely isolated per MIG instance. A 1g.10gb instance gets zero NVLink bandwidth. All communication to other GPUs (even other MIG instances on the same physical GPU) goes through PCIe.
Q3: You have a training budget for either 8x A100 80GB or 4x H100 80GB. Same total cost per hour. Which do you choose for a 13B transformer training run, and why?
A: This is a real trade-off that depends on the specific workload, but for a standard 13B transformer training run (dense, BF16 or FP8), the 4x H100 configuration wins on speed and likely on cost per token.
Compute comparison:
- 8x A100: 8 * 312 TFLOPS (BF16) = 2496 TFLOPS aggregate
- 4x H100: 4 * 1979 TFLOPS (BF16) = 7916 TFLOPS aggregate
Even at BF16 without FP8, 4x H100 delivers 3.2x more aggregate TFLOPS. With FP8 enabled via TransformerEngine for eligible layers, this gap widens to roughly 6x (4 * 3958 = 15832 aggregate peak FP8 TFLOPS).
Communication overhead:
- 8x A100: 8 GPUs means 8-way AllReduce per backward pass. More GPUs = more communication overhead.
- 4x H100: 4-way AllReduce with higher NVLink bandwidth (900 vs 600 GB/s). AllReduce time is proportional to (model size in bytes) / (bandwidth * num_links).
For a 13B model with 26GB of parameters in BF16, gradient AllReduce per step:
- A100 8-GPU: ring AllReduce moves ~2 × (7/8) × 26 GB ≈ 45 GB per GPU over 600 GB/s links, with per-hop latency growing with GPU count
- H100 4-GPU: ring AllReduce moves ~2 × (3/4) × 26 GB ≈ 39 GB per GPU over 900 GB/s links - less data, more bandwidth, fewer hops
Memory: Both configurations have 320GB total HBM. For a 13B model, BF16 weights need 26GB, BF16 gradients another 26GB, and mixed-precision Adam adds ~156GB of FP32 state (master weights plus first and second moments) - roughly 208GB total, which fits in 320GB with room for activations (or much less with ZeRO-style sharding).
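A back-of-envelope check, a sketch under the standard mixed-precision Adam accounting (activations excluded):
def adam_training_gb(params_billion: float) -> float:
    """BF16 weights (2B) + BF16 grads (2B) + FP32 master (4B)
    + Adam m (4B) + Adam v (4B) = 16 bytes per parameter."""
    return params_billion * 16

print(f"13B model: ~{adam_training_gb(13):.0f} GB before activations")  # ~208 GB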
Conclusion: 4x H100 trains faster per hour, has lower communication overhead, and with FP8 enabled, completes the job in significantly less wall-clock time. Unless the 8x A100 configuration comes with significantly lower per-hour cost (more than 3x cheaper per hour, accounting for the compute gap), 4x H100 is the better economic choice for this workload.
Q4: What is 2:4 structured sparsity, how does the A100 accelerate it, and what is the correct workflow to apply it?
A: 2:4 structured sparsity is a compression pattern where, for every 4 consecutive values in a matrix row, at least 2 must be exactly zero. The term "structured" distinguishes it from unstructured sparsity (where zeros can be anywhere), which is hard to accelerate in hardware because the non-zero positions are irregular and cannot be predicted by memory access patterns.
With 2:4 sparsity, NVIDIA designed a compressed format where only the 2 non-zero values per group of 4 are stored, plus 2-bit indices indicating their positions. The A100's Tensor Cores can process this compressed format directly: they load the 2 non-zero values, look up their positions, and perform the multiply-accumulate against the corresponding elements of the other matrix. This gives up to 2x throughput for GEMM operations on 2:4-sparse matrices vs dense matrices at the same precision (INT8, BF16, etc.).
Why 2:4 specifically: It is the sparsest pattern that NVIDIA could implement hardware acceleration for with a reasonable circuit area overhead. 1:4 would be more sparse but requires more complex index logic. 1:2 (50% sparsity, alternating zeros) is the simplest but wastes hardware. 2:4 was the engineering sweet spot.
Correct workflow:
- Train the dense model to full convergence. Do not prune during training.
- Apply magnitude-based 2:4 pruning: within each group of 4 consecutive values, zero out the 2 with smallest absolute magnitude. Use torch.ao.pruning.WeightNormSparsifier with sparse_block_shape=(1, 4) and zeros_per_block=2.
- Fine-tune the pruned model for 10-20% of the original training budget. The remaining 50% of weights are updated to compensate for the removed ones.
- Convert to NVIDIA's sparse tensor format for hardware acceleration.
- Verify: benchmark GEMM speed and accuracy degradation. Target: <1% accuracy drop and 1.5-1.9x measured speedup (slightly below theoretical 2x due to overhead).
Layers that benefit most: linear layers in FFN blocks with large dimensions (4096x16384 typical). Layers that do not benefit: embedding layers (sparse by nature, irregular access), normalization layers (no eligible GEMMs).
Q5: Describe the TF32 data format. What did NVIDIA change in its mantissa and exponent compared to FP32 and FP16, and what is the practical implication for training stability?
A: TF32 (TensorFloat-32) is a non-standard format invented by NVIDIA for the A100 Tensor Cores. It is not an IEEE 754 standard and cannot be addressed directly in software - it exists only as a hardware-internal format for Tensor Core GEMM operations.
Bit layout comparison:
| Format | Sign | Exponent | Mantissa | Total bits |
|---|---|---|---|---|
| FP32 | 1 | 8 | 23 | 32 |
| FP16 | 1 | 5 | 10 | 16 |
| TF32 | 1 | 8 | 10 | 19 |
| BF16 | 1 | 8 | 7 | 16 |
TF32 takes FP32's 8-bit exponent (range: 1.2e-38 to 3.4e38) and FP16's 10-bit mantissa (precision: about 3 decimal digits). From a dynamic range perspective, TF32 is identical to FP32 - it can represent the same range of values and does not overflow where FP32 would not. From a precision perspective, it is identical to FP16 - the mantissa is truncated from 23 to 10 bits.
Practical implication for training stability: The dominant cause of FP16 training instability is dynamic range overflow. When gradients spike early in training, FP16's 5-bit exponent (max value ~65504) saturates and produces inf. This is why FP16 mixed precision requires loss scaling. TF32's 8-bit exponent is identical to FP32, so it never overflows where FP32 would not. This makes TF32 safe to enable globally with zero code changes and no loss scaling. The reduced mantissa precision (23 to 10 bits, losing roughly 13 bits of fractional precision) rarely matters for neural network training because weight updates are inherently noisy from stochastic gradient descent.
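A two-line demonstration of the range difference - the 8-bit-exponent formats absorb a value that overflows FP16:
import torch

g = torch.tensor([70000.0])   # a gradient spike beyond FP16's max (~65504)
print(g.to(torch.float16))    # tensor([inf], dtype=torch.float16) - overflow
print(g.to(torch.bfloat16))   # ~70144 - 8-bit exponent, FP32-like range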
When TF32 can cause problems: Highly sensitive numerical computations embedded in a training loop - for example, certain optimization algorithms that rely on high-precision gradient accumulation, or regularization terms that measure exact function values. In these edge cases, disable TF32 around the sensitive region (set torch.backends.cuda.matmul.allow_tf32 = False, or cast the computation to FP64, which never takes TF32 paths), or run a convergence comparison against an FP32 baseline to verify the impact is acceptable.
Q6: For inference serving at scale on the H100, when is memory bandwidth the bottleneck and when is compute the bottleneck? What can you do about each?
A: The answer depends on batch size and sequence length, and it is captured by the arithmetic intensity of the operations in your serving path.
When memory bandwidth is the bottleneck (low arithmetic intensity):
- Batch size 1, autoregressive decoding (each token generation loads all 70B parameters from HBM)
- Small batch sizes (< 16) with short sequences
- Arithmetic intensity for a batch=1 GEMM is approximately 2 FLOPs per 2-byte parameter loaded (BF16) ≈ 1 FLOP/byte, far below H100's breakeven of ~590 FLOPs/byte
- Symptom: GPU compute utilization looks low (< 20%), but memory bandwidth utilization is high (> 70%)
What to do: Increase batch size (continuous batching / dynamic batching), use quantization (4-bit, 8-bit - reduces bytes loaded per parameter, shifting breakeven), use speculative decoding (draft model generates multiple candidates, target model verifies them in parallel - dramatically increases effective batch size for the target model).
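Reusing the compute_arithmetic_intensity helper from the Production Engineering Notes shows why quantization shifts the bottleneck - fewer bytes per weight means more FLOPs per byte:
# Decode GEMM (M=1): BF16 weights vs 4-bit weights (0.5 bytes/element)
print(compute_arithmetic_intensity(1, 4096, 16384, bytes_per_element=2))    # ~1.0
print(compute_arithmetic_intensity(1, 4096, 16384, bytes_per_element=0.5))  # ~4.0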
When compute is the bottleneck (high arithmetic intensity):
- Large effective token counts: big batches and long-context prefill (prompt processing of long documents), where the GEMM's token dimension M is in the thousands
- Arithmetic intensity grows roughly linearly with M when K and N dominate (AI ≈ M), so prefill GEMMs land at ~1000+ FLOPs/byte, above the H100 breakeven
- Symptom: compute utilization is high (> 60%), memory bandwidth utilization is moderate
What to do: Enable FP8 via TransformerEngine (2x throughput), ensure tensor core utilization by padding batch dimensions to multiples of 16 (FP16) or 32 (FP8), use FlashAttention v2/v3 for memory-efficient attention (reduces HBM reads/writes for attention activations).
The mixed reality: Production LLM inference is a mix of prefill (compute-bound) and decode (memory-bound) phases. Systems like vLLM and TGI handle this by separating prefill and decode batching strategies, using continuous batching to maximize batch sizes during decode, and applying prefix caching to avoid recomputing common prefixes. Architecture choice (H100 with 3.35 TB/s vs A100 with 2.0 TB/s) most directly impacts the decode phase, where bandwidth is the constraint.
Summary
GPU architecture generations are not marketing milestones. They represent specific hardware capabilities that change what training and inference strategies are practical.
Ampere (A100) established the foundation: TF32 for zero-code 8x speedup over FP32, 2:4 structured sparsity for 2x inference throughput, MIG for multi-tenant isolation, and NVLink 3.0 for fast multi-GPU training. The A100 remains a highly capable and cost-effective platform for most training workloads.
Hopper (H100) rebuilt the architecture around transformer workloads. The Transformer Engine with dynamic FP8 scaling adds another 2x throughput for transformer GEMMs. HBM3 at 3.35 TB/s improves memory-bandwidth-bound inference. NVLink 4.0 at 900 GB/s reduces communication overhead for tensor-parallel training. For LLM training at scale, H100 is the dominant choice where cost allows.
Ada Lovelace (RTX 4090) is the consumer path: 24GB GDDR6X, no Transformer Engine, no NVLink, but genuine Tensor Core capability for local inference and experimentation. For quantized 7B-13B model inference and local research workflows, it is an excellent value. For production serving or large-scale training, its memory bandwidth ceiling and lack of NVLink are disqualifying.
The decisions that follow from this knowledge: which GPU to provision, which precision format to enable, when to apply sparsity, how to structure multi-GPU training, and when to switch architectures as workloads scale.
