Tensor Core Programming
Reading time: ~42 min · Interview relevance: Very High · Target roles: CUDA Developer, ML Systems Engineer, GPU Kernel Engineer
An A100 delivers 312 TFLOPS of FP16 Tensor Core throughput and 19.5 TFLOPS of FP32 CUDA core throughput. The ratio is 16:1. If your matrix multiply is not using Tensor Cores, you are running a kernel that is 16x slower than it needs to be - on the same hardware.
The Bug That Cost a Week
The engineer had done everything right - or so it appeared.
The matrix dimensions were reasonable: M=4096, N=4096, K=4096. The kernel was properly tiled. Shared memory was used correctly. Global memory accesses were perfectly coalesced. The kernel launched without error and produced numerically correct results.
But Nsight Compute showed Tensor Core utilization at exactly 0%.
The SM was burning through FP32 CUDA cores at 15 TFLOPS while the Tensor Core units - capable of 312 TFLOPS on the same chip - sat completely idle. The kernel took 8.3 seconds. A cuBLAS call on the same inputs took 0.52 seconds.
The root cause: the input matrices were allocated as float (FP32). Not half (FP16). Not __nv_bfloat16. Not int8. Just plain 32-bit floats.
Tensor Cores on H100 have a strict contract: they will operate on FP16, BF16, TF32, FP8, FP64, or INT8. They will not touch FP32 inputs. There is no hardware fallback that uses Tensor Cores for FP32 - unless the library explicitly opts into the TF32 path described below, FP32 matrix multiplies are routed to the regular CUDA FMUL/FADD pipeline.
The engineer changed two lines: float to half, and the malloc sizes accordingly. Recompiled. Ran again. Tensor Core utilization jumped to 87%. Runtime dropped from 8.3 seconds to 0.58 seconds. A 14x speedup from a two-line change.
Understanding what activates Tensor Cores - and what silently falls back to CUDA cores - is not optional knowledge for an ML systems engineer. This lesson covers every requirement, every API layer, and every diagnostic technique you need to never waste a Tensor Core cycle again.
Why This Exists - The Arithmetic Bottleneck of Deep Learning
The Problem: Deep Learning is Dense Matrix Multiply
Every transformer layer is dominated by three operations: attention score computation (QK^T), attention weighted sum (AV), and feed-forward projection (linear layers). All three are matrix multiplications. A GPT-3 scale model during training spends roughly 90% of its compute time in matrix multiply.
For FP32 matrix multiply on a Kepler-era GPU (2012), peak throughput was a few TFLOPS. By Volta (2017), FP32 CUDA core throughput was around 14-15 TFLOPS. Progress was driven by higher clock speeds and wider execution units.
But the math requirements of deep learning were growing faster. A GPT-3 scale training run required roughly 3.14 x 10^23 floating-point operations. At 14 TFLOPS per GPU, that works out to roughly 700 GPU-years - months of wall-clock time even on a thousand-GPU cluster.
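A quick sanity check of that figure (plain arithmetic, nothing GPU-specific):
total_flops = 3.14e23        # approximate GPT-3 training compute (OpenAI's reported figure)
per_gpu_flops = 14e12        # ~14 TFLOPS of sustained FP32 throughput
seconds = total_flops / per_gpu_flops
print(f"{seconds / 3.154e7:.0f} GPU-years")   # ~711 GPU-years on FP32 CUDA cores alone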
The architectural response was Tensor Cores.
The Solution: Domain-Specific Matrix Hardware
Volta (2017) introduced the first Tensor Cores with the insight that deep learning matrix multiply has a specific structure the hardware could exploit: the computation is D = A * B + C where A, B, C, D are small matrix tiles (4x4 in Volta, 16x16 in later generations), the accumulation is FP32 but the inputs can be FP16 (because gradients and activations do not require FP32 precision), and the same operation repeats millions of times in a training run.
A Tensor Core is a dedicated hardware unit that performs the full D = A*B+C tile operation in a single clock cycle. It is not a generalization of the FMUL/FADD pipeline - it is a completely separate circuit wired to do exactly one thing: multiply two matrix fragments and accumulate the result.
The trade-off: this specialization comes with a hard contract. Feed the Tensor Core the wrong data type, the wrong dimensions, or misaligned memory - and it does nothing. The workload falls back to CUDA cores. There is no partial acceleration, no automatic type conversion, no silent degradation. It is all or nothing.
Historical Context - Tensor Core Evolution Across Architectures
| Architecture | Year | Tensor Core Type | Peak Dense Tensor TFLOPS | Tile Size | New Data Types |
|---|---|---|---|---|---|
| Volta V100 | 2017 | 1st gen | 125 (FP16) | 4x4x4 | FP16 in, FP32 acc |
| Turing T4 | 2018 | 2nd gen | 65 (FP16) | 8x8x4 | INT8, INT4 |
| Ampere A100 | 2020 | 3rd gen | 312 (FP16) | 16x16x16 | BF16, TF32, FP64 |
| Hopper H100 | 2022 | 4th gen | ~990 (FP16), ~1979 (FP8) | 16x8x16 | FP8 (E4M3/E5M2) |
| Blackwell B200 | 2024 | 5th gen | ~4500 (FP8) | varies | FP4, FP6 |
The Ampere A100 was the first architecture where TF32 (TensorFloat-32) allowed FP32 inputs to be automatically truncated to a 10-bit mantissa and processed through Tensor Cores - without programmer intervention. This is why many PyTorch workloads show Tensor Core utilization even when using FP32: cuDNN convolutions use TF32 by default, and FP32 matmuls do too once torch.backends.cuda.matmul.allow_tf32 is enabled (it was on by default from PyTorch 1.7 through 1.11, and off by default since 1.12).
Hopper introduced the Warpgroup MMA (WGMMA) instruction, operating on warpgroup tiles (128 threads = 4 warps together), with FP16 tiles up to 64x256x16 (M=64, N up to 256, K=16). This is the instruction that Triton generates when you write tl.dot() on H100.
Core Concept 1 - Tensor Core Activation Requirements
The Hard Requirements (H100)
For Tensor Cores to activate, all of the following must be satisfied simultaneously. Violating any single requirement silently falls back to CUDA cores.
1. Data type requirement:
| Accumulator (C/D) | Input A/B types | Architecture minimum |
|---|---|---|
| FP16 or FP32 | FP16 | Volta |
| FP32 | BF16 | Ampere |
| FP32 | TF32 (automatic from FP32) | Ampere |
| FP64 | FP64 | Ampere |
| INT32 | INT8 | Turing |
| FP16 or FP32 | FP8 (E4M3, E5M2) | Hopper |
2. Dimension requirements (FP16/BF16 on H100):
- M must be a multiple of 16
- N must be a multiple of 16
- K must be a multiple of 16
For INT8: multiples of 16 for M and N, multiples of 32 for K.
3. Memory alignment:
- All matrix pointers must be 16-byte aligned (16 bytes = 8 FP16 elements - the granularity WMMA loads expect)
- For float4 (128-bit) aligned loads into shared memory: 16-byte alignment
- cudaMalloc always returns 256-byte aligned memory - safe
- Subarrays via pointer arithmetic may not be aligned - check explicitly
4. Layout requirement:
- For WMMA: row-major or column-major layout specified at compile time via template parameters
- For MMA PTX: specific layout encoded in the instruction mnemonic
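Since any single violation falls back silently, it is worth checking all three conditions on the host before launching a custom kernel. A minimal PyTorch sketch (the helper name and the tile size of 16 are assumptions for the FP16/BF16 case):
import torch

def check_tensor_core_ready(A: torch.Tensor, B: torch.Tensor, tile: int = 16) -> None:
    """Pre-flight check: dtype, dimension multiples, and pointer alignment."""
    assert A.dtype in (torch.float16, torch.bfloat16), f"A is {A.dtype}; need fp16/bf16"
    assert B.dtype == A.dtype, f"B is {B.dtype}, expected {A.dtype}"
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    for name, dim in (("M", M), ("N", N), ("K", K)):
        assert dim % tile == 0, f"{name}={dim} is not a multiple of {tile}"
    for name, t in (("A", A), ("B", B)):
        assert t.data_ptr() % 16 == 0, f"{name} is not 16-byte aligned"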
TF32 - The Automatic Tensor Core Path for FP32
Ampere introduced TF32 (TensorFloat-32): a 19-bit format with FP32 exponent range (8 bits) but only 10-bit mantissa. It is not FP32, but the hardware automatically truncates FP32 inputs to TF32 before feeding them to Tensor Cores.
In PyTorch:
# Matmul TF32 was on by default in PyTorch 1.7-1.11 on Ampere+;
# since 1.12 it must be enabled explicitly
torch.backends.cuda.matmul.allow_tf32 = True # FP32 matmul uses TF32 TC
torch.backends.cudnn.allow_tf32 = True # Conv2d uses TF32 TC (on by default)
# With these flags set, profiling FP32 PyTorch training shows Tensor Core
# utilization even though you never explicitly used half precision
The precision loss from TF32 is ~3 decimal digits vs FP32's ~7. For most neural network training this is acceptable, and the roughly 8x throughput gain (156 vs 19.5 TFLOPS on A100) is significant. But for scientific computing with FP32, TF32 can cause unexpected numerical errors.
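To see the trade-off directly, a small experiment (sizes and the exact error you observe are illustrative; requires an Ampere or newer GPU):
import torch

A = torch.randn(1024, 1024, device="cuda")
B = torch.randn(1024, 1024, device="cuda")

torch.backends.cuda.matmul.allow_tf32 = False
ref = A @ B                          # full FP32 on CUDA cores

torch.backends.cuda.matmul.allow_tf32 = True
fast = A @ B                         # TF32 on Tensor Cores

rel_err = ((ref - fast).abs().max() / ref.abs().max()).item()
print(f"max relative error with TF32: {rel_err:.2e}")   # typically on the order of 1e-3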
Core Concept 2 - The WMMA API
Fragments: The Fundamental Abstraction
The WMMA (Warp Matrix Multiply-Accumulate) API exposes Tensor Core operations through the concept of a fragment. A fragment is a tile of a matrix distributed cooperatively across all 32 threads in a warp. No single thread holds the full tile - the 16x16 matrix is split across 32 threads in an architecture-specific layout.
The programmer does not need to know exactly which element each thread holds. The WMMA API handles the distribution automatically through load_matrix_sync and store_matrix_sync. The layout is opaque to the user - you load, multiply, store, and the hardware arranges the data correctly.
#include <mma.h>
using namespace nvcuda;
// Fragment type declarations
// Template args: role (matrix_a / matrix_b / accumulator), M, N, K, data type, layout (A/B only)
wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> d_frag;
// Load A tile from shared memory (row-major, stride = K)
wmma::load_matrix_sync(a_frag, smem_A_ptr, K);
// Load B tile from shared memory (col-major, stride = N)
wmma::load_matrix_sync(b_frag, smem_B_ptr, N);
// Load C accumulator (initialize to zeros if starting new tile)
wmma::fill_fragment(c_frag, 0.0f);
// OR load existing partial result:
// wmma::load_matrix_sync(c_frag, smem_C_ptr, N, wmma::mem_row_major);
// Tensor Core MMA: d_frag = a_frag * b_frag + c_frag
wmma::mma_sync(d_frag, a_frag, b_frag, c_frag);
// Store result back to shared memory
wmma::store_matrix_sync(smem_D_ptr, d_frag, N, wmma::mem_row_major);
Every WMMA call is warp-collective - all 32 threads in the warp must call it together, with the same arguments. Divergence within a warp calling WMMA produces undefined behavior (typically a hang or incorrect results).
Standard Tile Sizes
The canonical tile size for FP16/BF16 on WMMA is 16x16x16: M=16, N=16, K=16. This is the most tested and commonly used configuration.
WMMA also supports non-square tiles on some architectures:
| M | N | K | Input dtype | Accumulator |
|---|---|---|---|---|
| 16 | 16 | 16 | fp16 | fp32 |
| 32 | 8 | 16 | fp16 | fp32 |
| 8 | 32 | 16 | fp16 | fp32 |
| 16 | 16 | 16 | bf16 | fp32 |
| 8 | 8 | 4 | fp64 | fp64 |
| 16 | 16 | 16 | int8 | int32 |
For H100 Hopper, the preferred interface is no longer WMMA but WGMMA (warpgroup MMA), which operates on 128-thread warpgroups and supports much larger tiles. Triton abstracts this automatically via tl.dot().
Core Concept 3 - MMA PTX Instructions
Below WMMA: The PTX Layer
WMMA compiles down to MMA PTX instructions. Triton, FlashAttention, and CUTLASS all use MMA PTX directly rather than WMMA, because it gives finer control over register layout, instruction scheduling, and double-buffering.
The MMA PTX instruction for FP16 input, FP32 accumulation looks like this:
mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
{%f0, %f1, %f2, %f3}, // D (accumulator output, 4 f32 per thread)
{%r0, %r1, %r2, %r3}, // A (8 fp16 per thread, packed as 4 uint32)
{%r4, %r5}, // B (4 fp16 per thread, packed as 2 uint32)
{%f0, %f1, %f2, %f3}; // C (accumulator input, 4 f32 per thread)
Decoding the instruction name:
- mma.sync: warp-synchronous - all 32 threads issue the instruction together
- aligned: every thread in the warp must execute the same mma instruction (no divergence)
- m16n8k16: M=16, N=8, K=16 tile dimensions
- row.col: A is row-major, B is column-major
- f32.f16.f16.f32: D=f32, A=f16, B=f16, C=f32
The register count per thread for an m16n8k16 MMA:
- A: 8 fp16 values = 4 uint32 registers (2 fp16 packed per uint32)
- B: 4 fp16 values = 2 uint32 registers
- C/D: 4 fp32 values = 4 float registers
This low-level view explains why Tensor Core programming is hard to get right manually: the register layout is architecture-specific and the hardware requires precise packing. WMMA, Triton, and CUTLASS all handle this packing for you.
Inline PTX in CUDA
For situations where WMMA is too high-level but writing raw PTX is necessary:
// Inline PTX MMA instruction in CUDA kernel
// This is what CUTLASS and hand-optimized kernels do
__device__ void mma_m16n8k16_f16_f32(
float (&d)[4], // output
uint32_t const (&a)[4], // A fragments (fp16 packed)
uint32_t const (&b)[2], // B fragments (fp16 packed)
float const (&c)[4]) // accumulator input
{
asm volatile(
"mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
"{%0,%1,%2,%3},"
"{%4,%5,%6,%7},"
"{%8,%9},"
"{%10,%11,%12,%13};"
: "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
: "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
"r"(b[0]), "r"(b[1]),
"f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])
);
}
This is the level at which FlashAttention-2 is implemented. Unless you are writing a library, use WMMA or Triton instead.
Core Concept 4 - Complete WMMA Kernel
FP16 GEMM Using WMMA API
This is a complete, minimal GEMM kernel using WMMA. It demonstrates the tiling structure required: the outer loops tile the output matrix into WMMA_M x WMMA_N blocks, and the K loop accumulates K-dimension tiles:
#include <cuda_fp16.h>
#include <mma.h>
#include <cuda_runtime.h>
using namespace nvcuda;
// Tile sizes for WMMA
const int WMMA_M = 16;
const int WMMA_N = 16;
const int WMMA_K = 16;
// Block tile sizes (multiple of WMMA tile)
const int BLOCK_ROW_TILES = 4; // 4 * 16 = 64 rows per block
const int BLOCK_COL_TILES = 4; // 4 * 16 = 64 cols per block
// Each warp computes one WMMA_M x WMMA_N output tile
// 4x4 = 16 warps per block (512 threads / 32 = 16 warps)
__global__ void wmma_gemm_fp16(
const half* __restrict__ A, // M x K, row-major
const half* __restrict__ B, // K x N, row-major
float* __restrict__ C, // M x N, row-major, accumulator
int M, int N, int K)
{
// Shared memory tiles for A and B
// Pad by 8 elements to avoid shared memory bank conflicts
__shared__ half s_A[BLOCK_ROW_TILES * WMMA_M][WMMA_K + 8];
__shared__ half s_B[WMMA_K][BLOCK_COL_TILES * WMMA_N + 8];
// Warp and lane identification
int warp_id = threadIdx.x / 32;
int lane_id = threadIdx.x % 32;
// Which output tile this warp is responsible for
int warp_row = warp_id / BLOCK_COL_TILES; // row within block tile
int warp_col = warp_id % BLOCK_COL_TILES; // col within block tile
// Global row/col offset for this block
int block_row = blockIdx.x * (BLOCK_ROW_TILES * WMMA_M);
int block_col = blockIdx.y * (BLOCK_COL_TILES * WMMA_N);
// Row/col of the output tile this warp computes
int out_row = block_row + warp_row * WMMA_M;
int out_col = block_col + warp_col * WMMA_N;
// WMMA accumulator fragment - initialized to zero
wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> acc_frag;
wmma::fill_fragment(acc_frag, 0.0f);
// Loop over K tiles
for (int k_tile = 0; k_tile < K; k_tile += WMMA_K) {
// Cooperative load of A tile into shared memory
// All threads in the block load together
for (int i = threadIdx.x; i < BLOCK_ROW_TILES * WMMA_M * WMMA_K;
i += blockDim.x) {
int row = i / WMMA_K;
int col = i % WMMA_K;
int global_row = block_row + row;
int global_col = k_tile + col;
s_A[row][col] = (global_row < M && global_col < K)
? A[global_row * K + global_col]
: __float2half(0.0f);
}
// Cooperative load of B tile into shared memory
for (int i = threadIdx.x; i < WMMA_K * BLOCK_COL_TILES * WMMA_N;
i += blockDim.x) {
int row = i / (BLOCK_COL_TILES * WMMA_N);
int col = i % (BLOCK_COL_TILES * WMMA_N);
int global_row = k_tile + row;
int global_col = block_col + col;
s_B[row][col] = (global_row < K && global_col < N)
? B[global_row * N + global_col]
: __float2half(0.0f);
}
__syncthreads();
// Each warp loads its WMMA tile and executes MMA
wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K,
half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K,
half, wmma::row_major> b_frag;
// Load from shared memory - stride is the full shared mem row width
const half* smem_a_ptr = &s_A[warp_row * WMMA_M][0];
const half* smem_b_ptr = &s_B[0][warp_col * WMMA_N];
wmma::load_matrix_sync(a_frag, smem_a_ptr, WMMA_K + 8);
wmma::load_matrix_sync(b_frag, smem_b_ptr, BLOCK_COL_TILES * WMMA_N + 8);
// Tensor Core MMA: acc_frag += a_frag * b_frag
wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
__syncthreads();
}
// Store the accumulated result to global memory
if (out_row < M && out_col < N) {
wmma::store_matrix_sync(&C[out_row * N + out_col], acc_frag, N,
wmma::mem_row_major);
}
}
// Launch configuration
void launch_wmma_gemm(const half* A, const half* B, float* C,
int M, int N, int K) {
// Grid: cover M rows in BLOCK_ROW_TILES*WMMA_M chunks
// cover N cols in BLOCK_COL_TILES*WMMA_N chunks
dim3 grid((M + BLOCK_ROW_TILES * WMMA_M - 1) / (BLOCK_ROW_TILES * WMMA_M),
(N + BLOCK_COL_TILES * WMMA_N - 1) / (BLOCK_COL_TILES * WMMA_N));
// 512 threads per block = 16 warps = 16 WMMA tiles per block
dim3 block(512);
wmma_gemm_fp16<<<grid, block>>>(A, B, C, M, N, K);
}
Understanding the Warp Tile Layout
In the kernel above, each block launches 512 threads = 16 warps, arranged as a 4x4 grid by (warp_row, warp_col). Each warp owns one 16x16 output tile, so a block covers a 64x64 tile of C. On every iteration of the K loop, all 16 warps load their A and B fragments from the shared memory tiles and issue one mma_sync, advancing through K in steps of 16.
Core Concept 5 - Triton: Automatic Tensor Core Dispatch
Why Triton for Tensor Cores?
Triton is a Python-embedded DSL for writing GPU kernels that automatically maps tl.dot() to Tensor Core MMA instructions when the conditions are met. It handles:
- Register fragment layout automatically
- WGMMA instruction generation for Hopper
- Software pipelining of loads with computation
- Shape compliance checking at compile time
For most engineers writing custom attention mechanisms, sparse operations, or non-standard GEMMs, Triton is the right level of abstraction. WMMA directly is error-prone. MMA PTX is for library authors.
Triton GEMM that Uses Tensor Cores
import triton
import triton.language as tl
import torch
@triton.jit
def matmul_kernel(
A_ptr, B_ptr, C_ptr,
M, N, K,
stride_am, stride_ak,
stride_bk, stride_bn,
stride_cm, stride_cn,
# Block sizes as constexpr - MUST be compile-time constants for Tensor Cores
BLOCK_M: tl.constexpr, # must be multiple of 16
BLOCK_N: tl.constexpr, # must be multiple of 16
BLOCK_K: tl.constexpr, # must be multiple of 16
):
# Program ID
pid_m = tl.program_id(0)
pid_n = tl.program_id(1)
# Tile offsets
offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
offs_k = tl.arange(0, BLOCK_K)
# Pointers to current tiles
A_tile_ptr = A_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
B_tile_ptr = B_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
# FP32 accumulator
acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
# K loop - tl.dot uses Tensor Cores automatically
for k in range(0, K, BLOCK_K):
# Load tiles from global memory
a = tl.load(A_tile_ptr, mask=offs_m[:, None] < M, other=0.0)
b = tl.load(B_tile_ptr, mask=offs_n[None, :] < N, other=0.0)
# tl.dot maps to WMMA/WGMMA Tensor Core instruction
# REQUIRES: a and b must be fp16 or bf16 for Tensor Core activation
# REQUIRES: BLOCK_M, BLOCK_N, BLOCK_K must be multiples of 16
acc += tl.dot(a, b)
A_tile_ptr += BLOCK_K * stride_ak
B_tile_ptr += BLOCK_K * stride_bk
# Write output
offs_cm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
offs_cn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
C_tile_ptr = C_ptr + offs_cm[:, None] * stride_cm + offs_cn[None, :] * stride_cn
tl.store(C_tile_ptr, acc, mask=(offs_cm[:, None] < M) & (offs_cn[None, :] < N))
def matmul_triton(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
M, K = A.shape
K2, N = B.shape
assert K == K2
# CRITICAL: inputs must be fp16 or bf16 for Tensor Cores
assert A.dtype in (torch.float16, torch.bfloat16), \
f"Expected fp16/bf16, got {A.dtype} - Tensor Cores will NOT activate with fp32"
C = torch.zeros((M, N), dtype=torch.float32, device=A.device)
# Grid: one program per output block
grid = lambda meta: (
triton.cdiv(M, meta['BLOCK_M']),
triton.cdiv(N, meta['BLOCK_N']),
)
matmul_kernel[grid](
A, B, C,
M, N, K,
A.stride(0), A.stride(1),
B.stride(0), B.stride(1),
C.stride(0), C.stride(1),
# Block sizes MUST be multiples of 16 for Tensor Core activation
BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
)
return C
# Verify Tensor Core usage
if __name__ == "__main__":
M, N, K = 4096, 4096, 4096
# This WILL use Tensor Cores
A_fp16 = torch.randn(M, K, dtype=torch.float16, device='cuda')
B_fp16 = torch.randn(K, N, dtype=torch.float16, device='cuda')
C = matmul_triton(A_fp16, B_fp16)
print("FP16 matmul - Tensor Cores should be active")
# Profile with Nsight to verify:
# ncu --metrics sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active
# python this_script.py
Triton Tensor Core Activation Checklist
# WILL use Tensor Cores:
BLOCK_M=64, BLOCK_N=64, BLOCK_K=32 # all multiples of 16 - good
A.dtype = torch.float16 # fp16 - good
B.dtype = torch.float16 # fp16 - good
# WILL NOT use Tensor Cores:
BLOCK_M=60, BLOCK_N=64, BLOCK_K=32 # 60 is not a multiple of 16 - bad
A.dtype = torch.float32 # fp32 without TF32 mode - bad
BLOCK_K=8 # too small for H100 WGMMA - bad
Core Concept 6 - CUTLASS
What CUTLASS Is and When to Use It
CUTLASS (CUDA Templates for Linear Algebra Subroutines and Solvers) is NVIDIA's open-source template library for high-performance GEMM. It generates Tensor Core kernels with:
- Double-buffering of shared memory tiles (software pipelining)
- Optimal register file layout for the target architecture
- Epilogue fusion (add bias, apply relu, etc. without extra kernel launches)
- Architecture-specific optimizations compiled in
CUTLASS typically matches cuBLAS performance and sometimes exceeds it for non-standard epilogues.
#include <cutlass/gemm/device/gemm.h>
// Define the GEMM operation type
using ElementA = cutlass::half_t; // FP16 input A
using ElementB = cutlass::half_t; // FP16 input B
using ElementC = float; // FP32 accumulator
using ElementOutput = float; // FP32 output
using LayoutA = cutlass::layout::RowMajor;
using LayoutB = cutlass::layout::ColumnMajor; // col-major B is optimal for TC
using LayoutC = cutlass::layout::RowMajor;
// Tile sizes (must satisfy Tensor Core requirements)
using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 32>;
using WarpShape = cutlass::gemm::GemmShape<64, 64, 32>;
using InstructionShape = cutlass::gemm::GemmShape<16, 8, 16>; // Tensor Core tile
// CUTLASS GEMM type - generates Tensor Core kernel automatically
using Gemm = cutlass::gemm::device::Gemm<
ElementA, LayoutA,
ElementB, LayoutB,
ElementC, LayoutC,
ElementOutput,
cutlass::arch::OpClassTensorOp, // USE TENSOR CORES
cutlass::arch::Sm80, // Ampere (use Sm90 for Hopper)
ThreadblockShape, WarpShape, InstructionShape,
cutlass::epilogue::thread::LinearCombination<ElementOutput, 1, ElementC, float>,
cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
4 // pipeline stages (double-buffering = 2, quad-buffering = 4)
>;
// Run the GEMM
void run_cutlass_gemm(half* A, half* B, float* C, float* D,
int M, int N, int K, float alpha, float beta) {
typename Gemm::Arguments args(
{M, N, K},
{reinterpret_cast<cutlass::half_t const*>(A), K}, // A ptr, ld = K (row-major M x K)
{reinterpret_cast<cutlass::half_t const*>(B), K}, // B ptr, ld = K (column-major K x N)
{C, N}, // C ptr, leading dimension
{D, N}, // D ptr, leading dimension
{alpha, beta}
);
Gemm gemm_op;
cutlass::Status status = gemm_op(args);
if (status != cutlass::Status::kSuccess) {
printf("CUTLASS GEMM failed: %s\n",
cutlass::cutlassGetStatusString(status));
}
}
CUTLASS vs cuBLAS vs WMMA vs Triton - when to use each:
| Tool | Use case | Performance | Code complexity |
|---|---|---|---|
| cuBLAS | Standard GEMM, no custom epilogue | Highest | Minimal |
| CUTLASS | Custom epilogue (fused ops), non-standard layout | cuBLAS-level | Medium |
| Triton | Custom attention, non-square ops, research | Near-optimal | Low-Medium |
| WMMA direct | Learning, prototyping, simple kernels | Good | High |
| MMA PTX | Library development, FlashAttention-level | Optimal | Very High |
Core Concept 7 - Diagnosing Zero Tensor Core Utilization
The Diagnostic Workflow
When Nsight shows 0% Tensor Core utilization on a matrix multiply kernel, follow this checklist:
Step 1: Check data types
# In Python - most common cause
print(f"A dtype: {A.dtype}") # must be fp16, bf16, or int8
print(f"B dtype: {B.dtype}") # same
# If FP32 on Ampere+, check TF32 setting:
print(torch.backends.cuda.matmul.allow_tf32) # should be True
// In CUDA - check your fragment type
// WRONG - FP32 inputs do not activate Tensor Cores (and no plain-float
// 16x16x16 fragment specialization exists, so this fails at compile time)
wmma::fragment<wmma::matrix_a, 16, 16, 16, float, wmma::row_major> a_frag; // float!
// RIGHT - FP16 inputs
wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
Step 2: Verify dimension multiples
M, N, K = A.shape[0], B.shape[1], A.shape[1]
assert M % 16 == 0, f"M={M} is not a multiple of 16"
assert N % 16 == 0, f"N={N} is not a multiple of 16"
assert K % 16 == 0, f"K={K} is not a multiple of 16"
Step 3: Check memory alignment
// Verify 16-byte alignment
assert(((uintptr_t)A_ptr % 16) == 0);
assert(((uintptr_t)B_ptr % 16) == 0);
assert(((uintptr_t)C_ptr % 16) == 0);
Step 4: Verify with Nsight Compute
# The definitive Tensor Core metric
ncu --metrics \
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active,\
sm__pipe_fma_cycles_active.avg.pct_of_peak_sustained_active \
./myprogram
# If tensor > 0%: Tensor Cores are firing
# If tensor == 0% and fma > 0%: CUDA cores only
# If both == 0%: memory bound or stalled
Step 5: Inspect PTX/SASS
# Look for mma instructions in SASS
cuobjdump --dump-sass myprogram | grep -i "mma\|HMMA\|DMMA\|IMMA"
# If no mma instructions found: compiler did not generate Tensor Core code
# Check WMMA calls and types
The Nsight Roofline for Tensor Cores
After fixing Tensor Core utilization, use the Roofline model to understand whether the kernel is near peak compute throughput or limited by memory bandwidth.
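A back-of-the-envelope version of the same check (a sketch; the peak numbers are approximate A100-80GB datasheet values - substitute your GPU's):
def gemm_arithmetic_intensity(M, N, K, in_bytes=2, out_bytes=4):
    flops = 2 * M * N * K                                          # one mul + one add per MAC
    bytes_moved = in_bytes * (M * K + K * N) + out_bytes * M * N   # each operand touched once
    return flops / bytes_moved

peak_tflops = 312.0   # A100 FP16 Tensor Core peak
peak_bw_tbs = 2.0     # A100-80GB HBM bandwidth, approx
ridge = (peak_tflops * 1e12) / (peak_bw_tbs * 1e12)   # FLOP/byte where compute == memory

ai = gemm_arithmetic_intensity(4096, 4096, 4096)
print(f"AI = {ai:.0f} FLOP/byte, ridge = {ridge:.0f} FLOP/byte")
# AI >> ridge: the kernel should be compute-bound. If measured TFLOPS are still far
# below peak, look at Tensor Core utilization and pipeline stalls, not memory traffic.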
Production Engineering Notes
Mixed Precision Training Pattern
The standard production pattern for Tensor Core usage in training:
import torch
from torch.cuda.amp import autocast, GradScaler
# Enable TF32 for FP32 ops (uses Tensor Cores with truncated mantissa)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
model = MyTransformerModel().cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler() # loss scaling for FP16 stability
for batch in dataloader:
optimizer.zero_grad()
# autocast: linear layers run in FP16 -> Tensor Cores active
# loss stays in FP32 for numerical stability
with autocast(dtype=torch.float16):
output = model(batch['input'])
loss = loss_fn(output, batch['target'])
# scaler prevents FP16 underflow in gradients
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
The autocast context manager automatically casts eligible ops to FP16 (activating Tensor Cores) while keeping loss computation in FP32. The GradScaler multiplies the loss by a large scalar before backward pass to prevent gradient underflow in FP16.
Padding Matrices to Tensor Core Multiples
When your actual matrix dimensions are not multiples of 16 (or 8 for INT8), pad them:
def pad_to_multiple(tensor: torch.Tensor, multiple: int = 16) -> torch.Tensor:
"""Pad a matrix's dimensions to multiples of `multiple` for Tensor Cores."""
m, n = tensor.shape
pad_m = (multiple - m % multiple) % multiple
pad_n = (multiple - n % multiple) % multiple
if pad_m == 0 and pad_n == 0:
return tensor
return torch.nn.functional.pad(tensor, (0, pad_n, 0, pad_m))
# In a linear layer:
A = pad_to_multiple(A_original, 16) # pad M and K
B = pad_to_multiple(B_original, 16) # pad K and N
C = torch.mm(A, B)
C_trimmed = C[:M_original, :N_original] # trim padding from output
For transformer models, it is better to pad the embedding dimension and vocabulary size at initialization time rather than padding every forward pass.
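For example, a hypothetical way to round the vocabulary up once at model construction (the numbers and names are illustrative):
import torch.nn as nn

def round_up(x: int, multiple: int = 16) -> int:
    return ((x + multiple - 1) // multiple) * multiple

vocab_size = 50257                        # e.g. the GPT-2 tokenizer size
padded_vocab = round_up(vocab_size, 16)   # 50272 - output projection dims are now TC-friendly
embedding = nn.Embedding(padded_vocab, 4096)
# The extra rows cost a little memory; the corresponding logits are masked or ignored at the loss.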
Verifying Tensor Core Use in PyTorch
import torch
from torch.profiler import profile, ProfilerActivity
model = torch.nn.Linear(4096, 4096, dtype=torch.float16).cuda()
x = torch.randn(64, 4096, dtype=torch.float16, device='cuda')
with profile(activities=[ProfilerActivity.CUDA],
with_flops=True) as prof:
for _ in range(10):
y = model(x)
# Print top operations by CUDA time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# Look for: 'aten::mm' or 'aten::linear' - should show high FLOPS
# If TFLOPS shown is close to 312 (A100 FP16) or 989 (H100 FP16):
# Tensor Cores are active
Batch Dimension and Batched Matmul (bmm)
Batched matrix multiply (torch.bmm, torch.baddbmm) uses Tensor Cores if:
- All conditions above are met for each sub-matrix
- The batch dimension does not fragment the computation into tiny tiles
For batch sizes of 1 or very small (< 32), Tensor Core utilization often drops because the M and N tiles are too small to fill the SM. Fuse the batch dimension into M for small-batch inference:
# Small batch: [B, seq_len, d_model] -> [B * seq_len, d_model]
# This gives larger M dimension, better TC utilization
x_reshaped = x.view(-1, d_model) # merge batch and seq_len
y_reshaped = linear(x_reshaped)
y = y_reshaped.view(B, seq_len, -1)
Common Mistakes
:::danger Wrong data type is the most common cause of 0% Tensor Core utilization
The single most common Tensor Core bug: using float (FP32) inputs without enabling TF32 mode. In CUDA, passing FP32 data where a half WMMA fragment is expected at least fails at compile time; in PyTorch, float inputs to torch.mm without TF32 mode quietly fall back to CUDA cores with no warning. Always explicitly check A.dtype before performance-critical matrix operations and assert that it is torch.float16 or torch.bfloat16.
:::
:::danger All 32 threads in a warp must call WMMA functions together
WMMA operations are warp-collective. If any thread in the warp takes a different code path and skips the wmma::mma_sync call, the entire warp hangs or produces incorrect results. Never put wmma::mma_sync inside an if block that could be skipped by some threads in the warp. Handle boundary conditions by padding the matrices to Tensor Core multiples before the kernel, not with conditional logic inside the kernel.
:::
:::danger Fragment stride argument is not the tile size, it is the matrix leading dimension
The stride argument to wmma::load_matrix_sync is the leading dimension of the source matrix in memory, not the tile size. For a row-major matrix A of shape MxK loaded from shared memory with shared memory width SMEM_WIDTH, the stride is SMEM_WIDTH, not K. Getting this wrong produces no compile error but silently loads incorrect data into the fragment.
// WRONG - stride = WMMA_K (tile size)
wmma::load_matrix_sync(a_frag, smem_a_ptr, WMMA_K);
// RIGHT - stride = smem row width (leading dimension)
wmma::load_matrix_sync(a_frag, smem_a_ptr, K_smem_width);
:::
:::warning Accumulator fragments are not automatically initialized to zero
wmma::fill_fragment(acc_frag, 0.0f) must be called explicitly before the first wmma::mma_sync. Uninitialized fragments contain garbage values. The hardware does not reset accumulators between uses - the fragment retains whatever value was in the registers from the previous operation.
:::
:::warning Column-major B matrix layout usually gives better Tensor Core performance
The standard GEMM convention (as used in cuBLAS and CUTLASS) uses row-major A and column-major B. With column-major B, the K-dimension tiles of B are contiguous in memory, enabling efficient 128-bit loads from shared memory into Tensor Core fragments. Row-major B requires transposed shared memory layouts and extra shuffles. If you have a choice in how to store B, prefer column-major.
:::
:::warning Dimension padding affects correctness if accumulation wraps
When padding matrices to Tensor Core multiples, the padding elements must be zero (not garbage). A padding element in A multiplied by a real element in B contributes to the accumulator. If the padding is zero, this contribution is zero and the result is correct. If padding is garbage, the result is corrupted. Always use torch.zeros or cudaMemset for padding, never torch.empty or uninitialized cudaMalloc.
:::
Interview Questions and Answers
Q1: What is the WMMA API and how do you use it to perform a matrix multiply with Tensor Cores?
WMMA (Warp Matrix Multiply-Accumulate) is the CUDA C++ API in nvcuda::wmma namespace that exposes Tensor Core operations at the warp level. The fundamental abstraction is the fragment, a matrix tile distributed cooperatively across all 32 threads in a warp. You do not control which element each thread holds - the layout is opaque and architecture-specific. The workflow is: (1) declare fragment objects with template parameters specifying role (matrix_a, matrix_b, accumulator), tile dimensions (16x16x16 for fp16), data type (half or float), and layout (row_major/col_major); (2) call wmma::fill_fragment to initialize the accumulator to zero; (3) load tiles from shared memory into fragments using wmma::load_matrix_sync with the correct stride (the matrix's leading dimension, not the tile size); (4) call wmma::mma_sync to perform the Tensor Core D = A*B+C operation; (5) call wmma::store_matrix_sync to write the result back to shared or global memory. All WMMA calls are warp-collective - all 32 threads must call them with the same arguments.
Q2: What are the activation requirements for Tensor Cores on H100, and what happens if any are violated?
H100 Tensor Core activation requires three things simultaneously: correct data types (FP16, BF16, TF32, FP8, or INT8 - not plain FP32), dimension multiples of 16 for M, N, and K (multiples of 32 for K with INT8), and 16-byte memory alignment for all matrix pointers. If any requirement is violated, the operation falls back to regular CUDA cores silently - no error, no warning, no exception. The fallback runs at FP32 CUDA core throughput (~19.5 TFLOPS on A100), which is 16x slower than FP16 Tensor Core throughput (312 TFLOPS). The most common violation is FP32 input types. On Ampere and later, setting torch.backends.cuda.matmul.allow_tf32 = True enables automatic truncation of FP32 inputs to TF32 (10-bit mantissa), which activates Tensor Cores even for FP32 code at the cost of some precision.
Q3: How do you diagnose and fix zero Tensor Core utilization in Nsight Compute?
The primary metric is sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active. Zero means Tensor Cores are not active. The diagnostic process: first check data types - print the dtype of input tensors or check fragment type declarations in CUDA code; FP32 without TF32 mode is the most common cause. Second, verify dimension multiples - M, N, K must all be multiples of 16 for FP16/BF16, and K must be a multiple of 32 for some INT8 configurations. Third, check memory alignment using assert(((uintptr_t)ptr % 16) == 0). Fourth, examine the SASS with cuobjdump --dump-sass and look for HMMA (half-precision MMA), DMMA (double), or IMMA (integer) instructions - their absence confirms the compiler did not generate Tensor Core code. Finally, cross-reference with sm__pipe_fma_cycles_active - if this is high while tensor is zero, CUDA cores are active and the workload has fallen back.
Q4: What is the difference between WMMA, MMA PTX, and Triton for Tensor Core programming?
WMMA is the highest-level CUDA C++ API, using C++ templates and warp-collective fragment objects. It is suitable for learning and simple custom kernels but does not expose the register layout needed for aggressive optimization. MMA PTX is the intermediate assembly level - MMA instructions encoded directly in PTX, used in CUTLASS and hand-written high-performance kernels like FlashAttention. It gives full control over register packing, instruction scheduling, and pipeline stages but requires deep knowledge of architecture-specific register layouts and is architecture-specific (different instructions for Turing, Ampere, Hopper). Triton is a Python DSL where tl.dot() automatically maps to the correct Tensor Core instruction for the target GPU - WMMA on Turing/Ampere, WGMMA on Hopper. Triton handles register layout, packing, and software pipelining automatically, achieving near-CUTLASS performance for most operations. For production ML systems: use cuBLAS for standard GEMM, Triton for custom attention or fused operations, CUTLASS for custom epilogues, and WMMA/MMA PTX only when writing a GPU library.
Q5: Why must Tensor Core tile dimensions be multiples of 16, and what is the consequence of using non-multiples?
The 16x16x16 constraint comes from the physical wiring of the Tensor Core hardware. A single Tensor Core unit performs one 16x16x16 matrix multiply in one clock cycle, and the 32 threads of a warp cooperatively feed exactly 16*16 = 256 input elements to it. If your matrix dimensions are not multiples of 16, there is no partial Tensor Core operation - the hardware cannot execute a 14x16x16 multiply. The GPU must either pad the operation internally (it does not) or fall back to scalar CUDA cores. For dimensions that are not multiples of 16, the standard solution is explicit padding: allocate matrices padded to the next multiple of 16, fill the padding region with zeros (not garbage), perform the matrix multiply on the padded dimensions, and trim the output back to the original dimensions. The overhead of padding and trimming is usually negligible compared to the 10-16x throughput gain from activating Tensor Cores.
Q6: Explain the CUTLASS template library and when you would use it over WMMA code or cuBLAS.
CUTLASS is NVIDIA's open-source C++ template library for GEMM that generates highly optimized Tensor Core kernels by composing small building blocks: ThreadblockShape (the tile each block handles), WarpShape (the tile each warp handles), InstructionShape (the Tensor Core MMA tile), and an Epilogue (post-multiply operations like add bias or apply activation). CUTLASS typically matches cuBLAS performance because it uses the same optimization techniques: double-buffering (software pipelining of shared memory loads with MMA execution), optimal register file layout, and architecture-specific instruction sequences. The primary reason to use CUTLASS over cuBLAS is epilogue fusion: if you need to apply a non-standard activation function, add a residual connection, or perform quantization immediately after the matrix multiply, doing it in a separate kernel forces an extra global memory round-trip. CUTLASS fuses arbitrary epilogues into the GEMM kernel at no additional cost. Use cuBLAS when the standard GEMM interface suffices. Use CUTLASS when you need a non-standard epilogue but standard inputs and outputs. Use Triton when you need non-standard tiling, attention-like computation patterns, or rapid iteration.
Q7: What is TF32 and how does it enable Tensor Core usage for FP32 workloads?
TF32 (TensorFloat-32) is a 19-bit floating-point format introduced in Ampere that has the same 8-bit exponent as FP32 (same range) but only 10 bits of mantissa instead of 23 (lower precision, similar to FP16's mantissa width). TF32 is not directly programmable - it is an internal hardware mode. When TF32 mode is enabled (torch.backends.cuda.matmul.allow_tf32 = True, on by default from PyTorch 1.7 through 1.11 on Ampere GPUs and off by default since 1.12), the hardware automatically truncates FP32 inputs to TF32 format before feeding them to Tensor Cores, performs the matrix multiply in Tensor Cores with TF32 arithmetic, and produces FP32 outputs. The programmer sees FP32 inputs and FP32 outputs, but the computation uses Tensor Cores and runs at roughly 8x the speed of pure FP32 on A100. The precision loss is in the mantissa: TF32 has ~3 decimal digits of precision vs FP32's ~7. For neural network training this is acceptable (gradients are noisy anyway), but for scientific computing or financial applications requiring full FP32 precision, TF32 mode should be disabled with torch.backends.cuda.matmul.allow_tf32 = False.
Summary
Tensor Core programming is not difficult once you understand the contract: the hardware provides a 10-16x throughput multiplier for matrix multiply, and in return it requires exact compliance with data type, dimension, and alignment constraints. There are no partial failures - either all requirements are met and you get Tensor Core speed, or one requirement fails and you silently fall back to CUDA core speed.
The practical hierarchy of tools, from highest to lowest level of abstraction:
- PyTorch autocast + TF32 mode: transparent, covers 90% of training use cases
- Triton tl.dot(): for custom attention and non-standard patterns, near-optimal performance
- CUTLASS: for custom epilogues with standard GEMM, library-quality performance
- WMMA API: for learning and simple custom kernels
- MMA PTX: for library development, achieving the final few percent of peak throughput
The diagnostic tool is always Nsight Compute: sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active. Zero percent means the requirements are not met. Systematically check dtype first (most common failure), then dimensions, then alignment. The engineer in the opening scenario lost a week to a two-line fix. With this lesson, you will fix it in ten minutes.
