Tensor Core Programming
Reading time: ~42 min · Interview relevance: Very High · Target roles: CUDA Developer, ML Systems Engineer, GPU Kernel Engineer
An A100 delivers 312 TFLOPS of FP16 Tensor Core throughput and 19.5 TFLOPS of FP32 CUDA core throughput. The ratio is 16:1. If your matrix multiply is not using Tensor Cores, you are running a kernel that is 16x slower than it needs to be - on the same hardware.
The Bug That Cost a Week
The engineer had done everything right - or so it appeared.
The matrix dimensions were reasonable: M=4096, N=4096, K=4096. The kernel was properly tiled. Shared memory was used correctly. Global memory accesses were perfectly coalesced. The kernel launched without error and produced numerically correct results.
But Nsight Compute showed Tensor Core utilization at exactly 0%.
The SM was burning through FP32 CUDA cores at 15 TFLOPS while the Tensor Core units - capable of 312 TFLOPS on the same chip - sat completely idle. The kernel took 8.3 seconds. A cuBLAS call on the same inputs took 0.52 seconds.
The root cause: the input matrices were allocated as float (FP32). Not half (FP16). Not __nv_bfloat16. Not int8. Just plain 32-bit floats.
Tensor Cores on H100 have a strict contract: they will operate on FP16, BF16, TF32, FP8, FP64, or INT8. They will not touch FP32 inputs. There is no hardware fallback that uses Tensor Cores for FP32 - unless the library explicitly opts into the TF32 path described below, FP32 matrix multiplies are routed to the regular CUDA FMUL/FADD pipeline.
The engineer changed two lines: float to half, and the malloc sizes accordingly. Recompiled. Ran again. Tensor Core utilization jumped to 87%. Runtime dropped from 8.3 seconds to 0.58 seconds. A 14x speedup from a two-line change.
Understanding what activates Tensor Cores - and what silently falls back to CUDA cores - is not optional knowledge for an ML systems engineer. This lesson covers every requirement, every API layer, and every diagnostic technique you need to never waste a Tensor Core cycle again.
Why This Exists - The Arithmetic Bottleneck of Deep Learning
The Problem: Deep Learning is Dense Matrix Multiply
Every transformer layer is dominated by three operations: attention score computation (QK^T), attention weighted sum (AV), and feed-forward projection (linear layers). All three are matrix multiplications. A GPT-3 scale model during training spends roughly 90% of its compute time in matrix multiply.
For FP32 matrix multiply on a Kepler-era GPU (2012), peak throughput was a few TFLOPS. By Volta (2017), FP32 CUDA core throughput was around 14-15 TFLOPS. Progress was driven by higher clock speeds and wider execution units.
But the math requirements of deep learning were growing faster. A GPT-3 scale training run required roughly 3.14 x 10^23 floating-point operations. At 14 TFLOPS per GPU, that works out to roughly 700 GPU-years - months of wall-clock time even on a thousand-GPU cluster.
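A quick sanity check of that figure (plain arithmetic, nothing GPU-specific):
total_flops = 3.14e23        # approximate GPT-3 training compute (OpenAI's reported figure)
per_gpu_flops = 14e12        # ~14 TFLOPS of sustained FP32 throughput
seconds = total_flops / per_gpu_flops
print(f"{seconds / 3.154e7:.0f} GPU-years")   # ~711 GPU-years on FP32 CUDA cores alone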
The architectural response was Tensor Cores.
The Solution: Domain-Specific Matrix Hardware
Volta (2017) introduced the first Tensor Cores with the insight that deep learning matrix multiply has a specific structure the hardware could exploit: the computation is D = A * B + C where A, B, C, D are small matrix tiles (4x4 in Volta, 16x16 in later generations), the accumulation is FP32 but the inputs can be FP16 (because gradients and activations do not require FP32 precision), and the same operation repeats millions of times in a training run.
A Tensor Core is a dedicated hardware unit that performs the full D = A*B+C tile operation in a single clock cycle. It is not a generalization of the FMUL/FADD pipeline - it is a completely separate circuit wired to do exactly one thing: multiply two matrix fragments and accumulate the result.
The trade-off: this specialization comes with a hard contract. Feed the Tensor Core the wrong data type, the wrong dimensions, or misaligned memory - and it does nothing. The workload falls back to CUDA cores. There is no partial acceleration, no automatic type conversion, no silent degradation. It is all or nothing.
Historical Context - Tensor Core Evolution Across Architectures
| Architecture | Year | Tensor Core Type | Peak Dense Tensor TFLOPS | Tile Size | New Data Types |
|---|---|---|---|---|---|
| Volta V100 | 2017 | 1st gen | 125 (FP16) | 4x4x4 | FP16 in, FP32 acc |
| Turing T4 | 2018 | 2nd gen | 65 (FP16) | 8x8x4 | INT8, INT4 |
| Ampere A100 | 2020 | 3rd gen | 312 (FP16) | 16x16x16 | BF16, TF32, FP64 |
| Hopper H100 | 2022 | 4th gen | ~990 (FP16), ~1979 (FP8) | 16x8x16 | FP8 (E4M3/E5M2) |
| Blackwell B200 | 2024 | 5th gen | ~4500 (FP8) | varies | FP4, FP6 |
The Ampere A100 was the first architecture where TF32 (TensorFloat-32) allowed FP32 inputs to be automatically truncated to a 10-bit mantissa and processed through Tensor Cores - without programmer intervention. This is why many PyTorch workloads show Tensor Core utilization even when using FP32: cuDNN convolutions use TF32 by default, and FP32 matmuls do too once torch.backends.cuda.matmul.allow_tf32 is enabled (it was on by default from PyTorch 1.7 through 1.11, and off by default since 1.12).
Hopper introduced the Warpgroup MMA (WGMMA) instruction, operating on warpgroup tiles (128 threads = 4 warps together), with FP16 tiles up to 64x256x16 (M=64, N up to 256, K=16). This is the instruction that Triton generates when you write tl.dot() on H100.
Core Concept 1 - Tensor Core Activation Requirements
The Hard Requirements (H100)
For Tensor Cores to activate, all of the following must be satisfied simultaneously. Violating any single requirement silently falls back to CUDA cores.
1. Data type requirement:
| Accumulator (C/D) | Input A/B types | Architecture minimum |
|---|---|---|
| FP16 or FP32 | FP16 | Volta |
| FP32 | BF16 | Ampere |
| FP32 | TF32 (automatic from FP32) | Ampere |
| FP64 | FP64 | Ampere |
| INT32 | INT8 | Turing |
| FP16 or FP32 | FP8 (E4M3, E5M2) | Hopper |
2. Dimension requirements (FP16/BF16 on H100):
- M must be a multiple of 16
- N must be a multiple of 16
- K must be a multiple of 16
For INT8: multiples of 16 for M and N, multiples of 32 for K.
3. Memory alignment:
- All matrix pointers must be 16-byte aligned (16 bytes = 8 FP16 elements - the granularity WMMA loads expect)
- For float4 (128-bit) aligned loads into shared memory: 16-byte alignment
- cudaMalloc always returns 256-byte aligned memory - safe
- Subarrays via pointer arithmetic may not be aligned - check explicitly
4. Layout requirement:
- For WMMA: row-major or column-major layout specified at compile time via template parameters
- For MMA PTX: specific layout encoded in the instruction mnemonic
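Since any single violation falls back silently, it is worth checking all three conditions on the host before launching a custom kernel. A minimal PyTorch sketch (the helper name and the tile size of 16 are assumptions for the FP16/BF16 case):
import torch

def check_tensor_core_ready(A: torch.Tensor, B: torch.Tensor, tile: int = 16) -> None:
    """Pre-flight check: dtype, dimension multiples, and pointer alignment."""
    assert A.dtype in (torch.float16, torch.bfloat16), f"A is {A.dtype}; need fp16/bf16"
    assert B.dtype == A.dtype, f"B is {B.dtype}, expected {A.dtype}"
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    for name, dim in (("M", M), ("N", N), ("K", K)):
        assert dim % tile == 0, f"{name}={dim} is not a multiple of {tile}"
    for name, t in (("A", A), ("B", B)):
        assert t.data_ptr() % 16 == 0, f"{name} is not 16-byte aligned"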
TF32 - The Automatic Tensor Core Path for FP32
Ampere introduced TF32 (TensorFloat-32): a 19-bit format with FP32 exponent range (8 bits) but only 10-bit mantissa. It is not FP32, but the hardware automatically truncates FP32 inputs to TF32 before feeding them to Tensor Cores.
In PyTorch:
# Matmul TF32 was on by default in PyTorch 1.7-1.11 on Ampere+;
# since 1.12 it must be enabled explicitly
torch.backends.cuda.matmul.allow_tf32 = True # FP32 matmul uses TF32 TC
torch.backends.cudnn.allow_tf32 = True # Conv2d uses TF32 TC (on by default)
# With these flags set, profiling FP32 PyTorch training shows Tensor Core
# utilization even though you never explicitly used half precision
The precision loss from TF32 is ~3 decimal digits vs FP32's ~7. For most neural network training this is acceptable, and the roughly 8x throughput gain (156 vs 19.5 TFLOPS on A100) is significant. But for scientific computing with FP32, TF32 can cause unexpected numerical errors.
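To see the trade-off directly, a small experiment (sizes and the exact error you observe are illustrative; requires an Ampere or newer GPU):
import torch

A = torch.randn(1024, 1024, device="cuda")
B = torch.randn(1024, 1024, device="cuda")

torch.backends.cuda.matmul.allow_tf32 = False
ref = A @ B                          # full FP32 on CUDA cores

torch.backends.cuda.matmul.allow_tf32 = True
fast = A @ B                         # TF32 on Tensor Cores

rel_err = ((ref - fast).abs().max() / ref.abs().max()).item()
print(f"max relative error with TF32: {rel_err:.2e}")   # typically on the order of 1e-3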
Core Concept 2 - The WMMA API
Fragments: The Fundamental Abstraction
The WMMA (Warp Matrix Multiply-Accumulate) API exposes Tensor Core operations through the concept of a fragment. A fragment is a tile of a matrix distributed cooperatively across all 32 threads in a warp. No single thread holds the full tile - the 16x16 matrix is split across 32 threads in an architecture-specific layout.
The programmer does not need to know exactly which element each thread holds. The WMMA API handles the distribution automatically through load_matrix_sync and store_matrix_sync. The layout is opaque to the user - you load, multiply, store, and the hardware arranges the data correctly.
#include <mma.h>
using namespace nvcuda;
// Fragment type declarations
// Template args: role (matrix_a / matrix_b / accumulator), M, N, K, data type, layout (A/B only)
wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> d_frag;
// Load A tile from shared memory (row-major, stride = K)
wmma::load_matrix_sync(a_frag, smem_A_ptr, K);
// Load B tile from shared memory (col-major, stride = N)
wmma::load_matrix_sync(b_frag, smem_B_ptr, N);
// Load C accumulator (initialize to zeros if starting new tile)
wmma::fill_fragment(c_frag, 0.0f);
// OR load existing partial result:
// wmma::load_matrix_sync(c_frag, smem_C_ptr, N, wmma::mem_row_major);
// Tensor Core MMA: d_frag = a_frag * b_frag + c_frag
wmma::mma_sync(d_frag, a_frag, b_frag, c_frag);
// Store result back to shared memory
wmma::store_matrix_sync(smem_D_ptr, d_frag, N, wmma::mem_row_major);
Every WMMA call is warp-collective - all 32 threads in the warp must call it together, with the same arguments. Divergence within a warp calling WMMA produces undefined behavior (typically a hang or incorrect results).
Standard Tile Sizes
The canonical tile size for FP16/BF16 on WMMA is 16x16x16: M=16, N=16, K=16. This is the most tested and commonly used configuration.
WMMA also supports non-square tiles on some architectures:
| M | N | K | Input dtype | Accumulator |
|---|---|---|---|---|
| 16 | 16 | 16 | fp16 | fp32 |
| 32 | 8 | 16 | fp16 | fp32 |
| 8 | 32 | 16 | fp16 | fp32 |
| 16 | 16 | 16 | bf16 | fp32 |
| 8 | 8 | 4 | fp64 | fp64 |
| 16 | 16 | 16 | int8 | int32 |
For H100 Hopper, the preferred interface is no longer WMMA but WGMMA (warpgroup MMA), which operates on 128-thread warpgroups and supports much larger tiles. Triton abstracts this automatically via tl.dot().
Core Concept 3 - MMA PTX Instructions
Below WMMA: The PTX Layer
WMMA compiles down to MMA PTX instructions. Triton, FlashAttention, and CUTLASS all use MMA PTX directly rather than WMMA, because it gives finer control over register layout, instruction scheduling, and double-buffering.
The MMA PTX instruction for FP16 input, FP32 accumulation looks like this:
mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
{%f0, %f1, %f2, %f3}, // D (accumulator output, 4 f32 per thread)
{%r0, %r1, %r2, %r3}, // A (8 fp16 per thread, packed as 4 uint32)
{%r4, %r5}, // B (4 fp16 per thread, packed as 2 uint32)
{%f0, %f1, %f2, %f3}; // C (accumulator input, 4 f32 per thread)
Decoding the instruction name:
- mma.sync: warp-synchronous - all 32 threads issue the instruction together
- aligned: every thread in the warp must execute the same mma instruction (no divergence)
- m16n8k16: M=16, N=8, K=16 tile dimensions
- row.col: A is row-major, B is column-major
- f32.f16.f16.f32: D=f32, A=f16, B=f16, C=f32
The register count per thread for an m16n8k16 MMA:
- A: 8 fp16 values = 4 uint32 registers (2 fp16 packed per uint32)
- B: 4 fp16 values = 2 uint32 registers
- C/D: 4 fp32 values = 4 float registers
This low-level view explains why Tensor Core programming is hard to get right manually: the register layout is architecture-specific and the hardware requires precise packing. WMMA, Triton, and CUTLASS all handle this packing for you.
Inline PTX in CUDA
For situations where WMMA is too high-level but writing raw PTX is necessary:
// Inline PTX MMA instruction in CUDA kernel
// This is what CUTLASS and hand-optimized kernels do
__device__ void mma_m16n8k16_f16_f32(
float (&d)[4], // output
uint32_t const (&a)[4], // A fragments (fp16 packed)
uint32_t const (&b)[2], // B fragments (fp16 packed)
float const (&c)[4]) // accumulator input
{
asm volatile(
"mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
"{%0,%1,%2,%3},"
"{%4,%5,%6,%7},"
"{%8,%9},"
"{%10,%11,%12,%13};"
: "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
: "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
"r"(b[0]), "r"(b[1]),
"f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3])
);
}
This is the level at which FlashAttention-2 is implemented. Unless you are writing a library, use WMMA or Triton instead.
Core Concept 4 - Complete WMMA Kernel
FP16 GEMM Using WMMA API
This is a complete, minimal GEMM kernel using WMMA. It demonstrates the tiling structure required: the outer loops tile the output matrix into WMMA_M x WMMA_N blocks, and the K loop accumulates K-dimension tiles:
#include <cuda_fp16.h>
#include <mma.h>
#include <cuda_runtime.h>
using namespace nvcuda;
// Tile sizes for WMMA
const int WMMA_M = 16;
const int WMMA_N = 16;
const int WMMA_K = 16;
// Block tile sizes (multiple of WMMA tile)
const int BLOCK_ROW_TILES = 4; // 4 * 16 = 64 rows per block
const int BLOCK_COL_TILES = 4; // 4 * 16 = 64 cols per block
// Each warp computes one WMMA_M x WMMA_N output tile
// 4x4 = 16 warps per block (512 threads / 32 = 16 warps)
__global__ void wmma_gemm_fp16(
const half* __restrict__ A, // M x K, row-major
const half* __restrict__ B, // K x N, row-major
float* __restrict__ C, // M x N, row-major, accumulator
int M, int N, int K)
{
// Shared memory tiles for A and B
// Pad by 8 elements to avoid shared memory bank conflicts
__shared__ half s_A[BLOCK_ROW_TILES * WMMA_M][WMMA_K + 8];
__shared__ half s_B[WMMA_K][BLOCK_COL_TILES * WMMA_N + 8];
// Warp and lane identification
int warp_id = threadIdx.x / 32;
int lane_id = threadIdx.x % 32;
// Which output tile this warp is responsible for
int warp_row = warp_id / BLOCK_COL_TILES; // row within block tile
int warp_col = warp_id % BLOCK_COL_TILES; // col within block tile
// Global row/col offset for this block
int block_row = blockIdx.x * (BLOCK_ROW_TILES * WMMA_M);
int block_col = blockIdx.y * (BLOCK_COL_TILES * WMMA_N);
// Row/col of the output tile this warp computes
int out_row = block_row + warp_row * WMMA_M;
int out_col = block_col + warp_col * WMMA_N;
// WMMA accumulator fragment - initialized to zero
wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> acc_frag;
wmma::fill_fragment(acc_frag, 0.0f);
// Loop over K tiles
for (int k_tile = 0; k_tile < K; k_tile += WMMA_K) {
// Cooperative load of A tile into shared memory
// All threads in the block load together
for (int i = threadIdx.x; i < BLOCK_ROW_TILES * WMMA_M * WMMA_K;
i += blockDim.x) {
int row = i / WMMA_K;
int col = i % WMMA_K;
int global_row = block_row + row;
int global_col = k_tile + col;
s_A[row][col] = (global_row < M && global_col < K)
? A[global_row * K + global_col]
: __float2half(0.0f);
}
// Cooperative load of B tile into shared memory
for (int i = threadIdx.x; i < WMMA_K * BLOCK_COL_TILES * WMMA_N;
i += blockDim.x) {
int row = i / (BLOCK_COL_TILES * WMMA_N);
int col = i % (BLOCK_COL_TILES * WMMA_N);
int global_row = k_tile + row;
int global_col = block_col + col;
s_B[row][col] = (global_row < K && global_col < N)
? B[global_row * N + global_col]
: __float2half(0.0f);
}
__syncthreads();
// Each warp loads its WMMA tile and executes MMA
wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K,
half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K,
half, wmma::row_major> b_frag;
// Load from shared memory - stride is the full shared mem row width
const half* smem_a_ptr = &s_A[warp_row * WMMA_M][0];
const half* smem_b_ptr = &s_B[0][warp_col * WMMA_N];
wmma::load_matrix_sync(a_frag, smem_a_ptr, WMMA_K + 8);
wmma::load_matrix_sync(b_frag, smem_b_ptr, BLOCK_COL_TILES * WMMA_N + 8);
// Tensor Core MMA: acc_frag += a_frag * b_frag
wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
__syncthreads();
}
// Store the accumulated result to global memory
if (out_row < M && out_col < N) {
wmma::store_matrix_sync(&C[out_row * N + out_col], acc_frag, N,
wmma::mem_row_major);
}
}
// Launch configuration
void launch_wmma_gemm(const half* A, const half* B, float* C,
int M, int N, int K) {
// Grid: cover M rows in BLOCK_ROW_TILES*WMMA_M chunks
// cover N cols in BLOCK_COL_TILES*WMMA_N chunks
dim3 grid((M + BLOCK_ROW_TILES * WMMA_M - 1) / (BLOCK_ROW_TILES * WMMA_M),
(N + BLOCK_COL_TILES * WMMA_N - 1) / (BLOCK_COL_TILES * WMMA_N));
// 512 threads per block = 16 warps = 16 WMMA tiles per block
dim3 block(512);
wmma_gemm_fp16<<<grid, block>>>(A, B, C, M, N, K);
}
Understanding the Warp Tile Layout
In the kernel above, each block launches 512 threads = 16 warps, arranged as a 4x4 grid by (warp_row, warp_col). Each warp owns one 16x16 output tile, so a block covers a 64x64 tile of C. On every iteration of the K loop, all 16 warps load their A and B fragments from the shared memory tiles and issue one mma_sync, advancing through K in steps of 16.
Core Concept 5 - Triton: Automatic Tensor Core Dispatch
Why Triton for Tensor Cores?
Triton is a Python-embedded DSL for writing GPU kernels that automatically maps tl.dot() to Tensor Core MMA instructions when the conditions are met. It handles:
- Register fragment layout automatically
- WGMMA instruction generation for Hopper
- Software pipelining of loads with computation
- Shape compliance checking at compile time
For most engineers writing custom attention mechanisms, sparse operations, or non-standard GEMMs, Triton is the right level of abstraction. WMMA directly is error-prone. MMA PTX is for library authors.
Triton GEMM that Uses Tensor Cores
import triton
import triton.language as tl
import torch
@triton.jit
def matmul_kernel(
A_ptr, B_ptr, C_ptr,
M, N, K,
stride_am, stride_ak,
stride_bk, stride_bn,
stride_cm, stride_cn,
# Block sizes as constexpr - MUST be compile-time constants for Tensor Cores
BLOCK_M: tl.constexpr, # must be multiple of 16
BLOCK_N: tl.constexpr, # must be multiple of 16
BLOCK_K: tl.constexpr, # must be multiple of 16
):
# Program ID
pid_m = tl.program_id(0)
pid_n = tl.program_id(1)
# Tile offsets
offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
offs_k = tl.arange(0, BLOCK_K)
# Pointers to current tiles
A_tile_ptr = A_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
B_tile_ptr = B_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
# FP32 accumulator
acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
# K loop - tl.dot uses Tensor Cores automatically
for k in range(0, K, BLOCK_K):
# Load tiles from global memory
a = tl.load(A_tile_ptr, mask=offs_m[:, None] < M, other=0.0)
b = tl.load(B_tile_ptr, mask=offs_n[None, :] < N, other=0.0)
# tl.dot maps to WMMA/WGMMA Tensor Core instruction
# REQUIRES: a and b must be fp16 or bf16 for Tensor Core activation
# REQUIRES: BLOCK_M, BLOCK_N, BLOCK_K must be multiples of 16
acc += tl.dot(a, b)
A_tile_ptr += BLOCK_K * stride_ak
B_tile_ptr += BLOCK_K * stride_bk
# Write output
offs_cm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
offs_cn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
C_tile_ptr = C_ptr + offs_cm[:, None] * stride_cm + offs_cn[None, :] * stride_cn
tl.store(C_tile_ptr, acc, mask=(offs_cm[:, None] < M) & (offs_cn[None, :] < N))
def matmul_triton(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
M, K = A.shape
K2, N = B.shape
assert K == K2
# CRITICAL: inputs must be fp16 or bf16 for Tensor Cores
assert A.dtype in (torch.float16, torch.bfloat16), \
f"Expected fp16/bf16, got {A.dtype} - Tensor Cores will NOT activate with fp32"
C = torch.zeros((M, N), dtype=torch.float32, device=A.device)
# Grid: one program per output block
grid = lambda meta: (
triton.cdiv(M, meta['BLOCK_M']),
triton.cdiv(N, meta['BLOCK_N']),
)
matmul_kernel[grid](
A, B, C,
M, N, K,
A.stride(0), A.stride(1),
B.stride(0), B.stride(1),
C.stride(0), C.stride(1),
# Block sizes MUST be multiples of 16 for Tensor Core activation
BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
)
return C
# Verify Tensor Core usage
if __name__ == "__main__":
M, N, K = 4096, 4096, 4096
# This WILL use Tensor Cores
A_fp16 = torch.randn(M, K, dtype=torch.float16, device='cuda')
B_fp16 = torch.randn(K, N, dtype=torch.float16, device='cuda')
C = matmul_triton(A_fp16, B_fp16)
print("FP16 matmul - Tensor Cores should be active")
# Profile with Nsight to verify:
# ncu --metrics sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active
# python this_script.py
Triton Tensor Core Activation Checklist
# WILL use Tensor Cores:
BLOCK_M=64, BLOCK_N=64, BLOCK_K=32 # all multiples of 16 - good
A.dtype = torch.float16 # fp16 - good
B.dtype = torch.float16 # fp16 - good
# WILL NOT use Tensor Cores:
BLOCK_M=60, BLOCK_N=64, BLOCK_K=32 # 60 is not a multiple of 16 - bad
A.dtype = torch.float32 # fp32 without TF32 mode - bad
BLOCK_K=8 # too small for H100 WGMMA - bad
Core Concept 6 - CUTLASS
What CUTLASS Is and When to Use It
CUTLASS (CUDA Templates for Linear Algebra Subroutines and Solvers) is NVIDIA's open-source template library for high-performance GEMM. It generates Tensor Core kernels with:
- Double-buffering of shared memory tiles (software pipelining)
- Optimal register file layout for the target architecture
- Epilogue fusion (add bias, apply relu, etc. without extra kernel launches)
- Architecture-specific optimizations compiled in
CUTLASS typically matches cuBLAS performance and sometimes exceeds it for non-standard epilogues.
#include <cutlass/gemm/device/gemm.h>
// Define the GEMM operation type
using ElementA = cutlass::half_t; // FP16 input A
using ElementB = cutlass::half_t; // FP16 input B
using ElementC = float; // FP32 accumulator
using ElementOutput = float; // FP32 output
using LayoutA = cutlass::layout::RowMajor;
using LayoutB = cutlass::layout::ColumnMajor; // col-major B is optimal for TC
using LayoutC = cutlass::layout::RowMajor;
// Tile sizes (must satisfy Tensor Core requirements)
using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 32>;
using WarpShape = cutlass::gemm::GemmShape<64, 64, 32>;
using InstructionShape = cutlass::gemm::GemmShape<16, 8, 16>; // Tensor Core tile
// CUTLASS GEMM type - generates Tensor Core kernel automatically
using Gemm = cutlass::gemm::device::Gemm<
ElementA, LayoutA,
ElementB, LayoutB,
ElementC, LayoutC,
ElementOutput,
cutlass::arch::OpClassTensorOp, // USE TENSOR CORES
cutlass::arch::Sm80, // Ampere (use Sm90 for Hopper)
ThreadblockShape, WarpShape, InstructionShape,
cutlass::epilogue::thread::LinearCombination<ElementOutput, 1, ElementC, float>,
cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
4 // pipeline stages (double-buffering = 2, quad-buffering = 4)
>;
// Run the GEMM
void run_cutlass_gemm(half* A, half* B, float* C, float* D,
int M, int N, int K, float alpha, float beta) {
typename Gemm::Arguments args(
{M, N, K},
{reinterpret_cast<cutlass::half_t const*>(A), K}, // A ptr, ld = K (row-major M x K)
{reinterpret_cast<cutlass::half_t const*>(B), K}, // B ptr, ld = K (column-major K x N)
{C, N}, // C ptr, leading dimension
{D, N}, // D ptr, leading dimension
{alpha, beta}
);
Gemm gemm_op;
cutlass::Status status = gemm_op(args);
if (status != cutlass::Status::kSuccess) {
printf("CUTLASS GEMM failed: %s\n",
cutlass::cutlassGetStatusString(status));
}
}
CUTLASS vs cuBLAS vs WMMA vs Triton - when to use each:
| Tool | Use case | Performance | Code complexity |
|---|---|---|---|
| cuBLAS | Standard GEMM, no custom epilogue | Highest | Minimal |
| CUTLASS | Custom epilogue (fused ops), non-standard layout | cuBLAS-level | Medium |
| Triton | Custom attention, non-square ops, research | Near-optimal | Low-Medium |
| WMMA direct | Learning, prototyping, simple kernels | Good | High |
| MMA PTX | Library development, FlashAttention-level | Optimal | Very High |
Core Concept 7 - Diagnosing Zero Tensor Core Utilization
The Diagnostic Workflow
When Nsight shows 0% Tensor Core utilization on a matrix multiply kernel, follow this checklist:
Step 1: Check data types
# In Python - most common cause
print(f"A dtype: {A.dtype}") # must be fp16, bf16, or int8
print(f"B dtype: {B.dtype}") # same
# If FP32 on Ampere+, check TF32 setting:
print(torch.backends.cuda.matmul.allow_tf32) # should be True
// In CUDA - check your fragment type
// WRONG - FP32 inputs do not activate Tensor Cores (and no plain-float
// 16x16x16 fragment specialization exists, so this fails at compile time)
wmma::fragment<wmma::matrix_a, 16, 16, 16, float, wmma::row_major> a_frag; // float!
// RIGHT - FP16 inputs
wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
Step 2: Verify dimension multiples
M, N, K = A.shape[0], B.shape[1], A.shape[1]
assert M % 16 == 0, f"M={M} is not a multiple of 16"
assert N % 16 == 0, f"N={N} is not a multiple of 16"
assert K % 16 == 0, f"K={K} is not a multiple of 16"
Step 3: Check memory alignment
// Verify 16-byte alignment
assert(((uintptr_t)A_ptr % 16) == 0);
assert(((uintptr_t)B_ptr % 16) == 0);
assert(((uintptr_t)C_ptr % 16) == 0);
Step 4: Verify with Nsight Compute
# The definitive Tensor Core metric
ncu --metrics \
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active,\
sm__pipe_fma_cycles_active.avg.pct_of_peak_sustained_active \
./myprogram
# If tensor > 0%: Tensor Cores are firing
# If tensor == 0% and fma > 0%: CUDA cores only
# If both == 0%: memory bound or stalled
Step 5: Inspect PTX/SASS
# Look for mma instructions in SASS
cuobjdump --dump-sass myprogram | grep -i "mma\|HMMA\|DMMA\|IMMA"
# If no mma instructions found: compiler did not generate Tensor Core code
# Check WMMA calls and types
The Nsight Roofline for Tensor Cores
After fixing Tensor Core utilization, use the Roofline model to understand whether the kernel is near peak compute throughput or limited by memory bandwidth.
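A back-of-the-envelope version of the same check (a sketch; the peak numbers are approximate A100-80GB datasheet values - substitute your GPU's):
def gemm_arithmetic_intensity(M, N, K, in_bytes=2, out_bytes=4):
    flops = 2 * M * N * K                                          # one mul + one add per MAC
    bytes_moved = in_bytes * (M * K + K * N) + out_bytes * M * N   # each operand touched once
    return flops / bytes_moved

peak_tflops = 312.0   # A100 FP16 Tensor Core peak
peak_bw_tbs = 2.0     # A100-80GB HBM bandwidth, approx
ridge = (peak_tflops * 1e12) / (peak_bw_tbs * 1e12)   # FLOP/byte where compute == memory

ai = gemm_arithmetic_intensity(4096, 4096, 4096)
print(f"AI = {ai:.0f} FLOP/byte, ridge = {ridge:.0f} FLOP/byte")
# AI >> ridge: the kernel should be compute-bound. If measured TFLOPS are still far
# below peak, look at Tensor Core utilization and pipeline stalls, not memory traffic.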
Production Engineering Notes
Mixed Precision Training Pattern
The standard production pattern for Tensor Core usage in training:
import torch
from torch.cuda.amp import autocast, GradScaler
# Enable TF32 for FP32 ops (uses Tensor Cores with truncated mantissa)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
model = MyTransformerModel().cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler() # loss scaling for FP16 stability
for batch in dataloader:
optimizer.zero_grad()
# autocast: linear layers run in FP16 -> Tensor Cores active
# loss stays in FP32 for numerical stability
with autocast(dtype=torch.float16):
output = model(batch['input'])
loss = loss_fn(output, batch['target'])
# scaler prevents FP16 underflow in gradients
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
The autocast context manager automatically casts eligible ops to FP16 (activating Tensor Cores) while keeping loss computation in FP32. The GradScaler multiplies the loss by a large scalar before backward pass to prevent gradient underflow in FP16.
Padding Matrices to Tensor Core Multiples
When your actual matrix dimensions are not multiples of 16 (or 8 for INT8), pad them:
def pad_to_multiple(tensor: torch.Tensor, multiple: int = 16) -> torch.Tensor:
"""Pad a matrix's dimensions to multiples of `multiple` for Tensor Cores."""
m, n = tensor.shape
pad_m = (multiple - m % multiple) % multiple
pad_n = (multiple - n % multiple) % multiple
if pad_m == 0 and pad_n == 0:
return tensor
return torch.nn.functional.pad(tensor, (0, pad_n, 0, pad_m))
# In a linear layer:
A = pad_to_multiple(A_original, 16) # pad M and K
B = pad_to_multiple(B_original, 16) # pad K and N
C = torch.mm(A, B)
C_trimmed = C[:M_original, :N_original] # trim padding from output
For transformer models, it is better to pad the embedding dimension and vocabulary size at initialization time rather than padding every forward pass.
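For example, a hypothetical way to round the vocabulary up once at model construction (the numbers and names are illustrative):
import torch.nn as nn

def round_up(x: int, multiple: int = 16) -> int:
    return ((x + multiple - 1) // multiple) * multiple

vocab_size = 50257                        # e.g. the GPT-2 tokenizer size
padded_vocab = round_up(vocab_size, 16)   # 50272 - output projection dims are now TC-friendly
embedding = nn.Embedding(padded_vocab, 4096)
# The extra rows cost a little memory; the corresponding logits are masked or ignored at the loss.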
Verifying Tensor Core Use in PyTorch
import torch
from torch.profiler import profile, ProfilerActivity
model = torch.nn.Linear(4096, 4096, dtype=torch.float16).cuda()
x = torch.randn(64, 4096, dtype=torch.float16, device='cuda')
with profile(activities=[ProfilerActivity.CUDA],
with_flops=True) as prof:
for _ in range(10):
y = model(x)
# Print top operations by CUDA time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# Look for: 'aten::mm' or 'aten::linear' - should show high FLOPS
# If TFLOPS shown is close to 312 (A100 FP16) or 989 (H100 FP16):
# Tensor Cores are active
Batch Dimension and Batched Matmul (bmm)
Batched matrix multiply (torch.bmm, torch.baddbmm) uses Tensor Cores if:
- All conditions above are met for each sub-matrix
- The batch dimension does not fragment the computation into tiny tiles
For batch sizes of 1 or very small (< 32), Tensor Core utilization often drops because the M and N tiles are too small to fill the SM. Fuse the batch dimension into M for small-batch inference:
# Small batch: [B, seq_len, d_model] -> [B * seq_len, d_model]
# This gives larger M dimension, better TC utilization
x_reshaped = x.view(-1, d_model) # merge batch and seq_len
y_reshaped = linear(x_reshaped)
y = y_reshaped.view(B, seq_len, -1)
Common Mistakes
:::danger Wrong data type is the most common cause of 0% Tensor Core utilization
The single most common Tensor Core bug: using float (FP32) inputs without enabling TF32 mode. In CUDA, passing FP32 data where a half WMMA fragment is expected at least fails at compile time; in PyTorch, float inputs to torch.mm without TF32 mode quietly fall back to CUDA cores with no warning. Always explicitly check A.dtype before performance-critical matrix operations and assert that it is torch.float16 or torch.bfloat16.
:::
:::danger All 32 threads in a warp must call WMMA functions together
WMMA operations are warp-collective. If any thread in the warp takes a different code path and skips the wmma::mma_sync call, the entire warp hangs or produces incorrect results. Never put wmma::mma_sync inside an if block that could be skipped by some threads in the warp. Handle boundary conditions by padding the matrices to Tensor Core multiples before the kernel, not with conditional logic inside the kernel.
:::
:::danger Fragment stride argument is not the tile size, it is the matrix leading dimension
The stride argument to wmma::load_matrix_sync is the leading dimension of the source matrix in memory, not the tile size. For a row-major matrix A of shape MxK loaded from shared memory with shared memory width SMEM_WIDTH, the stride is SMEM_WIDTH, not K. Getting this wrong produces no compile error but silently loads incorrect data into the fragment.
// WRONG - stride = WMMA_K (tile size)
wmma::load_matrix_sync(a_frag, smem_a_ptr, WMMA_K);
// RIGHT - stride = smem row width (leading dimension)
wmma::load_matrix_sync(a_frag, smem_a_ptr, K_smem_width);
:::
:::warning Accumulator fragments are not automatically initialized to zero
wmma::fill_fragment(acc_frag, 0.0f) must be called explicitly before the first wmma::mma_sync. Uninitialized fragments contain garbage values. The hardware does not reset accumulators between uses - the fragment retains whatever value was in the registers from the previous operation.
:::
:::warning Column-major B matrix layout usually gives better Tensor Core performance
The standard GEMM convention (as used in cuBLAS and CUTLASS) uses row-major A and column-major B. With column-major B, the K-dimension tiles of B are contiguous in memory, enabling efficient 128-bit loads from shared memory into Tensor Core fragments. Row-major B requires transposed shared memory layouts and extra shuffles. If you have a choice in how to store B, prefer column-major.
:::
:::warning Dimension padding affects correctness if accumulation wraps
When padding matrices to Tensor Core multiples, the padding elements must be zero (not garbage). A padding element in A multiplied by a real element in B contributes to the accumulator. If the padding is zero, this contribution is zero and the result is correct. If padding is garbage, the result is corrupted. Always use torch.zeros or cudaMemset for padding, never torch.empty or uninitialized cudaMalloc.
:::
Interview Questions and Answers
Q1: What is the WMMA API and how do you use it to perform a matrix multiply with Tensor Cores?
WMMA (Warp Matrix Multiply-Accumulate) is the CUDA C++ API in nvcuda::wmma namespace that exposes Tensor Core operations at the warp level. The fundamental abstraction is the fragment, a matrix tile distributed cooperatively across all 32 threads in a warp. You do not control which element each thread holds - the layout is opaque and architecture-specific. The workflow is: (1) declare fragment objects with template parameters specifying role (matrix_a, matrix_b, accumulator), tile dimensions (16x16x16 for fp16), data type (half or float), and layout (row_major/col_major); (2) call wmma::fill_fragment to initialize the accumulator to zero; (3) load tiles from shared memory into fragments using wmma::load_matrix_sync with the correct stride (the matrix's leading dimension, not the tile size); (4) call wmma::mma_sync to perform the Tensor Core D = A*B+C operation; (5) call wmma::store_matrix_sync to write the result back to shared or global memory. All WMMA calls are warp-collective - all 32 threads must call them with the same arguments.
Q2: What are the activation requirements for Tensor Cores on H100, and what happens if any are violated?
H100 Tensor Core activation requires three things simultaneously: correct data types (FP16, BF16, TF32, FP8, or INT8 - not plain FP32), dimension multiples of 16 for M, N, and K (multiples of 32 for K with INT8), and 16-byte memory alignment for all matrix pointers. If any requirement is violated, the operation falls back to regular CUDA cores silently - no error, no warning, no exception. The fallback runs at FP32 CUDA core throughput (~19.5 TFLOPS on A100), which is 16x slower than FP16 Tensor Core throughput (312 TFLOPS). The most common violation is FP32 input types. On Ampere and later, setting torch.backends.cuda.matmul.allow_tf32 = True enables automatic truncation of FP32 inputs to TF32 (10-bit mantissa), which activates Tensor Cores even for FP32 code at the cost of some precision.
Q3: How do you diagnose and fix zero Tensor Core utilization in Nsight Compute?
The primary metric is sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active. Zero means Tensor Cores are not active. The diagnostic process: first check data types - print the dtype of input tensors or check fragment type declarations in CUDA code; FP32 without TF32 mode is the most common cause. Second, verify dimension multiples - M, N, K must all be multiples of 16 for FP16/BF16, and K must be a multiple of 32 for some INT8 configurations. Third, check memory alignment using assert(((uintptr_t)ptr % 16) == 0). Fourth, examine the SASS with cuobjdump --dump-sass and look for HMMA (half-precision MMA), DMMA (double), or IMMA (integer) instructions - their absence confirms the compiler did not generate Tensor Core code. Finally, cross-reference with sm__pipe_fma_cycles_active - if this is high while tensor is zero, CUDA cores are active and the workload has fallen back.
Q4: What is the difference between WMMA, MMA PTX, and Triton for Tensor Core programming?
WMMA is the highest-level CUDA C++ API, using C++ templates and warp-collective fragment objects. It is suitable for learning and simple custom kernels but does not expose the register layout needed for aggressive optimization. MMA PTX is the intermediate assembly level - MMA instructions encoded directly in PTX, used in CUTLASS and hand-written high-performance kernels like FlashAttention. It gives full control over register packing, instruction scheduling, and pipeline stages but requires deep knowledge of architecture-specific register layouts and is architecture-specific (different instructions for Turing, Ampere, Hopper). Triton is a Python DSL where tl.dot() automatically maps to the correct Tensor Core instruction for the target GPU - WMMA on Turing/Ampere, WGMMA on Hopper. Triton handles register layout, packing, and software pipelining automatically, achieving near-CUTLASS performance for most operations. For production ML systems: use cuBLAS for standard GEMM, Triton for custom attention or fused operations, CUTLASS for custom epilogues, and WMMA/MMA PTX only when writing a GPU library.
Q5: Why must Tensor Core tile dimensions be multiples of 16, and what is the consequence of using non-multiples?
The 16x16x16 constraint comes from the physical wiring of the Tensor Core hardware. A single Tensor Core unit performs one 16x16x16 matrix multiply in one clock cycle, and the 32 threads of a warp cooperatively feed exactly 16*16 = 256 input elements to it. If your matrix dimensions are not multiples of 16, there is no partial Tensor Core operation - the hardware cannot execute a 14x16x16 multiply. The GPU must either pad the operation internally (it does not) or fall back to scalar CUDA cores. For dimensions that are not multiples of 16, the standard solution is explicit padding: allocate matrices padded to the next multiple of 16, fill the padding region with zeros (not garbage), perform the matrix multiply on the padded dimensions, and trim the output back to the original dimensions. The overhead of padding and trimming is usually negligible compared to the 10-16x throughput gain from activating Tensor Cores.
Q6: Explain the CUTLASS template library and when you would use it over WMMA code or cuBLAS.
CUTLASS is NVIDIA's open-source C++ template library for GEMM that generates highly optimized Tensor Core kernels by composing small building blocks: ThreadblockShape (the tile each block handles), WarpShape (the tile each warp handles), InstructionShape (the Tensor Core MMA tile), and an Epilogue (post-multiply operations like add bias or apply activation). CUTLASS typically matches cuBLAS performance because it uses the same optimization techniques: double-buffering (software pipelining of shared memory loads with MMA execution), optimal register file layout, and architecture-specific instruction sequences. The primary reason to use CUTLASS over cuBLAS is epilogue fusion: if you need to apply a non-standard activation function, add a residual connection, or perform quantization immediately after the matrix multiply, doing it in a separate kernel forces an extra global memory round-trip. CUTLASS fuses arbitrary epilogues into the GEMM kernel at no additional cost. Use cuBLAS when the standard GEMM interface suffices. Use CUTLASS when you need a non-standard epilogue but standard inputs and outputs. Use Triton when you need non-standard tiling, attention-like computation patterns, or rapid iteration.
Q7: What is TF32 and how does it enable Tensor Core usage for FP32 workloads?
TF32 (TensorFloat-32) is a 19-bit floating-point format introduced in Ampere that has the same 8-bit exponent as FP32 (same range) but only 10 bits of mantissa instead of 23 (lower precision, similar to FP16's mantissa width). TF32 is not directly programmable - it is an internal hardware mode. When TF32 mode is enabled (torch.backends.cuda.matmul.allow_tf32 = True, on by default from PyTorch 1.7 through 1.11 on Ampere GPUs and off by default since 1.12), the hardware automatically truncates FP32 inputs to TF32 format before feeding them to Tensor Cores, performs the matrix multiply in Tensor Cores with TF32 arithmetic, and produces FP32 outputs. The programmer sees FP32 inputs and FP32 outputs, but the computation uses Tensor Cores and runs at roughly 8x the speed of pure FP32 on A100. The precision loss is in the mantissa: TF32 has ~3 decimal digits of precision vs FP32's ~7. For neural network training this is acceptable (gradients are noisy anyway), but for scientific computing or financial applications requiring full FP32 precision, TF32 mode should be disabled with torch.backends.cuda.matmul.allow_tf32 = False.
Summary
Tensor Core programming is not difficult once you understand the contract: the hardware provides a 10-16x throughput multiplier for matrix multiply, and in return it requires exact compliance with data type, dimension, and alignment constraints. There are no partial failures - either all requirements are met and you get Tensor Core speed, or one requirement fails and you silently fall back to CUDA core speed.
The practical hierarchy of tools, from highest to lowest level of abstraction:
- PyTorch autocast + TF32 mode: transparent, covers 90% of training use cases
- Triton tl.dot(): for custom attention and non-standard patterns, near-optimal performance
- CUTLASS: for custom epilogues with standard GEMM, library-quality performance
- WMMA API: for learning and simple custom kernels
- MMA PTX: for library development, achieving the final few percent of peak throughput
The diagnostic tool is always Nsight Compute: sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active. Zero percent means the requirements are not met. Systematically check dtype first (most common failure), then dimensions, then alignment. The engineer in the opening scenario lost a week to a two-line fix. With this lesson, you will fix it in ten minutes.
