Memory Coalescing and Bank Conflicts
Reading time: ~40 min - Interview relevance: Very High - Target roles: CUDA Developer, Kernel Engineer, ML Infrastructure
A coalesced memory access reads 128 bytes in one transaction. The identical data read with a stride of N columns reads it in 32 separate transactions. The math is the same. The throughput is 32x different.
The 8x Mystery
Two matrix kernels. One copies a matrix; the other writes its transpose. You would expect them to run at the same speed - the number of reads and writes is identical.
Profile them on an H100:
Matrix copy (row order): 450 GB/s (13% of peak)
Matrix transpose (naive): 55 GB/s (1.6% of peak)
8x throughput difference. Same GPU. Same data. Same number of bytes transferred.
The difference is entirely in the memory access pattern. The copy kernel has every warp accessing 32 consecutive floats in memory - one 128-byte transaction. The naive transpose has every warp accessing 32 floats spread across 32 different rows of the matrix - 32 separate 128-byte transactions, most of which bring in data that the warp does not need.
This is memory coalescing. It is not a micro-optimization or a GPU-specific curiosity. It is the most impactful single optimization in the majority of memory-bound CUDA kernels. Understanding it changes how you design every kernel you write.
Why This Exists: Cache Lines and Transaction Granularity
GPUs load global memory in 128-byte chunks aligned to 128-byte boundaries. This granularity comes from the same principle as CPU cache lines: memory controllers are far more efficient when serving sequential addresses. The hardware that connects the compute die to HBM is a collection of wide, high-speed buses, and they move data most efficiently when it is sequential.
When a warp of 32 threads issues memory loads, the hardware groups them. If 32 threads load 32 consecutive floats starting at a 128-byte aligned address, the hardware recognizes this as a single coalesced access and issues one 128-byte memory transaction. One transaction, 128 bytes, all threads satisfied.
If those same 32 threads load 32 floats at addresses that are each separated by 4096 bytes (the row pitch of a 1024-column float matrix), the hardware must issue 32 separate transactions of 128 bytes each - 4096 bytes of transactions to move 128 bytes of useful data. The utilization of each transaction is 3.1%.
The L1 and L2 caches can absorb some of the redundant traffic, but they cannot overcome the fundamental problem: at 32 separate memory requests per warp instruction, the memory subsystem carries 32x more load than it needs to.
Memory Coalescing: The Rules
A warp access is fully coalesced when all 32 threads access addresses that fall within the same aligned 128-byte region. In practice, this means:
- Thread i in the warp accesses element base + i: consecutive float addresses, one 128-byte transaction per warp.
- Thread i accesses element base + i * stride: strided addresses. Each element is 4 bytes (one float). For stride = 32, the 32 threads span 32 x 32 x 4 = 4096 bytes = 32 cache lines.
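To make the rules concrete, here is a minimal sketch of the two patterns (kernel names and the stride parameter are illustrative, not from the original text):
__global__ void load_coalesced(const float* __restrict__ in,
                               float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];  // thread i -> element i: one 128 B line per warp
}

// `in` must hold at least n * stride floats
__global__ void load_strided(const float* __restrict__ in,
                             float* __restrict__ out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // thread i -> element i * stride: up to 32 lines per warp for stride >= 32
    if (i < n) out[i] = in[(size_t)i * stride];
}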
The coalescing rule for different data types:
| Data type | Size | Threads per warp | Bytes per coalesced load | Cache lines |
|---|---|---|---|---|
| float32 | 4B | 32 | 128 B | 1 |
| float64 | 8B | 32 | 256 B | 2 |
| float16 | 2B | 32 | 64 B | 0.5 (shares cache line with adjacent warp) |
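One consequence for fp16: a warp of scalar __half loads fills only half a cache line, so kernels often pack two halves per thread. A minimal sketch (assuming an even element count; the kernel name is illustrative):
#include <cuda_fp16.h>
// One __half per thread gives a warp only 64 B per load (half a cache line).
// Loading __half2 (4 B) per thread restores a full 128 B line per warp.
__global__ void copy_half2(const __half2* __restrict__ in,
                           __half2* __restrict__ out, int n2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // n2 = count of half2 pairs
    if (i < n2) out[i] = in[i];  // 32 threads x 4 B = 128 B per warp
}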
The Matrix Transpose Problem
Matrix transpose is the canonical example of the coalescing problem because it makes both pathologies visible in a single kernel: one memory operation is coalesced, the other is not.
Transpose: given matrix A stored row-major, compute B = A^T, where B[j][i] = A[i][j].
Naive Transpose
// Each thread (i, j) reads A[i][j] and writes to B[j][i]
__global__ void transpose_naive(
const float* __restrict__ A,
float* __restrict__ B,
int rows, int cols
) {
int col = blockIdx.x * blockDim.x + threadIdx.x; // output col
int row = blockIdx.y * blockDim.y + threadIdx.y; // output row
if (row < rows && col < cols) {
// READ: A[row][col] - threads in a warp have same row, consecutive col
// -> col increases by 1 per thread -> consecutive addresses -> COALESCED
float val = A[row * cols + col];
        // WRITE: B[col][row] - index = col * rows + row
        // Within a warp, row is fixed and col increases by 1 per thread:
        //   thread 0 writes B[col * rows + row]
        //   thread 1 writes B[(col + 1) * rows + row] = previous address + rows floats
        // Stride between consecutive threads = rows -> NOT COALESCED
B[col * rows + row] = val;
}
}
Let's work out the access pattern carefully. With a 2D thread block of shape (32, 32) and blockIdx = (0, 0):
- threadIdx.x varies 0..31, threadIdx.y varies 0..31
- For threads with the same threadIdx.y (same row in the block, one warp): col = threadIdx.x varies 0..31
The read A[row * cols + col]: for fixed row and varying col from 0 to 31, this is A[row*cols + 0], A[row*cols + 1], ..., A[row*cols + 31] - consecutive addresses. Coalesced.
The write B[col * rows + row]: for fixed row and varying col from 0 to 31, this is B[0*rows + row], B[1*rows + row], ..., B[31*rows + row]. The stride between consecutive threads is rows (not 1). For a 1024-row matrix, each consecutive write is 1024 floats = 4096 bytes apart. Not coalesced - 32 separate transactions.
Alternatively, if we swap the thread-to-work assignment (thread reads by column, writes by row):
- Read is strided (not coalesced)
- Write is sequential (coalesced)
Either way, one of the two operations is not coalesced. This is why naive transpose is slow.
Profiling the Naive Transpose
// What Nsight Compute reports for naive transpose on a 4096x4096 matrix:
//
// l2_global_load_bytes:  64 MB (4096*4096*4 - reads are coalesced)
// l2_global_store_bytes: 2.1 GB (64 MB written 32x over due to stride)
// memory_throughput: 42 GB/s (vs 3350 GB/s peak)
// Memory efficiency (stores): 3.1%
The writes are 32x over-fetched: the naive kernel's reads are coalesced, but each strided store touches a 128-byte line of which only 4 bytes carry useful data. The rest is wasted bandwidth.
The Fix: Shared Memory as a Coalescing Buffer
The solution uses shared memory as an intermediary. The insight is:
- Load a 32x32 tile from global memory A row by row - coalesced reads
- Store into shared memory in the same row-major order
- Read from shared memory column by column to write to B - column reads from smem stay cheap on-chip (conflict-free once we add the +1 padding below)
- Write to global memory B in row-major order - coalesced writes
The transpose happens inside shared memory - free because smem is on-chip and random access is fast.
#define TILE_DIM 32
#define BLOCK_ROWS 8 // 32x8 thread block: each thread processes TILE_DIM / BLOCK_ROWS = 4 rows of the 32-col tile
__global__ void transpose_coalesced(
const float* __restrict__ A,
float* __restrict__ B,
int rows, int cols
) {
// +1 padding to avoid bank conflicts on column reads from smem
__shared__ float tile[TILE_DIM][TILE_DIM + 1];
// Read tile from A: coalesced because consecutive threads have consecutive col addresses
int x = blockIdx.x * TILE_DIM + threadIdx.x;
int y = blockIdx.y * TILE_DIM + threadIdx.y;
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
if (y + j < rows && x < cols) {
// A[(y+j)][x]: for fixed (y+j), consecutive x -> coalesced global read
tile[threadIdx.y + j][threadIdx.x] = A[(y + j) * cols + x];
}
}
__syncthreads();
// Write tile to B: swap block indices so B receives the transposed tile
x = blockIdx.y * TILE_DIM + threadIdx.x; // note: blockIdx.y for columns of B
y = blockIdx.x * TILE_DIM + threadIdx.y; // note: blockIdx.x for rows of B
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
if (y + j < cols && x < rows) {
// Read from smem: tile[threadIdx.x][threadIdx.y + j]
// This is a column-major read of smem, which would cause bank conflicts
// without the +1 padding (see Bank Conflicts section below)
B[(y + j) * rows + x] = tile[threadIdx.x][threadIdx.y + j];
// Write to B: for fixed (y+j), consecutive x -> coalesced global write
}
}
}
The result: both the global read and global write are now coalesced. The transpose operation itself happens in shared memory at ~32 cycle latency with no global memory penalty.
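A typical launch for this kernel might look like this (a sketch; d_A and d_B are assumed device pointers):
dim3 block(TILE_DIM, BLOCK_ROWS);                    // 32 x 8 = 256 threads
dim3 grid((cols + TILE_DIM - 1) / TILE_DIM,          // tiles across columns of A
          (rows + TILE_DIM - 1) / TILE_DIM);         // tiles down rows of A
transpose_coalesced<<<grid, block>>>(d_A, d_B, rows, cols);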
Why This Works: Memory Layout Diagram
A (row-major, 4x4 example):
[0 1 2 3 ] <- row 0
[4 5 6 7 ] <- row 1
[8 9 10 11] <- row 2
[12 13 14 15] <- row 3
Linear memory: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Naive transpose write to B: thread (0,0) writes B[0], thread (1,0) writes B[4], thread (2,0) writes B[8]...
-> writes land at B[0], B[rows], B[2*rows]... stride = rows. NOT COALESCED.
Tiled approach:
Step 1: Load tile into smem[row][col] - reads are row-by-row, COALESCED
smem = [0 1 2 3 ]
[4 5 6 7 ]
[8 9 10 11]
[12 13 14 15]
Step 2: Read smem in transposed order, write B sequentially - writes are COALESCED
    Row 0 of B = column 0 of A = [0, 4, 8, 12]
    Thread 0 reads smem[0][0]=0, writes B[0]
    Thread 1 reads smem[1][0]=4, writes B[1]
    Thread 2 reads smem[2][0]=8, writes B[2]
    Each thread reads smem[threadIdx.x][row] (a column-major smem read)
    and writes B in row-major order (coalesced write)
Bank Conflicts: Deep Dive
Shared memory is fast because it is on-chip SRAM. But it is organized into 32 memory banks, each 4 bytes wide. The bank number for any 4-byte word at byte offset addr in shared memory is bank = (addr / 4) % 32.
Two threads accessing different addresses in the same bank in the same instruction have a bank conflict. The hardware serializes conflicting accesses: a 2-way conflict runs in 2 cycles, a 32-way conflict runs in 32 cycles. Zero conflicts: all banks distinct, 1 cycle.
One important exception: broadcast. If all threads in a warp read the exact same address, the hardware broadcasts it in a single cycle - no conflict. This is why constant memory works well for values read uniformly by all threads.
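A minimal sketch of the broadcast path (kernel and buffer names are illustrative):
__global__ void broadcast_demo(const float* __restrict__ in, float* __restrict__ out) {
    __shared__ float s[32];
    if (threadIdx.x < 32) s[threadIdx.x] = in[threadIdx.x];
    __syncthreads();
    // Every thread reads the SAME word: the hardware broadcasts it in one
    // cycle - no conflict even though all threads hit bank 0.
    out[threadIdx.x] = s[0];
}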
Which Access Patterns Cause Conflicts
// Example: __shared__ float smem[32][32]
// Row read: smem[row][threadIdx.x] for varying threadIdx.x
// Thread i reads smem[row][i], address = row*32*4 + i*4
// Bank = (row*32 + i) % 32 = i % 32
// All different banks -> NO CONFLICT
// Column read: smem[threadIdx.x][col] for varying threadIdx.x
// Thread i reads smem[i][col], address = i*32*4 + col*4
// Bank = (i*32 + col) % 32 = col % 32
// ALL THREADS HIT BANK (col % 32) -> 32-WAY CONFLICT
// Stride-2 read: smem[threadIdx.x * 2]
// Thread i reads smem[i*2], address = i*2*4
// Bank = (i*2) % 32
// Thread 0: bank 0, Thread 1: bank 2, ..., Thread 15: bank 30
// Thread 16: bank (32) % 32 = 0, Thread 17: bank 2 ...
// -> 2-WAY CONFLICTS for all banks (16 threads share each bank)
// Stride-32 read: smem[threadIdx.x * 32]
// Thread i reads smem[i*32], address = i*32*4
// Bank = (i*32) % 32 = 0 for ALL i
// -> 32-WAY CONFLICT (unless all read same address = broadcast)
The pattern: a stride of s elements causes a gcd(s, 32)-way conflict (when the conflicting addresses are distinct) or a broadcast (when all addresses are identical).
The Padding Fix
For the column-read problem (stride = matrix width = 32), the classic fix is adding one extra element to each row of the shared memory declaration. This shifts each row's starting bank by 1, breaking the alignment that causes all columns to map to the same bank.
// WITHOUT padding - column reads cause 32-way conflicts
__shared__ float smem[32][32];
// Row 0 starts at bank 0
// Row 1 starts at (32 * 4 / 4) % 32 = 32 % 32 = bank 0
// Row 2 starts at bank 0
// All rows start at bank 0
// smem[i][0] for all i: bank = (i * 32 + 0) % 32 = 0 -> 32-way conflict on column 0
// WITH padding - column reads are conflict-free
__shared__ float smem[32][32 + 1]; // 33 elements per row
// Row 0 starts at byte 0, bank 0
// Row 1 starts at byte 33*4 = 132, bank (132/4)%32 = 33%32 = bank 1
// Row 2 starts at byte 66*4 = 264, bank (264/4)%32 = 66%32 = bank 2
// Row i starts at bank i%32
// smem[i][0] for all i: bank = (i * 33 + 0) % 32 = i * 33 % 32
// Consecutive i values -> different banks -> NO CONFLICT
Why does 33 work? Because gcd(33, 32) = 1. Since 33 and 32 are coprime, the bank assignments for smem[0][0], smem[1][0], smem[2][0], ..., smem[31][0] cycle through all 32 banks exactly once before repeating.
The cost: 1 extra float per row = 32 extra floats per 32x32 tile = 128 extra bytes. For a 32x32 tile of 4096 bytes, this is a 3.1% overhead in shared memory usage. Almost always worth it.
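You can sanity-check the bank mapping with plain host-side arithmetic - a small sketch (the helper is illustrative):
#include <cstdio>

// Bank of smem[i][0] for a row width of `width` floats: (i * width) % 32
void print_column_banks(int width) {
    printf("width %2d:", width);
    for (int i = 0; i < 32; i++) printf(" %2d", (i * width) % 32);
    printf("\n");
}

int main() {
    print_column_banks(32);  // all 0 -> 32-way conflict on column reads
    print_column_banks(33);  // 0, 1, 2, ..., 31 -> conflict-free
    return 0;
}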
Complete Benchmark: Coalesced vs Strided vs Transposed
#include <cuda_runtime.h>
#include <stdio.h>
#define TILE_DIM 32
#define BLOCK_ROWS 8
// Benchmark 1: Simple row-major copy (fully coalesced read + write)
__global__ void copy_coalesced(
const float* __restrict__ in,
float* __restrict__ out,
int rows, int cols
) {
int x = blockIdx.x * TILE_DIM + threadIdx.x;
int y = blockIdx.y * TILE_DIM + threadIdx.y;
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
if (y + j < rows && x < cols) {
out[(y + j) * cols + x] = in[(y + j) * cols + x];
}
}
}
// Benchmark 2: Strided read - thread i reads column i (non-coalesced read)
__global__ void copy_strided(
const float* __restrict__ in,
float* __restrict__ out,
int rows, int cols
) {
int x = blockIdx.x * TILE_DIM + threadIdx.x;
int y = blockIdx.y * TILE_DIM + threadIdx.y;
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
if (y + j < rows && x < cols) {
            // Read column-major (strided): in[x * cols + (y+j)]
            // Thread 0 reads in[x*cols + ...], thread 1 reads in[(x+1)*cols + ...]
            // Stride between consecutive threads = cols floats -> NOT COALESCED
            // (assumes a square matrix, so x < cols also keeps the read in bounds)
            out[(y + j) * cols + x] = in[x * cols + (y + j)];
            //                        ^^^^^^^^^^^^^^^^^^^^^^ STRIDED read
}
}
}
// Benchmark 3: Naive transpose (coalesced read, non-coalesced write)
// Uses the same 32x8 launch shape as the other kernels, so each thread
// covers TILE_DIM / BLOCK_ROWS = 4 elements of its tile
__global__ void transpose_naive(
    const float* __restrict__ in,
    float* __restrict__ out,
    int rows, int cols
) {
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (y + j < rows && x < cols) {
            // Read in[(y+j)*cols + x]: for fixed row, varying x -> coalesced
            // Write out[x*rows + (y+j)]: stride = rows between consecutive
            // threads -> NOT COALESCED
            out[x * rows + (y + j)] = in[(y + j) * cols + x];
        }
    }
}
// Benchmark 4: Coalesced transpose with shared memory and padding
__global__ void transpose_shared(
const float* __restrict__ in,
float* __restrict__ out,
int rows, int cols
) {
__shared__ float tile[TILE_DIM][TILE_DIM + 1]; // +1 prevents bank conflicts
int x = blockIdx.x * TILE_DIM + threadIdx.x;
int y = blockIdx.y * TILE_DIM + threadIdx.y;
// Coalesced read from global memory into smem
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
if (y + j < rows && x < cols) {
tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * cols + x];
}
}
__syncthreads();
// Swap block index roles for transposed output
x = blockIdx.y * TILE_DIM + threadIdx.x;
y = blockIdx.x * TILE_DIM + threadIdx.y;
// Coalesced write to global memory, reading smem in transposed order
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
if (y + j < cols && x < rows) {
out[(y + j) * rows + x] = tile[threadIdx.x][threadIdx.y + j];
// ^^^^^^^^^^^^^ column-major smem read
// conflict-free due to +1 padding
}
}
}
// Timing helper
float time_kernel(void (*kernel_fn)(const float*, float*, int, int),
                  dim3 grid, dim3 block,
                  const float* in, float* out, int rows, int cols,
                  int n_iters = 100) {
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
    // Warm up with the kernel under test
    kernel_fn<<<grid, block>>>(in, out, rows, cols);
cudaDeviceSynchronize();
cudaEventRecord(start);
for (int i = 0; i < n_iters; i++) {
kernel_fn<<<grid, block>>>(in, out, rows, cols);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms;
cudaEventElapsedTime(&ms, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
return ms / n_iters;
}
void run_benchmark(int rows, int cols) {
printf("\n=== %dx%d matrix ===\n", rows, cols);
size_t N = rows * cols;
float *d_in, *d_out;
cudaMalloc(&d_in, N * sizeof(float));
cudaMalloc(&d_out, N * sizeof(float));
// Initialize
float* h_in = (float*)malloc(N * sizeof(float));
for (size_t i = 0; i < N; i++) h_in[i] = (float)i;
cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);
dim3 block(TILE_DIM, BLOCK_ROWS);
dim3 grid((cols + TILE_DIM - 1) / TILE_DIM,
(rows + TILE_DIM - 1) / TILE_DIM);
int n_iters = 100;
float bytes = 2.0f * N * sizeof(float); // read + write
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
auto bench = [&](const char* name, auto kernel) {
        kernel<<<grid, block>>>(d_in, d_out, rows, cols);  // warm up
        cudaDeviceSynchronize();
        cudaEventRecord(start);
for (int i = 0; i < n_iters; i++) {
kernel<<<grid, block>>>(d_in, d_out, rows, cols);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms;
cudaEventElapsedTime(&ms, start, stop);
ms /= n_iters;
float bw = bytes / ms / 1e6f; // GB/s
printf("%-30s %6.3f ms %6.1f GB/s\n", name, ms, bw);
};
bench("copy_coalesced", copy_coalesced);
bench("copy_strided (non-coalesced)", copy_strided);
bench("transpose_naive", transpose_naive);
bench("transpose_shared (coalesced)", transpose_shared);
cudaFree(d_in); cudaFree(d_out); free(h_in);
cudaEventDestroy(start); cudaEventDestroy(stop);
}
int main() {
run_benchmark(4096, 4096);
run_benchmark(8192, 8192);
return 0;
}
Expected output on an H100 SXM5:
=== 4096x4096 matrix ===
copy_coalesced 0.214 ms 588.4 GB/s
copy_strided (non-coalesced) 1.823 ms 68.9 GB/s <- 8.5x slower
transpose_naive 1.641 ms 76.6 GB/s <- 7.7x slower
transpose_shared (coalesced) 0.281 ms 447.9 GB/s <- nearly matches copy
=== 8192x8192 matrix ===
copy_coalesced 0.843 ms 598.1 GB/s
copy_strided (non-coalesced) 7.104 ms 71.1 GB/s <- 8.4x slower
transpose_naive 6.531 ms 77.3 GB/s <- 7.7x slower
transpose_shared (coalesced) 1.092 ms 461.6 GB/s
Stride Analysis: Which Strides Cause Conflicts
For shared memory bank conflicts, a stride of s floats causes gcd(s, 32)-way conflicts. Let's work through the cases:
| Stride (floats) | gcd(stride, 32) | Conflict level | Notes |
|---|---|---|---|
| 1 | 1 | None | Ideal - consecutive |
| 2 | 2 | 2-way | Every other bank shared |
| 3 | 1 | None | Coprime to 32 |
| 4 | 4 | 4-way | |
| 5 | 1 | None | Coprime to 32 |
| 8 | 8 | 8-way | |
| 16 | 16 | 16-way | Severe |
| 32 | 32 | 32-way | Worst case |
| 33 | 1 | None | Why padding works |
Odd strides are always conflict-free. Even strides cause conflicts proportional to how many factors of 2 they share with 32.
// Demonstrating stride-related conflict patterns
__global__ void smem_stride_test(float* out, int stride) {
__shared__ float smem[1024];
    // Initialize the whole array, not just the first blockDim.x entries
    for (int i = threadIdx.x; i < 1024; i += blockDim.x)
        smem[i] = (float)i;
__syncthreads();
// Access with specified stride - may cause bank conflicts
int idx = (threadIdx.x * stride) % 1024;
out[threadIdx.x] = smem[idx];
// For stride=32: all 32 threads hit bank 0 -> 32-way conflict
// For stride=33: threads hit banks 0,33%32=1,66%32=2,...31 -> no conflict
// For stride=1: threads hit banks 0,1,2,...31 -> no conflict
}
Python Benchmark and Visualization
import torch
import time
import numpy as np
import matplotlib.pyplot as plt
def benchmark_access_patterns(sizes: list[int]) -> dict:
"""
Benchmark different memory access patterns using PyTorch on GPU.
Demonstrates coalescing effects at the PyTorch abstraction level.
"""
results = {"sizes": sizes, "row_copy": [], "col_copy": [], "transpose_naive": [], "transpose_contiguous": []}
for N in sizes:
A = torch.randn(N, N, device="cuda", dtype=torch.float32)
n_iters = 50
def bench(fn):
# Warm up
for _ in range(3):
fn()
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(n_iters):
fn()
torch.cuda.synchronize()
return (time.perf_counter() - t0) / n_iters * 1000 # ms
# Row-major copy: A.contiguous() - coalesced reads and writes
row_ms = bench(lambda: A.clone())
# Column-major copy: accessing every column of every row
# A[:, i] for all i - each column access is a strided gather
# PyTorch's as_strided allows us to simulate non-coalesced patterns
A_col = A.t() # transposed view - non-contiguous, strided access
col_ms = bench(lambda: A_col.contiguous()) # forces a strided read
        # A.t() is zero-copy (it only swaps strides), so there is nothing to
        # time for the view itself - the "naive transpose" cost IS the strided
        # read already measured as col_ms
        # Transpose: actually materialize in memory (coalesced via CUDA kernel)
        trans_real_ms = bench(lambda: A.t().contiguous())
        bytes_moved = 2 * N * N * 4  # read + write
        results["row_copy"].append(bytes_moved / row_ms / 1e6)
        results["col_copy"].append(bytes_moved / col_ms / 1e6)
        results["transpose_naive"].append(bytes_moved / col_ms / 1e6)  # == strided copy
results["transpose_contiguous"].append(bytes_moved / trans_real_ms / 1e6)
print(f"N={N:5d}: row_copy={bytes_moved/row_ms/1e6:.0f} GB/s "
f"col_copy={bytes_moved/col_ms/1e6:.0f} GB/s "
f"transpose={bytes_moved/trans_real_ms/1e6:.0f} GB/s")
return results
def demonstrate_bank_conflicts() -> None:
"""
Show that non-contiguous tensor operations incur extra overhead,
analogous to shared memory bank conflicts.
The real bank conflict test requires custom CUDA kernels,
but we can observe the coalescing effect at the PyTorch level.
"""
N = 4096
A = torch.randn(N, N, device="cuda", dtype=torch.float32)
# Contiguous row-major (stride 1 in inner dim) - coalesced
row_t = A.stride()
print(f"Row-major strides: {row_t}") # (N, 1)
# Column-major view (stride N in inner dim) - non-coalesced
A_col = A.t()
col_t = A_col.stride()
print(f"Col-major strides: {col_t}") # (1, N)
# Benchmark materializing each
n_iters = 100
torch.cuda.synchronize()
t0 = time.perf_counter()
    for _ in range(n_iters):
        B = A.clone()  # A is already contiguous; .contiguous() would be a no-op
torch.cuda.synchronize()
row_ms = (time.perf_counter() - t0) / n_iters * 1000
t0 = time.perf_counter()
for _ in range(n_iters):
B = A_col.contiguous()
torch.cuda.synchronize()
col_ms = (time.perf_counter() - t0) / n_iters * 1000
    n_bytes = 2 * N * N * 4  # avoid shadowing the builtin `bytes`
    print(f"\nContiguous copy: {n_bytes/row_ms/1e6:.0f} GB/s")
    print(f"Strided copy:    {n_bytes/col_ms/1e6:.0f} GB/s")
    print(f"Ratio: {col_ms/row_ms:.1f}x slower for strided")
if __name__ == "__main__":
sizes = [512, 1024, 2048, 4096]
benchmark_access_patterns(sizes)
demonstrate_bank_conflicts()
Detecting Coalescing Issues with Nsight Compute
Profiling tools are essential for diagnosing memory access problems. These are the key metrics to check:
# Check memory transaction efficiency
ncu --metrics \
l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,\
l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum \
./my_kernel
# The sectors/requests ratio tells you coalescing efficiency.
# A sector is 32 bytes, so one 128-byte warp request = 4 sectors minimum:
# sectors/requests = 4:  perfect coalescing for 4-byte accesses
# sectors/requests = 32: completely uncoalesced (one sector per thread)
# Check shared memory bank conflicts
ncu --metrics \
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum,\
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum \
./my_kernel
# A non-zero bank_conflicts count means you have conflicts
# Divide by total_transactions to get average conflict level
The sectors-per-request ratio is the single most useful number for diagnosing coalescing (a sector is 32 bytes, so a 128-byte warp request is at least 4 sectors for 4-byte accesses):
- 4.0: perfectly coalesced (each warp request maps to four 32-byte sectors)
- 8.0-16.0: partially strided or scattered access
- 32.0: completely uncoalesced (every thread touches its own sector)
Production Notes: Coalescing in ML Kernels
Attention mechanism: The QK^T computation involves transposing K. A naive implementation loads K in column order (non-coalesced). Flash Attention's tiling explicitly ensures K is loaded in transposed-but-coalesced order by loading K^T tiles and reading them in row order.
Convolution: Input activations are typically NCHW (batch, channel, height, width). A convolution sliding window reads in non-contiguous patterns. The cuDNN and PyTorch backends preprocess to NHWC or im2col formats specifically to ensure coalesced access.
Embedding lookup: In NLP models, embedding lookup is a gather operation: different threads look up different rows of the embedding table. Unless the lookup indices happen to be sorted and consecutive (rare in practice), this is an inherently non-coalesced access. Solutions: (1) sort indices before lookup, (2) use smaller embeddings that fit in L2 cache, (3) use vectorized loads to amortize the non-coalesced overhead.
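As a sketch of option (3): one block gathers one embedding row, and float4 loads keep the per-row traffic wide and coalesced even though the rows themselves are scattered (kernel and parameter names are hypothetical; assumes dim is a multiple of 4):
// table: [vocab, dim] reinterpreted as float4 rows of dim4 = dim / 4
// out:   [n, dim4]; indices: [n]; launch with grid = n blocks
__global__ void gather_rows_vec4(const float4* __restrict__ table,
                                 const int* __restrict__ indices,
                                 float4* __restrict__ out, int dim4) {
    int row = indices[blockIdx.x];  // the embedding row this block copies
    for (int c = threadIdx.x; c < dim4; c += blockDim.x) {
        // consecutive threads read consecutive float4s -> coalesced within the row
        out[(size_t)blockIdx.x * dim4 + c] = table[(size_t)row * dim4 + c];
    }
}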
Batch matrix multiply: when computing a batched matmul of shape (B, M, N) x (B, N, K), make sure consecutive threads walk the innermost matrix dimension while the batch dimension maps to blocks - then each warp reads consecutive elements within one matrix and the accesses are coalesced. If consecutive threads are scattered across batches, every access is strided by the size of an entire matrix. A sketch of this mapping follows.
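A minimal sketch of the thread-to-data mapping (an illustrative elementwise kernel, not a full matmul):
// x: (B, M, N) flattened; MN = M * N. blockIdx.y selects the batch,
// threadIdx.x walks the innermost dimension -> coalesced within each matrix.
__global__ void batched_scale(float* __restrict__ x, float alpha, int B, int MN) {
    int batch = blockIdx.y;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (batch < B && i < MN)
        x[(size_t)batch * MN + i] *= alpha;
}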
Common Mistakes
:::danger Column-Major Access on Row-Major Arrays Reading a matrix column-by-column when it is stored row-by-row creates stride-N access (where N is the matrix width). For a 1024-column matrix, each consecutive thread's memory address is 1024 floats = 4096 bytes apart. The hardware issues 32 separate 128-byte transactions per warp instruction instead of 1. This is the single most common source of poor memory bandwidth utilization in CUDA kernels.
// WRONG - consecutive threads step down a column: stride = cols, non-coalesced
float val = matrix[threadIdx.x * cols + col]; // reading column-major
// RIGHT - consecutive threads step along a row: stride = 1, coalesced
float val = matrix[row * cols + threadIdx.x]; // reading row-major
:::
:::warning Forgetting the +1 Padding for Shared Memory Column reads from a 32-wide shared memory array always cause 32-way bank conflicts. This is a 32x slowdown on the shared memory read. The fix (one extra element per row) costs 3% extra shared memory but eliminates all conflicts.
// WRONG - 32-way bank conflicts on column reads
__shared__ float smem[32][32];
// RIGHT - conflict-free column reads
__shared__ float smem[32][32 + 1];
// or equivalently:
__shared__ float smem[32][33];
Always add padding when your kernel reads shared memory in column order (smem[i][constant] with varying i). :::
:::danger Assuming PyTorch's .t() is Coalesced
tensor.t() in PyTorch returns a view with swapped strides - it does not move any data. All subsequent operations on the transposed tensor therefore see a non-contiguous, strided layout, which is non-coalesced in global memory. Calling .t().contiguous() forces the actual transpose, which PyTorch implements with a coalesced shared-memory kernel internally. Some operations can consume the transposed view directly (matmul dispatches to cuBLAS with transpose flags), but others materialize a contiguous copy implicitly when needed - and this silent copy can be expensive.
A = torch.randn(4096, 4096, device="cuda")
# This is ZERO cost - just a metadata change
B = A.t()
# This forces an actual memory operation (good - uses coalesced kernel)
C = A.t().contiguous()
# Watch out: in a hot training loop, frequent .t().contiguous() calls
# appear as unexpected memory traffic in the profile
:::
:::warning Misidentifying the Conflict Source When profiling shows high shared memory bank conflicts, engineers often immediately add +1 padding without understanding which access pattern is conflicting. There are two distinct access patterns to check: the store into shared memory (when loading from global) and the load from shared memory (when writing to global or computing). Only the conflicting one needs the fix. Adding unnecessary padding wastes shared memory and can reduce occupancy.
Profile with:
ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum,\
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum ./my_kernel
_op_ld vs _op_st tells you whether loads or stores are conflicting.
:::
Interview Q&A
Q1: What is a coalesced memory access and why does it matter for GPU performance?
A coalesced memory access is when all 32 threads in a warp access addresses that fall within a single 128-byte aligned cache line, allowing the hardware to serve all 32 threads with a single memory transaction. The GPU's memory controller is designed to handle these coalesced transactions efficiently at maximum bandwidth.
When threads access non-consecutive addresses (strided or scattered), the hardware must issue multiple transactions - one per cache line touched. In the worst case (stride = matrix width), 32 threads issue 32 separate transactions to serve 32 individual floats. Each transaction fetches 128 bytes but only 4 bytes are used. Memory bandwidth utilization drops from 100% to 3.1%.
On an H100, peak global memory bandwidth is 3.35 TB/s. A fully coalesced kernel can approach this. A fully uncoalesced kernel might achieve 70-100 GB/s - a 30-40x difference on the same hardware.
Q2: How does a matrix transpose cause non-coalesced access, and how does shared memory fix it?
In a naive transpose, thread (row, col) reads A[row][col] and writes B[col][row]. When we look at 32 threads in a warp that share the same row but span columns 0 to 31: the reads from A are to A[row][0..31] - consecutive addresses, coalesced. The writes to B are to B[0][row], B[1][row], ..., B[31][row]. B is stored row-major, so B[col][row] = B_ptr + col * num_rows + row. Consecutive threads write to addresses separated by num_rows floats - a stride of num_rows - completely non-coalesced.
The shared memory fix: load a 32x32 tile from A row-by-row (coalesced reads), store it in shared memory. Then swap which block dimension goes to which output index, and read from shared memory in transposed order (column-major from smem, which is cheap on-chip) to write to B row-major (coalesced writes). The transpose happens in shared memory at ~32 cycle cost instead of incurring non-coalesced global memory transactions.
Q3: What is a shared memory bank conflict and how do you resolve it with padding?
Shared memory is physically divided into 32 banks, each 4 bytes wide. Word i is in bank i % 32. A bank conflict occurs when two or more threads in the same warp instruction access different addresses in the same bank, forcing the hardware to serialize those accesses.
The most common cause: reading a 32-wide shared memory array column by column. Thread i accesses smem[i][col]. Bank of smem[i][col] = (i * 32 + col) % 32 = col % 32 for all i. Every thread hits the same bank - a 32-way conflict, 32x serialization.
The fix is padding: declare smem[32][33] instead of smem[32][32]. Now thread i accesses smem[i][col] at bank (i * 33 + col) % 32. Since 33 and 32 are coprime (gcd = 1), consecutive threads hit distinct banks. Zero conflicts. The cost is one extra float per row = 3.1% more shared memory.
Q4: When would you choose to change the data layout rather than using shared memory tiling to fix a coalescing problem?
If the non-coalesced access pattern is fundamental to the algorithm and the transposition cost is a one-time preprocessing cost (not inside the training loop), change the layout. Example: if you always access a matrix column-by-column, storing it in column-major order (transposed) makes those accesses coalesced without needing shared memory overhead.
For operations on persistent data (model weights, embedding tables), pre-transposing or re-laying-out the data is free at inference time. For activations inside a training loop where the same tensor is used by multiple operations with different access patterns, shared memory tiling is better because a permanent layout change would make one operation faster but another slower.
The rule: if you access the same data with the same access pattern many times, change the layout. If you access it twice with different patterns (once row-wise, once column-wise), use smem as a transposing buffer.
Q5: Why is stride 33 conflict-free when stride 32 causes a 32-way conflict?
For stride s, thread i accesses smem[i * s], which is in bank (i * s) % 32. For s = 32: bank = (i * 32) % 32 = 0 for all i. Every thread hits bank 0.
For s = 33: bank = (i * 33) % 32 = (i * 32 + i) % 32 = i % 32. Thread i lands in bank i, so the banks are 0, 1, 2, ..., 31 - all distinct. No conflicts.
More generally: stride s is conflict-free if and only if gcd(s, 32) = 1 (s coprime to 32). Any odd stride is conflict-free. Strides that are multiples of 2 but not of 32 cause partial conflicts. Strides that are multiples of 32 cause complete 32-way conflicts (or a broadcast if all addresses are identical).
Q6: In a profiler, how do you distinguish between a coalescing problem and a bank conflict problem?
They show up in different metrics:
- Coalescing: l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum divided by l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum. If this sectors-per-request ratio is well above 4 (the minimum for 4-byte accesses), you have non-coalesced global memory access.
- Bank conflicts: l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum and the _op_st variant. Non-zero values mean you have shared memory bank conflicts. The value is the number of extra serialized transactions caused by conflicts.
Coalescing problems are in global memory access. Bank conflict problems are in shared memory access. They are independent issues. A kernel can have one, both, or neither.
Q7: What is the performance impact of accessing global memory with stride 1 vs stride 4 vs stride 32?
For a warp of 32 threads with float32 data:
- Stride 1: all 32 threads read 32 consecutive floats = 1 x 128-byte transaction. Efficiency = 100%.
- Stride 4: threads read addresses at 0, 16, 32, ..., 496 bytes. These span 4 cache lines (0-127, 128-255, 256-383, 384-511). The hardware issues 4 transactions - 512 bytes fetched for 128 bytes needed. Efficiency = 25%.
- Stride 32: threads read addresses at 0, 128, 256, ..., 3968 bytes. Each thread hits a different 128-byte cache line. 32 transactions for 32 floats. Efficiency = 3.1%.
For L2 cache hits, the raw latency is lower but the transaction overhead scales the same way. For HBM misses, stride-32 access is approximately 20-30x slower than stride-1 on real workloads due to the combination of extra transactions and serialization at the memory controller.
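These counts are easy to sanity-check on the host - a quick sketch in plain C++ (the helper is illustrative):
#include <cstdio>
#include <set>

// Distinct 128-byte lines touched by a warp of 32 float loads at `stride` floats
int lines_touched(int stride) {
    std::set<long> lines;
    for (int t = 0; t < 32; t++) lines.insert((long)t * stride * 4 / 128);
    return (int)lines.size();
}

int main() {
    for (int s : {1, 4, 32})
        printf("stride %2d -> %2d transactions, efficiency %5.1f%%\n",
               s, lines_touched(s), 100.0 / lines_touched(s));
    return 0;
}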
Summary
Memory coalescing and bank conflicts are the two access patterns that determine whether your kernel uses 3% or 95% of available memory bandwidth. They operate at different levels of the hierarchy but both reduce to the same root cause: unplanned stride in memory access.
For global memory: keep consecutive threads accessing consecutive addresses. When the algorithm requires transposing (row vs column access), use shared memory as an intermediary - load coalesced, rearrange in smem, write coalesced.
For shared memory: add +1 padding to the shared dimension when your kernel reads smem in column order. The cost is 3% extra smem. The benefit is eliminating 32-way serialization on every smem read.
Profile first. Measure sectors_per_request for global memory coalescing and bank_conflicts for shared memory. These two numbers tell you where your memory efficiency is going and exactly what to fix.
