Profiling with Nsight
Reading time: ~40 min · Interview relevance: High · Target roles: CUDA Developer, Performance Engineer, ML Systems Engineer
The Kernel That Looked Fine Until It Didn't
It is 2 AM. Your training job is running. The loss curve looks reasonable. The GPU is "doing something" - you can see activity in nvidia-smi. You believe the code is working correctly.
Then someone on the team asks: "Are we actually using the hardware efficiently?" You open nvidia-smi dmon and see SM utilization hovering at 23%. The GPU is physically present. The kernel is launching. But three quarters of the compute fabric is idle.
You add more threads. 23% becomes 24%. You try a different block size. Still 23%. You read a forum post that says to increase occupancy. You try that too. Still 23%.
The problem is that you are optimizing by intuition - changing things and hoping the number improves. That is not engineering. That is gambling. And on the GPU, your intuition is almost always wrong because the actual constraint is invisible to you: it might be L2 cache thrash, warp divergence, atomic contention, an uncoalesced memory access pattern, or a pipeline hazard causing stalls that compound across 10,000 warps. No amount of staring at source code reveals which one it is.
This is exactly the situation Nsight was built to resolve. You attach the profiler, profile for 10 minutes, and the answer is right there in black and white: "Stall - Long Scoreboard: 67% of all warp cycles." That means warps are waiting for L2 reads to return. That is a specific, actionable diagnosis. You reorganize the memory access pattern to improve L2 locality, re-profile, and SM efficiency climbs to 71%.
Ten minutes of profiling saved you three days of guessing.
This lesson teaches you how to use the full profiling stack: Nsight Systems for system-level pipeline analysis, Nsight Compute for kernel-level deep dives, and the PyTorch built-in profiler for Python-level training loop analysis. More importantly, it teaches you how to read what these tools tell you so that every optimization decision you make is grounded in data.
Why This Exists - The Problem With Performance Intuition
Before profiling tools existed, GPU developers optimized by writing code, running it, measuring wall-clock time, changing something, and measuring again. This approach has a fatal flaw: wall-clock time is a summary statistic. It tells you that something is slower, but not what is slow or why.
Consider two kernels that both run in 5 ms on an A100. One is memory-bound, starving for bandwidth. The other is compute-bound, bottlenecked on FP32 throughput. The optimization path for these two kernels is completely opposite. If you apply compute optimizations to a memory-bound kernel - loop unrolling, instruction-level parallelism, instruction scheduling - you will see zero improvement, because compute is not the constraint. You just wasted a week.
Professional GPU engineers do not optimize kernels. They optimize bottlenecks. And the only way to find the bottleneck is measurement.
A second problem with intuition-based optimization is that modern GPUs execute thousands of threads simultaneously across a complex pipeline of caches, memory controllers, and execution units. The interaction effects are non-linear and non-obvious. A seemingly innocent change to one kernel can alter cache pressure in a way that degrades a completely different kernel running in the same process. Without a profiler, you cannot see these effects.
Nsight Compute and Nsight Systems give you hardware performance counters - silicon-level measurements of what the GPU actually did, not estimates. SM activity cycle counts, L1 hit rates, HBM read bytes, warp issue cycles, instruction mix - this is the ground truth. When the profiler says your kernel spent 67% of cycles stalled waiting for memory, that is the GPU itself reporting what happened.
Historical Context - From Visual Profiler to Nsight
The history of NVIDIA profiling tools mirrors the history of GPU computing itself.
In the early CUDA era (2007-2010), developers used cudaEvent timing and wall-clock measurements. This told you how long something took but nothing about why.
NVIDIA Visual Profiler (nvvp) arrived around 2010 and was the first tool to show a timeline - a graphical view of kernel launches, memory transfers, and CPU activity laid out on a time axis. This was transformative. Suddenly you could see that your GPU was idle for 40% of the training step while the CPU was preparing the next batch. Pipeline bubbles became visible.
nvprof followed as the command-line counterpart to nvvp. It collected performance counters and produced detailed kernel statistics. It was widely used through the Pascal and Volta generation.
Then Turing arrived (2018), and NVIDIA completely rebuilt the profiling infrastructure. The new hardware had a much richer set of performance counters, but accessing them required a privileged hardware sampling mechanism that was incompatible with nvvp and nvprof. NVIDIA deprecated both tools and replaced them with:
- Nsight Systems (nsys): the successor to Visual Profiler for timeline and pipeline analysis
- Nsight Compute (ncu): the successor to nvprof for per-kernel metrics and roofline analysis
Both tools are shipping today and are the standard for all Ampere, Hopper, and Ada Lovelace work. If you see tutorials still using nvprof, they are outdated - nvprof does not support hardware counters on Turing and newer.
Two Tools, Two Levels of Analysis
The single most important thing to understand about GPU profiling is that Nsight Systems and Nsight Compute answer fundamentally different questions.
| | Nsight Systems | Nsight Compute |
|---|---|---|
| Question answered | "What is the system doing?" | "Why does this kernel perform this way?" |
| Scope | System-level timeline | Kernel-level hardware metrics |
| View | CPU + GPU together | One kernel at a time |
| Reveals | Overlap, gaps, serialization | Roofline, stalls, cache hit rates |
| Overhead | Fast (low overhead) | Slow (replays kernel multiple times) |
| When to use | Use first | Use second, on specific kernels |
The workflow is always: Nsight Systems first to identify which part of your pipeline is the bottleneck, then Nsight Compute on that specific kernel to understand why it is slow. Never skip the first step and go straight to Nsight Compute on a random kernel - you will profile the wrong thing.
Nsight Systems - System-Level Timeline Analysis
What Nsight Systems Shows You
Nsight Systems records a timeline of everything happening in your process: every CUDA kernel launch, every memory copy, every CUDA API call, CPU thread activity, PCIe transfer, NCCL communication, and PyTorch operator execution. It answers questions like:
- Is the GPU sitting idle between kernels while the CPU prepares data?
- Are H2D and D2H transfers overlapping with compute, or are they serialized?
- Which NCCL collective is the bottleneck in your distributed training?
- Is your DataLoader actually keeping the GPU fed?
- How long does each forward pass, backward pass, and optimizer step take?
Running Nsight Systems
# Profile a full training script
nsys profile \
    --trace=cuda,nvtx,osrt,cudnn,cublas \
    --output=profile_output \
    python train.py --epochs 1 --max-steps 50
# Opens the .nsys-rep file in Nsight Systems GUI
nsys-ui profile_output.nsys-rep
# Or generate a stats report from command line
nsys stats profile_output.nsys-rep
Key flags:
- `--trace=cuda` captures CUDA API calls and kernel launches
- `--trace=nvtx` captures NVTX range annotations you add to your code
- `--trace=cudnn` captures cuDNN calls (convolutions, attention, etc.)
- `--trace=cublas` captures cuBLAS calls (GEMMs)
- `--output` sets the output file name
- `--delay` and `--duration` let you skip startup and capture a steady-state window
Reading the Timeline
The Nsight Systems timeline has several rows:
- CPU threads: shows which threads are active, what they are calling
- CUDA API: cudaMemcpyAsync, cudaLaunchKernel, cudaStreamSynchronize calls
- CUDA HW: actual GPU execution - this is the ground truth of what the GPU did
- Memory: PCIe transfers between host and device
- NCCL: collective communication operations (if using multi-GPU)
The most important thing to look for is GPU idle time - gaps in the CUDA HW row. Every gap is wasted hardware. Common causes:
- CPU-GPU synchronization: `cudaDeviceSynchronize()` or `torch.cuda.synchronize()` blocking the CPU, causing the GPU to drain its queue and sit idle while the CPU does work (a sync-hunting sketch follows this list)
- DataLoader bottleneck: the GPU finishes a batch faster than the DataLoader can prepare the next one
- Memory transfer serialization: H2D and compute not overlapping (streams not used)
- Launch overhead: many small kernels with gaps between them (kernel fusion needed)
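If you suspect hidden synchronization points, recent PyTorch releases ship a sync debug mode that flags them at the call site. A minimal sketch, assuming a PyTorch version that includes `torch.cuda.set_sync_debug_mode` (added around 1.10):

```python
import torch

# Make every implicit CPU-GPU synchronization warn (or raise, with "error").
torch.cuda.set_sync_debug_mode("warn")

x = torch.randn(1024, device="cuda")
total = x.sum().item()  # .item() forces a device sync -> triggers a warning

torch.cuda.set_sync_debug_mode("default")  # restore normal behavior
```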
NVTX Annotations - Labeling Your Timeline
Without annotations, the CUDA HW row shows a dense wall of kernel names like void cudnn::detail::wgrad_alg0_engine... which is useless for understanding what your training loop is doing. NVTX lets you add human-readable labels.
import torch
import torch.cuda.nvtx as nvtx

for step, (inputs, labels) in enumerate(dataloader):
    nvtx.range_push(f"step_{step}")

    nvtx.range_push("data_transfer")
    inputs = inputs.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    nvtx.range_pop()

    nvtx.range_push("forward")
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    nvtx.range_pop()

    nvtx.range_push("backward")
    loss.backward()
    nvtx.range_pop()

    nvtx.range_push("optimizer")
    optimizer.step()
    optimizer.zero_grad()
    nvtx.range_pop()

    nvtx.range_pop()  # step_N

    if step >= 10:
        break
With these annotations, the Nsight Systems timeline shows color-coded regions for each phase. You can immediately see how long forward vs. backward vs. optimizer takes, and whether any phase has unexpected gaps inside it.
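Manually wrapping every phase gets tedious for deep models. PyTorch can also emit one NVTX range per autograd operator via `torch.autograd.profiler.emit_nvtx`, which gives you operator-level labels in the nsys timeline without touching model code. A minimal sketch, reusing the model, data, and criterion from the loop above:

```python
import torch
from torch.autograd.profiler import emit_nvtx

# Intended to run under nsys, e.g.:
#   nsys profile --trace=cuda,nvtx python this_script.py
# Each ATen operator pushes its own NVTX range, so the timeline shows
# op names instead of raw kernel names.
with emit_nvtx():
    for step in range(3):
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
```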
Identifying Pipeline Bubbles
A pipeline bubble is a period where the GPU is idle because it is waiting for work. In a training loop this typically looks like:
Timeline (time flows right):
────────────────────────────────────────────────────────────────────
CPU:  [prepare_batch]                    [prepare_batch]
GPU:                  [forward+backward]                 [forward+backward]
       ↑ bubble here ↑                    ↑ bubble here ↑
The fix is prefetching: start loading batch N+1 to the GPU while the GPU is processing batch N. This requires pinned memory and non-blocking transfers, covered in the CUDA Streams lesson.
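As a preview, here is a minimal double-buffering sketch. It assumes a `dataloader` created with `pin_memory=True` and a hypothetical `train_step` function; the full treatment is in the CUDA Streams lesson:

```python
import torch

copy_stream = torch.cuda.Stream()  # side stream dedicated to H2D copies

def to_device_async(batch):
    # Copies from pinned host memory with non_blocking=True run
    # asynchronously, overlapping with compute on the default stream.
    with torch.cuda.stream(copy_stream):
        return batch.cuda(non_blocking=True)

it = iter(dataloader)
next_batch = to_device_async(next(it))
for _ in range(len(dataloader) - 1):
    torch.cuda.current_stream().wait_stream(copy_stream)  # this batch's copy done
    batch = next_batch
    next_batch = to_device_async(next(it))  # start copying batch N+1 ...
    train_step(batch)                       # ... while batch N computes
```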
Nsight Compute - Kernel-Level Deep Dive
What Nsight Compute Shows You
Once Nsight Systems tells you which kernel is slow, Nsight Compute tells you why. It replays the kernel multiple times, each time collecting a different set of hardware counters, then aggregates them into a comprehensive profile.
Key sections in a Nsight Compute report:
- GPU Speed of Light: quick summary - are you compute-bound or memory-bound?
- Roofline Chart: visual position of your kernel relative to hardware limits
- Memory Workload Analysis: L1/L2/HBM bandwidth and hit rates
- Compute Workload Analysis: SM throughput, warp efficiency, instruction mix
- Scheduler Statistics: warp stall reasons, issued warps per cycle
- Source Counters: which lines of source code are the hottest
Running Nsight Compute
# Profile a specific kernel with full metrics
ncu --set full \
    --kernel-name "my_kernel" \
    --launch-skip 5 \
    --launch-count 3 \
    --output profile_kernel \
    python -c "
import torch
from my_module import run_kernel

# warmup
for _ in range(5):
    run_kernel()
torch.cuda.synchronize()

# profiled runs
for _ in range(3):
    run_kernel()
torch.cuda.synchronize()
"
# Open in GUI
ncu-ui profile_kernel.ncu-rep
# Or print report to terminal
ncu --import profile_kernel.ncu-rep --print-summary per-kernel
Key flags:
- `--set full` collects all available metrics (slow but comprehensive)
- `--set roofline` collects the metrics needed for the roofline chart (faster)
- `--kernel-name` filters to a specific kernel by name (supports regex)
- `--launch-skip N` skips the first N launches (skip warmup iterations)
- `--launch-count N` profiles N launches then stops
- `--output` sets the output file name
:::warning Nsight Compute Overhead
--set full replays each kernel ~20 times to collect all counters. This makes the profiled run 10-50x slower than normal. Never profile a long training run with --set full. Profile 1-3 iterations of a specific kernel.
:::
Reading the GPU Speed of Light Section
The first section you see in Nsight Compute is "GPU Speed of Light" - a high-level scorecard:
SM Throughput: 23.4% ← percentage of peak SM compute used
Memory Throughput: 87.6% ← percentage of peak HBM bandwidth used
This immediately tells you the bottleneck category:
- Memory throughput >> SM throughput: kernel is memory-bound. Optimizing compute will not help.
- SM throughput >> Memory throughput: kernel is compute-bound. Reducing arithmetic will not help.
- Both low: kernel has a launch configuration problem (too few threads, occupancy too low) or there are significant synchronization stalls.
The Roofline Chart
The roofline chart plots your kernel's position relative to two hardware limits: compute throughput (FLOPS) and memory bandwidth (bytes/sec). The "roof" is the lower of these two limits at your kernel's arithmetic intensity.
FLOP/s
│                    ┌──────────────────  compute ceiling (312 TFLOPS on A100)
│                   ╱
│                  ╱
│                 ╱   ← memory bandwidth slope (2 TB/s on A100)
│                ╱
│        ●      ╱
│        ↑ your kernel
└────────┼──────────────────────────────  FLOP/byte
         arithmetic intensity

Below the roofline = an achievable performance gap exists.
If your kernel is far below the roofline - specifically far below the memory bandwidth slope - it means you are not even saturating memory bandwidth. There is a hidden bottleneck beyond simple memory vs. compute: likely warp stalls, bad access patterns, or occupancy limitations.
If your kernel sits near the roofline but below it, you are close to optimal for that compute-to-memory ratio. The only way to go higher is to change the algorithm to improve arithmetic intensity (e.g., tiling, kernel fusion).
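You can estimate which roof applies before ever opening the profiler. A back-of-envelope sketch for an FP16 GEMM, using the A100 numbers from the chart above (312 TFLOPS, 2 TB/s):

```python
# C[M,N] = A[M,K] @ B[K,N] in FP16, each matrix touched once (idealized)
M = N = K = 4096
flops = 2 * M * N * K                      # one multiply + one add per MAC
bytes_moved = 2 * (M * K + K * N + M * N)  # 2 bytes per FP16 element
ai = flops / bytes_moved                   # arithmetic intensity, FLOP/byte
balance = 312e12 / 2e12                    # A100 balance point: 156 FLOP/byte

print(f"AI = {ai:.0f} FLOP/byte")          # ~1365 -> well past the knee
print("compute-bound" if ai > balance else "memory-bound")
```

A large GEMM lands far to the right of the knee, which is why matrix multiplies are compute-bound while elementwise ops (arithmetic intensity below 1) are hopelessly memory-bound.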
Memory Workload Analysis
The memory section breaks down where your kernel is spending memory traffic:
L1 Cache Hit Rate: 45.2% ← fraction of L1 reads that hit cache
L2 Cache Hit Rate: 73.8% ← fraction of L2 reads that hit cache
HBM Read Bandwidth: 1.24 TB/s (peak: 2.0 TB/s)
HBM Write Bandwidth: 0.31 TB/s
A low L1 hit rate often indicates uncoalesced global memory accesses. When threads in a warp access non-contiguous memory addresses, the hardware cannot coalesce them into a single wide transaction and instead issues multiple narrow transactions, inflating total memory traffic.
A low L2 hit rate often indicates poor spatial locality - the working set is too large to fit in L2, forcing frequent HBM accesses.
HBM bandwidth near peak on a memory-bound kernel is actually good - it means you are efficiently using available bandwidth. HBM bandwidth well below peak on a memory-bound kernel suggests the access pattern has issues (random accesses, bank conflicts on shared memory).
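The effect of access patterns on effective bandwidth is easy to demonstrate from Python. A rough illustration, not a substitute for ncu - it counts read bytes only, so treat the numbers as relative:

```python
import time
import torch

x = torch.randn(8192, 8192, device="cuda")
nbytes = x.numel() * x.element_size()

def bandwidth_gbs(fn, n_iter=50):
    fn()
    torch.cuda.synchronize()  # warmup and drain the queue
    t0 = time.perf_counter()
    for _ in range(n_iter):
        fn()
    torch.cuda.synchronize()
    return n_iter * nbytes / (time.perf_counter() - t0) / 1e9

# Contiguous copy: warps read consecutive addresses (well coalesced).
print(f"contiguous: {bandwidth_gbs(lambda: x.clone()):.0f} GB/s")
# Transposed copy: one side of the copy strides across rows (poorly coalesced).
print(f"strided:    {bandwidth_gbs(lambda: x.t().contiguous()):.0f} GB/s")
```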
Warp Stall Reasons - The Most Actionable Metric
The scheduler statistics section shows the distribution of warp stall reasons. This is often the most directly actionable information in the entire profile.
When a warp cannot issue an instruction in a cycle, it is stalled. The GPU records why it stalled. The major categories:
| Stall Reason | What It Means | Common Fix |
|---|---|---|
| Long Scoreboard | Waiting for L2/HBM read to return | Improve data locality, prefetching |
| Short Scoreboard | Waiting for L1/shared memory read | Reduce shared memory bank conflicts |
| Synchronization | Waiting at __syncthreads() | Reduce divergence before sync, reorganize |
| Execution Dependency | Output of prev instruction not ready | Instruction-level interleaving |
| Memory Throttle | Memory subsystem is saturated | Reduce memory traffic, improve reuse |
| Wait | Waiting for dependent warp to finish | Increase parallelism |
| Not Selected | Warp is ready but scheduler chose another | Not a problem - this is normal |
"Long Scoreboard" at 60%+ is the most common finding in production ML kernels. It means your kernel is waiting on L2 or HBM reads. The fix is always some form of improving data reuse: blocking, tiling, caching in shared memory, or restructuring access patterns to improve L2 locality.
"Synchronization" stalls at 40%+ often indicate a barrier that is waiting because some warps diverge before reaching it. Check for conditionals inside __syncthreads() regions.
Achieved vs. Theoretical Occupancy
Occupancy is the ratio of active warps to the maximum possible warps on an SM.
Nsight Compute shows:
- Theoretical occupancy: what the hardware can theoretically support given your kernel's resource usage (registers per thread, shared memory per block)
- Achieved occupancy: what actually happened at runtime
A large gap between theoretical and achieved usually means some blocks are serialized due to resource contention or the kernel runtime is so short that occupancy cannot build up before the kernel finishes.
Low occupancy is not always the problem. A kernel with 25% occupancy but a high L1 hit rate and no stalls can outperform a kernel at 100% occupancy that is constantly stalling on memory. Occupancy is a tool, not a goal.
A quick decision guide for occupancy analysis (a back-of-envelope occupancy calculator follows below):

- High stalls + low occupancy → increase occupancy (reduce registers, reduce shared memory)
- High stalls + high occupancy → fix the stall reason (not an occupancy problem)
- Low stalls + high throughput → already optimal
- Low stalls + low throughput → check the launch configuration and arithmetic intensity
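To see how resource usage caps theoretical occupancy, here is a rough calculator. The per-SM limits are A100-ish assumptions (64K registers, 164 KB shared memory, 64 warps); the authoritative numbers come from Nsight Compute's Occupancy section:

```python
# Rough theoretical-occupancy estimate for a single SM (A100-ish limits assumed).
def theoretical_occupancy(threads_per_block, regs_per_thread, shmem_per_block,
                          regs_per_sm=65536, shmem_per_sm=164 * 1024, max_warps=64):
    warps_per_block = (threads_per_block + 31) // 32
    blocks_by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    blocks_by_shmem = shmem_per_sm // max(shmem_per_block, 1)
    blocks_by_warps = max_warps // warps_per_block
    resident_blocks = min(blocks_by_regs, blocks_by_shmem, blocks_by_warps)
    return resident_blocks * warps_per_block / max_warps

# 256 threads/block, 64 registers/thread, 48 KB shared memory per block:
print(f"{theoretical_occupancy(256, 64, 48 * 1024):.0%}")  # 38% - shared-memory bound
```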
PyTorch Built-In Profiler
For ML engineers working primarily in PyTorch, the built-in profiler offers a tighter integration with the training loop than running Nsight externally. It captures both Python-level operator timing and CUDA kernel timing in a single trace.
Basic Usage
import torch
from torch.profiler import profile, record_function, ProfilerActivity

model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters())
inputs = torch.randn(32, 3, 224, 224).cuda()
labels = torch.randint(0, 1000, (32,)).cuda()

with profile(
    activities=[
        ProfilerActivity.CPU,
        ProfilerActivity.CUDA,
    ],
    record_shapes=True,    # record tensor shapes
    profile_memory=True,   # track tensor allocation/deallocation
    with_stack=True,       # capture Python call stack
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log/profiler"),
) as prof:
    for step in range(10):
        with record_function("forward"):
            outputs = model(inputs)
            loss = torch.nn.functional.cross_entropy(outputs, labels)
        with record_function("backward"):
            loss.backward()
        with record_function("optimizer"):
            optimizer.step()
            optimizer.zero_grad()
        prof.step()

# Print summary table sorted by CUDA time
print(prof.key_averages().table(
    sort_by="cuda_time_total",
    row_limit=20,
))
Profiler Schedule - Avoiding Full-Run Overhead
The torch.profiler.schedule function lets you profile only a window of steps, avoiding the overhead of profiling an entire training run:
from torch.profiler import profile, schedule, ProfilerActivity
# Skip the first 5 steps, wait on step 5, warm up on step 6, record steps 7-9
my_schedule = schedule(
    skip_first=5,
    wait=1,     # step 5: collect nothing (let the GPU settle)
    warmup=1,   # step 6: start tracing but discard (profiler warmup)
    active=3,   # steps 7-9: actually record
    repeat=1,   # do this cycle once, then stop
)
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=my_schedule,
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log/profiler"),
) as prof:
    for step in range(20):
        train_step(model, optimizer, inputs, labels)
        prof.step()
Reading the Profiler Output
The key_averages() table shows:
---------------------------------  ------------  -------------  --------------
Name                                 CPU total %   CUDA total %   CUDA time avg
---------------------------------  ------------  -------------  --------------
forward                                    12.3%          45.2%          4.52ms
backward                                   18.7%          38.9%          3.89ms
aten::convolution                           8.1%          28.3%          1.42ms
aten::mm                                    2.3%          15.1%          0.76ms
aten::relu_                                 0.8%           3.2%          0.16ms
cudaMemcpyAsync                             5.2%           0.0%          0.26ms
---------------------------------  ------------  -------------  --------------
High "CPU total %" with low "CUDA total %" on an operation often indicates a Python-level bottleneck or excessive CPU-GPU synchronization. High "CUDA total %" on a single operation like aten::mm (matrix multiply) is expected for transformer models.
Chrome Trace Export
# Export to Chrome trace format for visualization
prof.export_chrome_trace("trace.json")
Open chrome://tracing in Chrome, load trace.json, and you get an interactive timeline similar to Nsight Systems but for Python-level operators. This is extremely useful for identifying which PyTorch ops are consuming time and whether they overlap correctly.
Memory Profiling
with profile(
    activities=[ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True,
) as prof:
    outputs = model(inputs)
    loss = outputs.sum()
    loss.backward()

# Show memory allocation events
print(prof.key_averages().table(
    sort_by="self_cuda_memory_usage",
    row_limit=15,
))
The self_cuda_memory_usage column shows how much GPU memory each operator allocates. This is valuable for finding unexpected allocations during the backward pass - a common cause of OOM errors is operators that allocate large intermediate tensors that were not anticipated.
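For allocation-level detail beyond the operator table, recent PyTorch builds can record the full allocator history. Note these are private, version-dependent APIs (PyTorch 2.1-era underscore functions), so check your release before relying on them:

```python
import torch

# Private API (PyTorch 2.1-era): record every allocation with its stack trace.
torch.cuda.memory._record_memory_history(max_entries=100_000)

outputs = model(inputs)
loss = outputs.sum()
loss.backward()

# Dump a snapshot; drag the file into https://pytorch.org/memory_viz to explore.
torch.cuda.memory._dump_snapshot("mem_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```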
Complete Profiling Workflow: A Worked Example
Here is a realistic end-to-end profiling session for a custom attention kernel.
Step 1 - Establish the Baseline
import torch
import time

from my_attention import my_attention  # the custom kernel under test

# Simple timing baseline
def time_kernel(fn, *args, n_warmup=5, n_iter=20):
    for _ in range(n_warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iter):
        fn(*args)
    torch.cuda.synchronize()
    end = time.perf_counter()
    return (end - start) / n_iter * 1000  # ms

q = torch.randn(4, 8, 512, 64, device="cuda", dtype=torch.float16)
k = torch.randn(4, 8, 512, 64, device="cuda", dtype=torch.float16)
v = torch.randn(4, 8, 512, 64, device="cuda", dtype=torch.float16)

ms = time_kernel(my_attention, q, k, v)
print(f"Kernel time: {ms:.3f} ms")
Step 2 - Nsight Systems for Pipeline View
nsys profile \
    --trace=cuda,nvtx \
    --output=attn_profile \
    python -c "
import torch
import torch.cuda.nvtx as nvtx

# Import your kernel
from my_attention import my_attention

q = torch.randn(4, 8, 512, 64, device='cuda', dtype=torch.float16)
k = torch.randn(4, 8, 512, 64, device='cuda', dtype=torch.float16)
v = torch.randn(4, 8, 512, 64, device='cuda', dtype=torch.float16)

# warmup
for _ in range(5):
    my_attention(q, k, v)
torch.cuda.synchronize()

# profiled iterations with NVTX labels
for i in range(10):
    nvtx.range_push(f'attn_iter_{i}')
    my_attention(q, k, v)
    torch.cuda.synchronize()
    nvtx.range_pop()
"
Look at the timeline. Is there a single kernel taking most of the time? Are there multiple kernels? Are there gaps (CPU-GPU sync points)?
Step 3 - Nsight Compute on the Slow Kernel
# Identify the kernel name from the nsys output, then deep-profile it
ncu \
    --set full \
    --kernel-name "my_attention_kernel" \
    --launch-skip 5 \
    --launch-count 3 \
    --output attn_kernel_profile \
    python -c "
import torch
from my_attention import my_attention

q = torch.randn(4, 8, 512, 64, device='cuda', dtype=torch.float16)
k = torch.randn(4, 8, 512, 64, device='cuda', dtype=torch.float16)
v = torch.randn(4, 8, 512, 64, device='cuda', dtype=torch.float16)

for _ in range(10):
    my_attention(q, k, v)
torch.cuda.synchronize()
"
Step 4 - Interpret and Fix
Suppose the profile shows:
- Memory Throughput: 89% of peak - kernel is memory-bound
- Long Scoreboard stalls: 62% of warp cycles
- L2 Hit Rate: 41%
Diagnosis: the attention score matrix access pattern has poor L2 locality. The fix is to tile the computation so that each tile fits in L2 cache before moving on. This is exactly what FlashAttention does - it tiles the Q, K, V matrices to exploit SRAM (shared memory) and L2 locality, reducing HBM traffic by ~4-5x.
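To make the tiling idea concrete, here is a deliberately simplified PyTorch sketch that processes queries in chunks so each score tile stays small, rather than materializing the full seq x seq score matrix at once. It illustrates the principle only - it is not the FlashAttention algorithm, which also tiles K/V with an online softmax in on-chip memory:

```python
import torch

def tiled_attention(q, k, v, q_chunk=128):
    # q, k, v: (batch, heads, seq, dim). Each iteration touches only a
    # (q_chunk x seq) score tile, improving cache residency.
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for i in range(0, q.shape[-2], q_chunk):
        scores = (q[..., i:i + q_chunk, :] @ k.transpose(-2, -1)) * scale
        out[..., i:i + q_chunk, :] = scores.softmax(dim=-1) @ v
    return out
```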
Step 5 - Re-profile to Verify
After applying the fix, repeat the exact same profiling commands. Compare the metrics side-by-side:
                     Before     After
Memory Throughput:   89%        76%      (still memory-bound, but less traffic)
Long Scoreboard:     62%        28%      (stalls cut in half)
L2 Hit Rate:         41%        78%      (tiling works)
HBM reads:           9.2 GB     2.1 GB   (4.4x less HBM traffic)
Kernel time:         4.8 ms     1.1 ms   (4.4x speedup)
The profile tells you exactly why the speedup happened, which gives you confidence that the optimization is principled and not a lucky accident that breaks on different hardware.
Production Profiling Notes
Profile representative workloads. A kernel that looks fast with batch size 1 may look completely different at batch size 64. Profile with the production batch size and sequence length.
Disable Python overhead during CUDA profiling. Python's GIL and object allocation can mask CUDA performance issues. When profiling CUDA kernels, minimize the Python code between kernel launches.
Use --launch-skip to avoid profiling warmup. GPU caches are cold on the first few kernel runs. The first iteration always looks slower. Skip at least 3-5 iterations before profiling.
Profile mixed precision separately from FP32. FP16 and BF16 kernels have different arithmetic intensity and may hit different bottlenecks than their FP32 counterparts. Always profile the exact precision you ship.
Beware of power throttling. NVIDIA GPUs throttle clock speed when power or temperature limits are hit. Profile results at sustained load may differ from burst benchmarks. Run a warmup of 30+ seconds before taking measurements.
Nsight Compute changes kernel timing. Because ncu replays kernels to collect counters, the execution order and timing are different from normal execution. The metrics are accurate but you cannot use ncu to measure wall-clock performance. Use cudaEvent timing for performance numbers.
Common Mistakes
:::danger Optimizing Before Profiling
The most expensive mistake in GPU optimization is choosing what to optimize based on intuition. "I think the matrix multiply is the bottleneck" is a hypothesis, not a fact. Profile first. The bottleneck is almost never where you think it is. Spending a week optimizing the wrong thing is not just wasted time - it often makes the real bottleneck harder to see.
:::
:::danger Profiling Warmup Iterations
GPU kernels are slower on the first few executions: JIT compilation has not run, instruction caches are cold, and L2 is empty. Profiling the first execution captures initialization overhead, not steady-state performance. Always skip at least 3-5 iterations with --launch-skip or skip_first in the PyTorch profiler schedule.
:::
:::warning Interpreting Occupancy as a Goal
High occupancy is a means to hide memory latency, not a goal in itself. A kernel at 100% occupancy that stalls constantly on Long Scoreboard is slower than a kernel at 50% occupancy with good L1 hit rates. Do not chase occupancy numbers. Chase stall reduction and memory efficiency.
:::
:::warning Nsight Compute Changes Execution Behavior
When ncu replays a kernel to collect counters, it runs the kernel in isolation, skipping all other concurrency. A kernel that normally overlaps with a memory transfer will not show that overlap in ncu. This is by design - ncu is measuring the kernel in isolation. For overlap analysis, use nsys.
:::
:::warning Small Kernels Have a High Launch Overhead Fraction
A kernel that runs in 2 microseconds will show very high "launch overhead" in the profiler because the ~5 microsecond kernel launch overhead is a significant fraction of total time. The fix is kernel fusion, not internal kernel optimization. If Nsight shows you have 500 tiny kernels launching per training step, fusion is the answer.
:::
Interview Questions and Answers
Q1: What is the difference between Nsight Systems and Nsight Compute, and when do you use each?
Nsight Systems is a system-level profiler that records a timeline of the entire process: CPU threads, CUDA API calls, kernel launches, memory transfers, NCCL operations, and PyTorch operators, all aligned on a single time axis. It answers "what is the pipeline doing and where are the gaps?" It has low overhead and you use it first to find which parts of your training loop are slow.
Nsight Compute is a kernel-level profiler that collects hardware performance counters for a specific kernel by replaying it multiple times. It answers "why does this kernel perform this way?" - showing roofline position, warp stall reasons, L1/L2/HBM hit rates, and achieved occupancy. It has high overhead and you use it second, on the specific kernel Nsight Systems identified as the bottleneck.
The workflow is always: nsys first to identify the slow kernel, ncu second to diagnose why.
Q2: A kernel has 75% SM utilization but low throughput. What Nsight metrics would you investigate?
High SM utilization with low throughput is a strong signal that warps are active but stalled - the SMs are technically "busy" (warps are assigned) but most cycles are wasted waiting for something. I would look at:
- Warp stall reasons in Scheduler Statistics: if "Long Scoreboard" is dominant, warps are waiting on L2/HBM reads - memory latency is the bottleneck. If "Synchronization" is high, there is barrier overhead. If "Execution Dependency" is high, there are data-dependent instruction chains.
- Memory Workload Analysis: check L1 and L2 hit rates. A low L2 hit rate with high Long Scoreboard stalls means the working set is thrashing L2, forcing HBM reads.
- Achieved vs. theoretical occupancy: a large gap suggests resource contention preventing full occupancy from being reached.
- Warp issue efficiency: the fraction of cycles where the warp scheduler issues an instruction. If this is low despite a high active warp count, the warps are all stalled simultaneously - there is not enough independent work to hide latency.
Q3: What does "Stall - Long Scoreboard" mean in Nsight Compute, and how do you fix it?
The scoreboard is the hardware mechanism that tracks which registers have pending results. When a warp issues a load instruction, the destination registers are marked "pending" in the scoreboard. Any subsequent instruction that depends on those registers cannot execute until the load completes and the scoreboard clears.
"Long Scoreboard" stalls mean warps are waiting for L2 or HBM reads to return. "Long" refers to the long latency of off-chip memory accesses (hundreds of cycles for HBM). The warp is literally sitting idle, waiting for data.
Fixes in order of preference:
- Improve data locality: restructure access patterns so data is reused and stays in L1/L2 cache. Tiling is the primary technique.
- Increase occupancy: more warps per SM means more warps available to hide the latency of stalled warps. If warp A is stalled on an HBM load, warp B, C, D can issue instructions in the meantime.
- Software prefetching: issue load instructions many cycles before the data is needed, using `__ldg()` read-only cache hints or explicit prefetch intrinsics.
- Reduce memory footprint: a smaller working set fits better in L2, improving hit rates.
Q4: How do you use NVTX annotations to make Nsight Systems more useful in PyTorch?
NVTX (NVIDIA Tools Extension) lets you mark regions of your code with named ranges that appear as colored bands in the Nsight Systems timeline. Without annotations, the timeline shows opaque CUDA kernel names that are impossible to map to your training logic.
In PyTorch, you use torch.cuda.nvtx.range_push("name") and torch.cuda.nvtx.range_pop() to wrap regions. You add annotations at the training loop level (forward, backward, optimizer step, data loading) and at the module level for complex models. When you open the nsys timeline, each phase has its own band, making it trivial to see where time is actually going.
A practical pattern is to annotate at two levels: the training loop level with step numbers, and within each step at the phase level. This lets you zoom in on an individual step and see the exact breakdown.
Q5: How would you profile a production training job without disrupting it?
First choice: use the PyTorch profiler with schedule(skip_first=N, wait=1, warmup=1, active=3, repeat=1). This profiles only 3 steps during a window deep into training (after warmup), writes a trace file, and then stops collecting with near-zero ongoing overhead.
Second choice: capture a short profiling window with nsys profile --delay 60 --duration 30 python train.py. This waits 60 seconds (letting the job warm up), profiles for 30 seconds, then detaches.
What I would never do: profile the entire training run with --set full in Nsight Compute. That replays kernels 20x and would make an overnight job run for weeks.
Also important: run the profiling job with the same batch size, sequence length, and hardware configuration as production. A profile from a different configuration often gives misleading results.
Q6: What is the roofline model and what does it tell you from a Nsight Compute report?
The roofline model plots your kernel's achieved FLOP/s against its arithmetic intensity (FLOP per byte of memory traffic). The "roof" has two segments: a sloped segment representing the memory bandwidth limit and a flat segment representing the compute limit. The intersection is the balance point.
A kernel whose arithmetic intensity puts it under the sloped segment is memory-bound; if it also sits well below that slope, it is not even saturating available bandwidth. A kernel near the sloped segment is bandwidth-limited. A kernel near the flat segment is compute-limited.
In Nsight Compute, the roofline chart plots your kernel as a dot and shows these ceilings. If the dot is far below the memory-bandwidth slope, the kernel is either not coalescing accesses efficiently, has unnecessary memory traffic, or is stalling so severely that bandwidth is not sustained. If the dot is near the top (flat ceiling), the kernel is well-optimized for its arithmetic intensity.
The key insight: to move the dot upward, you either improve efficiency (reduce stalls, fix coalescing) to approach the existing roof, or you change the algorithm to increase arithmetic intensity (tiling, fusion) so the applicable roof is higher.
Profiling Multi-GPU Workloads
Single-GPU profiling generalizes to multi-GPU, but there are additional layers to understand.
NCCL Operations in Nsight Systems
In distributed training, NCCL collectives (AllReduce, AllGather, ReduceScatter) are major contributors to step time. Nsight Systems captures NCCL operations when they are included in the trace:
# Profile a distributed training job on one rank
RANK=0 nsys profile \
    --trace=cuda,nvtx,nccl \
    --output=dist_profile_rank0 \
    torchrun --nproc_per_node=1 train_distributed.py
In the timeline, look for:
- AllReduce duration: how long gradient synchronization takes
- AllReduce start skew: are all ranks starting the AllReduce at the same time? If one rank arrives late (e.g., due to uneven batch sizes or load imbalance), all ranks wait for it. The latecomer is the bottleneck.
- Compute-communication overlap: DDP buckets gradients and launches AllReduce for each bucket as its gradients become ready, so communication should overlap with backward-pass computation for earlier layers. If the timeline shows AllReduce only starting after the entire backward pass completes, the overlap is broken.
Annotating Distributed Steps
import torch
import torch.distributed as dist
import torch.cuda.nvtx as nvtx

criterion = torch.nn.CrossEntropyLoss()

def train_step_distributed(model, optimizer, inputs, labels, rank):
    nvtx.range_push(f"rank{rank}_forward")
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    nvtx.range_pop()

    nvtx.range_push(f"rank{rank}_backward")
    loss.backward()
    # DDP AllReduce happens during backward (gradient hooks)
    nvtx.range_pop()

    nvtx.range_push(f"rank{rank}_optimizer")
    optimizer.step()
    optimizer.zero_grad()
    nvtx.range_pop()
Profile all ranks simultaneously and compare their timelines. Imbalance shows up as one rank's compute regions being longer than others before the AllReduce starts.
Profiling Regression Detection in CI
A kernel that was 1.2ms yesterday is 1.9ms today. Something regressed. Without automated profiling, this goes undetected until someone notices the training run is slow.
Lightweight CI Profiling with PyTorch
import torch
from torch.profiler import profile, ProfilerActivity
def profile_kernel_for_ci(fn, *args, n_warmup=3, n_measure=10):
    """
    Returns a dict of key metrics suitable for CI comparison.
    """
    # Warmup
    for _ in range(n_warmup):
        fn(*args)
    torch.cuda.synchronize()

    # Measure with profiler
    with profile(
        activities=[ProfilerActivity.CUDA],
        record_shapes=False,
    ) as prof:
        for _ in range(n_measure):
            fn(*args)

    # Average CUDA time per iteration. self_cuda_time_total excludes
    # children, so summing it does not double-count nested operators.
    stats = prof.key_averages()
    cuda_time_avg = sum(e.self_cuda_time_total for e in stats) / n_measure

    return {
        "cuda_time_ms": cuda_time_avg / 1000,  # convert us to ms
        "n_kernels": len(stats),
    }
# In your CI test:
def test_attention_kernel_performance():
    q = torch.randn(4, 8, 512, 64, device="cuda", dtype=torch.float16)
    k = torch.randn(4, 8, 512, 64, device="cuda", dtype=torch.float16)
    v = torch.randn(4, 8, 512, 64, device="cuda", dtype=torch.float16)

    metrics = profile_kernel_for_ci(my_attention, q, k, v)

    # Assert performance threshold
    assert metrics["cuda_time_ms"] < 5.0, (
        f"Attention kernel regressed: {metrics['cuda_time_ms']:.2f}ms "
        f"(threshold: 5.0ms)"
    )
Store the metrics in a JSON artifact per CI run and plot them over time. A sudden jump in cuda_time_ms on a PR is a performance regression. Catching it at review time is far cheaper than debugging it in production three weeks later.
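A minimal sketch of that comparison logic, with a hypothetical `check_regression` helper and a 15% tolerance band to absorb run-to-run noise:

```python
import json
import pathlib

def check_regression(metrics, baseline_path="perf_baseline.json", tolerance=1.15):
    path = pathlib.Path(baseline_path)
    if not path.exists():
        path.write_text(json.dumps(metrics))  # first run: record the baseline
        return
    baseline = json.loads(path.read_text())
    for key, value in metrics.items():
        if key in baseline and value > baseline[key] * tolerance:
            raise AssertionError(
                f"{key} regressed: {value:.3f} vs baseline {baseline[key]:.3f}"
            )
```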
What to Include in Performance Regression Tests
- Wall-clock time per step (end-to-end training step, not just one kernel)
- Peak memory usage (`torch.cuda.max_memory_allocated()`)
- Number of kernel launches per step (regressions often come from accidentally adding extra launches)
- Target kernel time for the most critical kernels (attention, GEMM, custom ops)
def get_step_metrics(model, optimizer, inputs, labels):
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)

    start_event.record()
    outputs = model(inputs)
    loss = outputs.sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    end_event.record()
    torch.cuda.synchronize()

    return {
        "step_time_ms": start_event.elapsed_time(end_event),
        "peak_memory_gb": torch.cuda.max_memory_allocated() / 1e9,
    }
Summary
Profiling is not optional in GPU engineering - it is the core skill that separates engineers who optimize by evidence from engineers who optimize by hope. Nsight Systems shows you the pipeline: where the GPU is idle, where transfers serialize, where the DataLoader starves compute. Nsight Compute shows you the kernel: why it is fast or slow, what the bottleneck is, and what to change. The PyTorch profiler gives you Python-level visibility into operator timing and memory allocation.
The workflow is always the same: profile with nsys to find the bottleneck in the pipeline, profile with ncu to diagnose the kernel, apply a principled fix, re-profile to confirm the improvement. Never optimize without a profiler. Never claim a kernel is optimized without showing the before-and-after metrics. The GPU has the receipts - profiling lets you read them.
