
Profiling Python and C Code

The Engineer Who Optimized the Wrong Thing for Three Weeks

The training run was slow. The team knew it was slow. A model that should train in 6 hours was taking 19 hours. The senior ML engineer looked at the training loop, spotted a complex attention score normalization that involved several Python operations, and spent three weeks rewriting it in Cython. After shipping the optimization, the training run took 18.7 hours. Eighteen and a half minutes saved on a 19-hour job.

Meanwhile, a new engineer ran nvidia-smi dmon during training and noticed something odd: GPU utilization was oscillating between 3% and 97% on a 4-second cycle. She pulled the data loading metrics. The DataLoader was fetching images from an NFS mount across a network link, processing them with a pure Python augmentation pipeline using 2 workers, and then blocking the training loop while the GPU sat idle waiting for the next batch. The bottleneck was not the attention computation. It was the data pipeline.

Four hours of work - increasing DataLoader workers, moving augmentation to GPU via Kornia, and adding a prefetch buffer - brought training time down to 5.8 hours. That is a 3.3x improvement. The three-week Cython rewrite had found a 1.6% improvement.

This story repeats constantly in ML engineering. Experienced teams have internalized the lesson. Do not optimize code you feel is slow. Measure code to find what is actually slow. The profiler is not a tool you reach for after you have formed a hypothesis. It is the tool you use to form the hypothesis in the first place.

This lesson covers the complete profiling stack for Python and ML code - from high-level call-graph profilers to line-level CPU profilers to GPU kernel tracers. Each tool has a specific role. Knowing which one to reach for first is as important as knowing how to use it.


Why This Exists - The Cost of Intuition

Human intuition about performance is systematically wrong. Studies of expert programmers have shown that even experienced engineers correctly identify the performance bottleneck before profiling less than 20% of the time. The reasons are structural.

We reason about code as we write it - sequentially, function by function. But a running program's performance is determined by frequency (how often is this called?) multiplied by cost (how expensive is each call?). A function that looks expensive may be called once per day. A function that looks trivial may be called 10 million times per inference.

Profilers exist because they capture the actual execution - not the code as written, but the code as running. They measure what the CPU spent time on, what functions allocated memory, what operations blocked the GPU, and what network calls happened. The profiler replaces intuition with evidence.

The philosophical rule is: measure first, optimize second, measure again to verify. Every optimization that is not bookended by measurements is speculation.


Historical Context - From gprof to Continuous Profiling

The first widely-used profiler was gprof, introduced in 1982 with the Berkeley Unix distribution. It used a combination of statistical sampling (interrupt the program every N microseconds and record the current instruction pointer) and exact call counting (instrument every function call/return). The output was a flat profile and a call graph.

Python's ecosystem built its own profiling infrastructure. The profile module (deterministic, Python-only) appeared in Python 1.x. cProfile replaced it in Python 2.5 - it uses the same hooks but is implemented in C, reducing profiling overhead from roughly 100x to 5-10x.

The sampling profiler approach - which gprof pioneered - was reinvented for Python, first as pyflame (2016) and later as py-spy (2018). Sampling profilers attach to a running process and periodically read the call stack. They have near-zero overhead and work on production processes without code changes.

The ML training profiler, torch.profiler, appeared in PyTorch 1.8 (2021). It unified CPU and GPU timeline profiling with Chrome trace format output, enabling millisecond-level analysis of what GPUs are actually doing during training.


Profiling Philosophy - The Measurement Protocol

Before touching any profiling tool, establish the measurement protocol:

  1. Establish a reproducible benchmark: The same inputs, same hardware, same environment. Profiling a function that takes different inputs each run produces noise.

  2. Isolate the workload: Profile the specific path you care about. Do not profile an entire application startup and try to read the results for a single endpoint.

  3. Run enough iterations: A function that takes 100 microseconds needs many invocations to produce a stable measurement. The profiler itself has overhead - you want signal above the noise floor.

  4. Profile at the right level: Start with a coarse call-graph profiler to identify which function to investigate. Then switch to a line-level profiler for that specific function.

  5. Verify the fix: After optimizing, re-run the same profiler with the same benchmark. Confirm the time went down where expected and did not increase elsewhere.
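
A minimal sketch of this protocol in code - fixed inputs, warmup runs, repeated timed runs, and a result you can compare before and after a change. It uses only the standard library; the workload function and iteration counts are illustrative placeholders.

# benchmark_harness.py - minimal sketch of the measurement protocol
import statistics
import time

def benchmark(func, *args, n_warmup=3, n_runs=20):
    """Run func on fixed inputs and report stable timing statistics."""
    for _ in range(n_warmup):              # warm caches, lazy imports, JIT
        func(*args)
    timings_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        func(*args)
        timings_ms.append((time.perf_counter() - start) * 1000)
    return {
        "median_ms": statistics.median(timings_ms),
        "stdev_ms": statistics.stdev(timings_ms),
        "n_runs": n_runs,
    }

def workload(data):
    return sorted(x * x for x in data)

if __name__ == "__main__":
    data = list(range(100_000))            # same input every run
    print(benchmark(workload, data))       # baseline measurement
    # ...apply the optimization, then re-run the exact same benchmark to verify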


cProfile - The Standard Call-Graph Profiler

cProfile is Python's built-in deterministic profiler. It instruments every function call and return, recording exact call counts and accumulated time.

Understanding cumtime vs tottime

The most important output columns are:

  • tottime: time spent IN this function, excluding time in subfunctions
  • cumtime: cumulative time - time in this function INCLUDING all subfunctions it called
  • ncalls: number of times the function was called
# profile_driver.py
import cProfile
import pstats
import io

def process_batch(data):
    features = extract_features(data)
    normalized = normalize(features)
    return predict(normalized)

def extract_features(data):
    # Imagine this is slow
    import time
    time.sleep(0.1)
    return data

def normalize(features):
    return [x / max(features) for x in features]

def predict(normalized):
    return sum(normalized)

# Enable and disable the profiler around only the region you want to measure
data = list(range(1000))

profiler = cProfile.Profile()
profiler.enable()

for _ in range(10):
    process_batch(data)

profiler.disable()

# Print formatted stats
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.strip_dirs() # remove long path prefixes
stats.sort_stats("cumulative") # sort by cumulative time
stats.print_stats(20) # show top 20 functions
print(stream.getvalue())
# Example output (abridged):
#
#    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
#        10    0.000    0.000    1.003    0.100 profile_driver.py:5(process_batch)
#        10    0.001    0.000    1.003    0.100 profile_driver.py:11(extract_features)
#        10    1.002    0.100    1.002    0.100 {built-in method time.sleep}
#        10    0.000    0.000    0.001    0.000 profile_driver.py:17(normalize)
#        10    0.000    0.000    0.000    0.000 profile_driver.py:20(predict)

Automated Profiling Wrapper

# profiling_utils.py
import cProfile
import pstats
import io
import functools
from typing import Callable, TypeVar, Any

F = TypeVar("F", bound=Callable[..., Any])

def profile(n_calls: int = 1, sort_by: str = "cumulative", top_n: int = 20):
    """Decorator that profiles a function and prints stats."""
    def decorator(func: F) -> F:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            profiler = cProfile.Profile()
            profiler.enable()
            result = None
            for _ in range(n_calls):
                result = func(*args, **kwargs)
            profiler.disable()

            stream = io.StringIO()
            stats = pstats.Stats(profiler, stream=stream)
            stats.strip_dirs()
            stats.sort_stats(sort_by)
            stats.print_stats(top_n)
            print(f"\n--- Profile: {func.__name__} ({n_calls} calls) ---")
            print(stream.getvalue())
            return result
        return wrapper  # type: ignore
    return decorator

@profile(n_calls=100, sort_by="tottime", top_n=10)
def my_expensive_function(data):
    return [x**2 for x in data]

# Usage
my_expensive_function(list(range(10000)))

Saving and Loading Profile Data

# Save profile data for later analysis
profiler = cProfile.Profile()
profiler.enable()
run_workload()
profiler.disable()
profiler.dump_stats("profile_output.prof")

# Load and analyze later
stats = pstats.Stats("profile_output.prof")
stats.sort_stats("cumulative")
stats.print_callers("slow_function") # who called slow_function?
stats.print_callees("slow_function") # what did slow_function call?

line_profiler - Line-by-Line Timing

cProfile tells you which function is slow. line_profiler tells you which line inside that function is slow. They are complementary tools, not alternatives.

pip install line_profiler
# linalg_ops.py
from line_profiler import profile  # importable decorator (line_profiler 4.1+)

@profile
def cosine_similarity_batch(queries, keys):
    """Compute cosine similarity between query batch and key matrix."""
    import numpy as np

    # Normalize queries
    q_norms = np.linalg.norm(queries, axis=1, keepdims=True)
    queries_normalized = queries / q_norms  # line A

    # Normalize keys
    k_norms = np.linalg.norm(keys, axis=1, keepdims=True)
    keys_normalized = keys / k_norms  # line B

    # Compute dot products
    scores = queries_normalized @ keys_normalized.T  # line C

    return scores

# Run with: kernprof -l -v linalg_ops.py
# Or programmatically:
import line_profiler
import numpy as np

lp = line_profiler.LineProfiler()
lp.add_function(cosine_similarity_batch)

queries = np.random.randn(64, 512).astype(np.float32)
keys = np.random.randn(10000, 512).astype(np.float32)

lp.runcall(cosine_similarity_batch, queries, keys)
lp.print_stats()
# Example output:
# Timer unit: 1e-06 s
#
# Total time: 0.045123 s
# File: linalg_ops.py
# Function: cosine_similarity_batch at line 4
#
# Line # Hits Time Per Hit % Time Line Contents
# =========================================================
# 9 1 1823.0 1823.0 4.0 q_norms = np.linalg.norm(...)
# 10 1 812.0 812.0 1.8 queries_normalized = queries / q_norms
# 13 1 1956.0 1956.0 4.3 k_norms = np.linalg.norm(...)
# 14 1 798.0 798.0 1.8 keys_normalized = keys / k_norms
# 17 1 39734.0 39734.0 88.1 scores = queries_normalized @ keys_normalized.T

Line 17 (the matrix multiply) dominates. That is where to focus.


py-spy - The Zero-Overhead Sampling Profiler

py-spy is a sampling profiler written in Rust. It attaches to a running Python process, periodically reads the interpreter's call stack from process memory, and aggregates the samples. No code changes required. No restart needed. Near-zero overhead (typically under 1%).

pip install py-spy

# Profile a running process (requires root or ptrace capability)
py-spy top --pid 12345

# Record a flame graph for a script
py-spy record -o profile.svg -- python my_training_script.py

# Dump current stack of a running process (good for diagnosing hangs)
py-spy dump --pid 12345

# Profile with subprocesses (e.g., DataLoader workers)
py-spy record --subprocesses -o profile.svg -- python train.py

Why py-spy Is Invaluable for Production Profiling

cProfile and line_profiler require modifying your code and restarting your process. In production, or during a long-running training job, you cannot do that. py-spy attaches to the running process externally, so you can profile a job that has been running for 6 hours without interrupting it.

The flame graph output is particularly useful: functions are shown as rectangles, width represents time, and the stack is shown vertically. Wide rectangles in the middle of a call chain are bottlenecks.

# py-spy can also be invoked programmatically from a script
# by launching it as a subprocess (useful for automation)

import subprocess
import time

# Start your process
proc = subprocess.Popen(["python", "train.py"])

# Wait for warmup
time.sleep(30)

# Attach py-spy
subprocess.run([
    "py-spy", "record",
    "--pid", str(proc.pid),
    "--duration", "60",       # profile for 60 seconds
    "--output", "profile.svg",
    "--format", "flamegraph",
])

Scalene - CPU, Memory, and GPU Profiling in One Tool

Scalene is the most comprehensive Python profiler available. It profiles CPU time (distinguishing Python vs C vs GPU), memory allocation by line, and GPU utilization - all simultaneously with a sampling approach.

pip install scalene

# Profile a script with web UI
scalene my_training_script.py

# Profile without web UI (print to terminal)
scalene --cli my_training_script.py

# Profile with memory tracking
scalene --memory my_training_script.py

# Profile only a marked section of code (pair with scalene_profiler.start()/stop())
scalene --off my_training_script.py
# scalene_demo.py
# Scalene distinguishes Python overhead from native code time

import numpy as np
from scalene import scalene_profiler

def python_heavy_function(n):
    """This is dominated by Python interpreter overhead."""
    result = []
    for i in range(n):
        result.append(i * i)  # Python integer multiplication
    return result

def numpy_function(n):
    """This runs mostly in C/Fortran (NumPy internals)."""
    arr = np.arange(n)
    return arr * arr  # NumPy vectorized operation

# Scalene output will show:
# python_heavy_function: ~95% Python time, ~5% native time
# numpy_function: ~5% Python time, ~95% native time (NumPy C code)

# This distinction tells you where Cython or NumPy vectorization helps
# (Python-heavy functions) vs where you need algorithmic improvements
# (native-heavy functions that are already using C)

scalene_profiler.start()
python_heavy_function(1_000_000)
numpy_function(1_000_000)
scalene_profiler.stop()

Scalene's output divides each line's time into three categories:

  • Python %: time executing Python bytecode (can be reduced with Cython or NumPy vectorization)
  • native %: time executing C/Fortran/CUDA code (already compiled)
  • GPU %: time executing GPU kernels

This distinction is not available in cProfile or line_profiler, which conflate Python and native execution time.


C-Level Profiling with perf

When your bottleneck is inside a C extension (NumPy, PyTorch native ops, Cython code), Python-level profilers cannot see inside those functions. You need a system-level profiler.

Linux perf

# Install perf
sudo apt-get install linux-tools-common linux-tools-$(uname -r)

# Record CPU samples for a Python process
# -g = capture call graphs, --call-graph dwarf = use DWARF debug info
sudo perf record -g --call-graph dwarf -- python train.py

# Generate human-readable report
sudo perf report

# Generate flame graph (requires FlameGraph scripts)
git clone https://github.com/brendangregg/FlameGraph
sudo perf script | ./FlameGraph/stackcollapse-perf.pl | \
./FlameGraph/flamegraph.pl > perf_flamegraph.svg

For Python C extensions, you need debug symbols. Build with -g -O2 (optimization but with debug info):

# setup.py snippet for a perf-compatible Cython build
from setuptools import Extension

Extension(
    name="fast_math",
    sources=["fast_math.pyx"],
    extra_compile_args=["-O2", "-g"],  # optimization + debug symbols
)

Annotating Python with perf

Python 3.12+ added support for perf profiling with function-level annotations:

# Python 3.12+ with perf support
python -X perf train.py &
PID=$!
sudo perf record -g -p $PID

Valgrind and Callgrind for C/Cython Code

Valgrind's Callgrind tool provides exact instruction-level profiling - not sampling-based, but counting every instruction. It is 20-100x slower than the actual program but gives precise measurements useful for micro-optimization.

# Profile with Callgrind
valgrind --tool=callgrind \
--callgrind-out-file=callgrind.out \
python my_script.py

# View results with KCachegrind (Linux) or QCachegrind (macOS)
kcachegrind callgrind.out

# Or parse with callgrind_annotate
callgrind_annotate callgrind.out

Callgrind output shows:

  • Instruction count per function (Ir column)
  • Cache miss rates (D1 miss, LL miss)
  • Branch misprediction rates
# Detect memory errors in C extensions (critical before shipping)
valgrind --tool=memcheck \
--leak-check=full \
--show-leak-kinds=all \
python my_script.py

# gprof for compiled C/C++ programs (not Python extensions)
# Compile with -pg flag
gcc -pg -O2 -o my_program my_program.c
./my_program # generates gmon.out
gprof my_program gmon.out > analysis.txt

Profiling ML Training Loops

ML training has unique profiling challenges: the bottleneck can be in the Python orchestration, the data pipeline, CPU preprocessing, GPU computation, or inter-GPU communication. Standard Python profilers only see part of the picture.

torch.profiler - The Complete ML Profiler

PyTorch's profiler traces CPU and GPU operations simultaneously, producing output compatible with Chrome's trace viewer and TensorBoard.

# torch_profiler_example.py
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity, record_function

# Simple model for demonstration
model = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
).cuda()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

data = torch.randn(64, 512, device="cuda")
labels = torch.randint(0, 10, (64,), device="cuda")

# Profile with Chrome trace export
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,    # record tensor shapes (useful for debugging)
    profile_memory=True,   # track memory allocations
    with_stack=True,       # capture Python call stacks
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./profiler_logs"),
) as prof:
    for step in range(20):
        with record_function("forward_pass"):
            output = model(data)

        with record_function("loss_computation"):
            loss = loss_fn(output, labels)

        with record_function("backward_pass"):
            optimizer.zero_grad()
            loss.backward()

        with record_function("optimizer_step"):
            optimizer.step()

        prof.step()  # CRITICAL: advance the profiler at each step

# Print key averages grouped by operation
print(prof.key_averages().table(
    sort_by="cuda_time_total",
    row_limit=20,
))

Profiling with Schedule (Warmup, Active, Repeat)

For long training runs, you do not want to profile every step - the overhead is too high. Use the schedule API:

# torch_scheduled_profiler.py
from torch.profiler import schedule

# skip_first=1: ignore the very first step entirely (startup noise).
# Each cycle after that: wait 1 step (idle), warmup 1 step (profiler running,
# data discarded), then record 3 active steps. repeat=2 runs two such cycles.
profiler_schedule = schedule(
    skip_first=1,
    wait=1,
    warmup=1,
    active=3,
    repeat=2,
)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=profiler_schedule,
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./logs/profiler"),
) as prof:
    for step, (data, labels) in enumerate(dataloader):
        train_step(model, data, labels, optimizer, loss_fn)
        prof.step()
        if step >= 20:
            break

# View results in TensorBoard:
# tensorboard --logdir ./logs/profiler

Reading Profiler Output

# After profiling, analyze key averages
key_avgs = prof.key_averages(group_by_input_shape=True)

# Sort by self CUDA time (time inside the op, excluding called ops)
print(key_avgs.table(sort_by="self_cuda_time_total", row_limit=10))

# Sort by CPU time
print(key_avgs.table(sort_by="cpu_time_total", row_limit=10))

# Export Chrome trace for visual inspection
prof.export_chrome_trace("trace.json")
# Open chrome://tracing in Chrome, load trace.json
# You will see a timeline of every CPU and GPU operation

CUDA Event Timing - Precise GPU Measurements

The Python time.time() function cannot accurately measure GPU operations because GPU kernels are launched asynchronously. The CPU returns from the launch call immediately; the GPU runs the kernel later. To time GPU operations accurately, use CUDA events.

# cuda_timing.py
import torch
import time

def time_gpu_operation_wrong(func, *args, n_runs=100):
    """WRONG: CPU time includes only kernel launch, not execution."""
    start = time.time()
    for _ in range(n_runs):
        func(*args)
    end = time.time()
    return (end - start) / n_runs * 1000  # ms

def time_gpu_operation_correct(func, *args, n_runs=100, n_warmup=10):
    """CORRECT: Uses CUDA events that synchronize with GPU execution."""
    # Warmup runs (GPU JIT compilation, cache warmup)
    for _ in range(n_warmup):
        func(*args)
    torch.cuda.synchronize()  # wait for all GPU work to complete

    # Timed runs
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)

    start_event.record()
    for _ in range(n_runs):
        func(*args)
    end_event.record()

    # synchronize() blocks until the end_event is reached on the GPU
    torch.cuda.synchronize()

    elapsed_ms = start_event.elapsed_time(end_event)
    return elapsed_ms / n_runs

# Example usage
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

matmul_time = time_gpu_operation_correct(torch.mm, a, b)
print(f"Matrix multiply (4096x4096): {matmul_time:.3f} ms")

# Custom context manager for GPU timing.
# Do a few warmup iterations before timing (see time_gpu_operation_correct above),
# or ignore the first few measurements when reusing this inside a loop.
class CUDATimer:
    def __init__(self, name: str):
        self.name = name
        self.start = torch.cuda.Event(enable_timing=True)
        self.end = torch.cuda.Event(enable_timing=True)
        self.elapsed_ms = None

    def __enter__(self):
        self.start.record()
        return self

    def __exit__(self, *args):
        self.end.record()
        torch.cuda.synchronize()  # wait for the end event before reading timestamps
        self.elapsed_ms = self.start.elapsed_time(self.end)
        print(f"{self.name}: {self.elapsed_ms:.3f} ms")

# Usage (after warmup runs)
with CUDATimer("attention_forward"):
    attention_output = attention_layer(x)

Custom Profiling Context Manager

For production training code, it is useful to have a lightweight context manager that you can leave in the codebase and toggle with a flag:

# profiling_context.py
import time
import contextlib
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)

class ProfilingContext:
    """Lightweight context manager for tracking stage timing."""

    def __init__(self, enabled: bool = True):
        self.enabled = enabled
        self.timings: Dict[str, List[float]] = {}

    @contextlib.contextmanager
    def track(self, name: str):
        if not self.enabled:
            yield
            return

        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = (time.perf_counter() - start) * 1000  # ms
            if name not in self.timings:
                self.timings[name] = []
            self.timings[name].append(elapsed)

    def summary(self) -> str:
        lines = ["--- Timing Summary ---"]
        for name, times in sorted(self.timings.items()):
            avg = sum(times) / len(times)
            total = sum(times)
            n = len(times)
            lines.append(f"  {name}: avg={avg:.2f}ms, total={total:.0f}ms, n={n}")
        return "\n".join(lines)

# Typical usage in a training loop
profiler = ProfilingContext(enabled=True)

for epoch in range(num_epochs):
    for batch_idx, batch in enumerate(dataloader):
        with profiler.track("data_loading"):
            data, labels = batch[0].cuda(), batch[1].cuda()

        with profiler.track("forward_pass"):
            output = model(data)

        with profiler.track("loss_and_backward"):
            loss = criterion(output, labels)
            optimizer.zero_grad()
            loss.backward()

        with profiler.track("optimizer_step"):
            optimizer.step()

        if batch_idx % 100 == 0:
            print(profiler.summary())

Identifying Real Bottlenecks in ML Training

ML training performance problems fall into several categories. Knowing the categories helps you choose the right profiler.

The GPU Utilization Check

Before reaching for any profiler, check GPU utilization:

# Watch GPU utilization every 100ms
watch -n 0.1 nvidia-smi

# More detailed: GPU utilization, memory, power
nvidia-smi dmon -s pum -d 1

# For multi-GPU: check all GPUs
nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,memory.used \
--format=csv --loop=1

If GPU utilization is consistently below 85%, you have a bottleneck outside the GPU kernels:

| GPU Utilization | Likely Bottleneck |
| --- | --- |
| Below 30% | Data loading is starving the GPU |
| 30-60% (oscillating) | Python preprocessing or CPU augmentation |
| 60-85% | Model architecture has too many small ops or Python control flow |
| Above 85% | GPU is the bottleneck - this is good |
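
nvidia-smi gives a quick visual read; to log the same counters from inside the training script, you can poll NVML from Python. This is a minimal sketch assuming the nvidia-ml-py package (imported as pynvml) is installed; the device index and 0.5-second polling interval are arbitrary choices.

# gpu_util_logger.py - poll GPU utilization from inside a training script
# Assumes the nvidia-ml-py package (module name: pynvml) is installed.
import threading
import time

import pynvml

def log_gpu_utilization(stop_event, device_index=0, interval_s=0.5):
    """Print GPU and memory-controller utilization until stop_event is set."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    try:
        while not stop_event.is_set():
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            print(f"gpu={util.gpu}% mem={util.memory}%")
            time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()

# Usage: run the logger in a background thread around the training loop
stop = threading.Event()
logger_thread = threading.Thread(target=log_gpu_utilization, args=(stop,), daemon=True)
logger_thread.start()
# ... training loop ...
stop.set()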

Data Loading Bottleneck Detection

# data_loading_benchmark.py
import torch
import time
from torch.utils.data import DataLoader

def benchmark_dataloader(dataset, num_workers_list, batch_size=64):
    """Compare DataLoader performance across different num_workers settings."""
    results = {}
    for nw in num_workers_list:
        loader = DataLoader(
            dataset,
            batch_size=batch_size,
            num_workers=nw,
            pin_memory=True,  # for CUDA transfer speed
            prefetch_factor=2 if nw > 0 else None,
            persistent_workers=True if nw > 0 else False,
        )
        # Warmup (spawn workers, fill prefetch queues)
        for batch in loader:
            break

        start = time.perf_counter()
        for _ in loader:
            pass
        elapsed = time.perf_counter() - start

        results[nw] = elapsed
        print(f"num_workers={nw}: {elapsed:.2f}s")

    return results

# If time drops significantly as num_workers increases, data loading is your bottleneck
# If time does not change, the bottleneck is elsewhere (GPU compute, or preprocessing)
benchmark_dataloader(your_dataset, num_workers_list=[0, 2, 4, 8, 16])

Using torch.utils.bottleneck

PyTorch provides a convenience script that runs cProfile and autograd profiler together:

# Quick bottleneck report
python -m torch.utils.bottleneck train_one_step.py

# The output shows:
# 1. Environment info (CUDA version, PyTorch version)
# 2. cProfile results for CPU operations
# 3. Autograd profiler results for CUDA operations

The Identify-and-Fix Workflow

Here is the complete workflow from a real bottleneck diagnosis:

# bottleneck_demo.py
# Simulates a training loop with identifiable bottlenecks

import numpy as np
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity, record_function

class SlowModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(512, 1024)
        self.fc2 = nn.Linear(1024, 10)

    def forward(self, x):
        # Bottleneck 1: unnecessary Python loop
        results = []
        for i in range(x.shape[0]):  # should be: x = self.fc1(x) directly
            results.append(self.fc1(x[i:i+1]))
        x = torch.cat(results, dim=0)

        # Bottleneck 2: unnecessary CPU-GPU transfer
        x_cpu = x.detach().cpu().numpy()    # detach + move to CPU (numpy() needs a detached tensor)
        x_cpu = np.maximum(x_cpu, 0)        # ReLU on CPU
        x = torch.from_numpy(x_cpu).cuda()  # moves back to GPU

        x = self.fc2(x)
        return x

class FastModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(512, 1024)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(1024, 10)

    def forward(self, x):
        x = self.fc1(x)   # batch operation, no Python loop
        x = self.relu(x)  # stays on GPU
        x = self.fc2(x)
        return x

# Profile both to demonstrate the difference
data = torch.randn(64, 512).cuda()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("slow_model"):
        slow_out = SlowModel().cuda()(data)

print("--- SlowModel ---")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("fast_model"):
        fast_out = FastModel().cuda()(data)

print("--- FastModel ---")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Production Engineering Notes

Continuous Profiling in Production

For services (inference APIs, data processing pipelines), run py-spy continuously and export flame graphs to a monitoring system. Tools like Datadog APM, Pyroscope, and Google Cloud Profiler support continuous profiling for Python services.

# Pyroscope continuous profiling integration
import pyroscope

pyroscope.configure(
    application_name="ml-inference-service",
    server_address="http://pyroscope:4040",
    tags={"model": "resnet50", "version": "v2"},
)

# From this point forward, Pyroscope samples the process continuously
# and uploads flame graphs to the Pyroscope server

Profiling DataLoader Workers

DataLoader workers run in separate processes. py-spy with --subprocesses flag traces all of them:

py-spy record --subprocesses --output dataloader_profile.svg \
-- python train.py --num-workers 8

Memory Profiling with memory_profiler

# memory_demo.py
from memory_profiler import profile

@profile
def load_and_process_embeddings(filepath):
    import numpy as np
    # Line-by-line memory usage
    embeddings = np.load(filepath)  # memory spike here
    normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    index = build_faiss_index(normalized)  # index may be larger than embeddings
    return index

# Run with: python -m memory_profiler memory_demo.py
# Or: mprof run memory_demo.py && mprof plot

Common Mistakes

:::danger Profiling Without Warmup on GPU Code

# WRONG - first run includes CUDA JIT compilation time
start = time.time()
output = model(data)
torch.cuda.synchronize()
elapsed = time.time() - start
print(f"Forward pass: {elapsed*1000:.1f} ms") # WRONG: includes 100-500ms JIT overhead

# RIGHT - run several warmup iterations before measuring
for _ in range(5):
    model(data)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    output = model(data)
end.record()
end.record()
torch.cuda.synchronize()
print(f"Forward pass: {start.elapsed_time(end)/100:.2f} ms")

CUDA kernels are JIT-compiled on first use. Including the first run in your timing measurement gives you a number 10-100x higher than steady-state performance. :::

:::danger Forgetting torch.cuda.synchronize() Before Stopping the Timer

# WRONG - measures kernel launch time, not kernel execution time
start = time.time()
output = model(data)
elapsed = time.time() - start # kernel may still be running on GPU

# RIGHT
start = time.time()
output = model(data)
torch.cuda.synchronize() # block until GPU finishes all queued work
elapsed = time.time() - start

GPU operations are asynchronous. The CPU launches a kernel and immediately returns. Without synchronize(), your timer captures only the kernel launch latency (microseconds), not the actual computation time (milliseconds). :::

:::warning cProfile Has 5-10x Overhead on Function-Call-Heavy Code

# WRONG approach: profiling with cProfile in production
# cProfile instruments every function call and return
# For code with millions of small function calls, this distorts the profile

# If your code calls millions of tiny functions, cProfile overhead itself
# becomes a significant fraction of the measured time

# BETTER: use py-spy (sampling, near-zero overhead)
# py-spy record -o profile.svg -- python production_script.py

# Or: profile with cProfile but only on a reduced workload
# to get relative rankings, not absolute times

:::

:::warning Memory Profiler Slows Execution 10-50x

memory_profiler instruments every line with memory measurement calls. It is extremely useful for diagnosing memory growth, but the overhead makes timing numbers from memory-profiled code meaningless. Use memory profiler and timing profilers on separate runs, never simultaneously. :::


Interview Q&A

Q: What is the difference between cumtime and tottime in cProfile output, and when does each matter?

A: tottime (total time) is the time spent executing the body of that specific function, excluding any time spent in functions it called. cumtime (cumulative time) includes the function's own time plus all time spent in every function it called directly or indirectly. tottime identifies where the CPU is actually spending cycles - a function with high tottime is doing expensive work itself. cumtime helps you understand the overall cost of a call path - a function with high cumtime but low tottime is an orchestrator that is calling expensive things. In practice: sort by cumtime first to find the expensive call chains, then look at tottime to find where in that chain the actual computation is happening. A function with high cumtime but tottime near zero means the cost is in its callees, so you need to look one level deeper.

Q: Why do you need CUDA events to measure GPU operation timing, and why does time.time() give wrong results?

A: GPU kernels execute asynchronously. When you call model(data), the CPU submits the compute operations to the GPU's command queue and returns immediately. The GPU is still processing these operations after the CPU has moved on. If you call time.time() right after model(data) without synchronizing, you measure only the time to enqueue the work - typically microseconds for kernel launches - not the actual execution time which may be milliseconds. CUDA events solve this by creating markers in the GPU's command stream. start.record() inserts a timestamp marker before your operations in the GPU stream. end.record() inserts one after. torch.cuda.synchronize() blocks the CPU until the GPU has processed all commands including the end marker. Then start.elapsed_time(end) reads the GPU hardware timestamps and returns the actual GPU execution time. This is the only accurate way to time GPU operations.

Q: When would you use py-spy instead of cProfile, and what is the key technical difference between them?

A: Use py-spy when you cannot modify the code (production processes), when cProfile's overhead distorts the results (very frequent function calls), when you need to profile a long-running job without restarting it, or when you want to diagnose a hang or deadlock in a running process. The technical difference is the profiling mechanism. cProfile is a deterministic profiler - it instruments every function call and return by registering Python trace hooks. It counts exactly how many times each function is called and accumulates exact time. py-spy is a sampling profiler - it runs as a separate process, attaches to the target via ptrace, and periodically reads the target's call stack from memory. It does not modify the target process at all. Sampling means occasional calls are missed (statistical noise), but overhead is near zero. cProfile gives exact counts but adds 5-10x overhead. py-spy gives approximate distributions with under 1% overhead. For production profiling, py-spy is almost always the right choice.

Q: Scalene shows a function as 90% native time. What does this tell you about optimization strategies?

A: "90% native time" means the function is spending 90% of its time executing compiled C, C++, Fortran, or CUDA code rather than Python bytecode. For NumPy-heavy code, this means the bottleneck is inside NumPy's C/Fortran routines - the Python overhead of calling NumPy is already minimal. For Cython code, it means the Cython compilation is working well. The optimization implications are: you cannot improve this by adding more type declarations or rewriting Python loops in Cython, because the Python portion is already small. Instead you need to look at the algorithm (are you doing unnecessary work?), the data layout (cache efficiency of the NumPy operations), or whether a different library (like cuBLAS for matrix operations) is faster. If the function is doing a matrix multiply that is 90% native time, the relevant question is "is this the most efficient matrix multiply available?" not "how do I reduce Python overhead?"

Q: Describe the identify-and-fix workflow for an ML training loop that is slower than expected.

A: Start with the coarsest measure first. Check nvidia-smi to see GPU utilization. If it is below 80%, you are not GPU-bound - the bottleneck is elsewhere. Profile the data pipeline with a DataLoader-only loop (no model forward pass) to see how fast you can load and preprocess data in isolation. If that alone consumes most of the expected step time, fix the data pipeline first: increase num_workers, use pin_memory=True, add prefetching, move augmentation to GPU. If data loading is not the problem, run torch.profiler for a few steps and export the Chrome trace. Open the trace and look for long CPU-side pauses between GPU kernels - these indicate Python overhead, unnecessary CPU-GPU transfers, or synchronization points. Look for which GPU operations have high self_cuda_time_total - those are the most expensive kernels. Use record_function labels to tag logical sections (forward, loss, backward, optimizer) and see where time is going at a high level. Then drill into the expensive section with line_profiler for the Python parts or Nsight Systems for the CUDA kernel details. Verify every fix by re-running the same profiler benchmark - it is common for one fix to expose a different bottleneck that was masked.

Q: What is the difference between a sampling profiler and a tracing profiler, and what are the tradeoffs?

A: A tracing profiler (cProfile, Java's JFR in tracing mode) instruments every event of interest - typically every function call and return. It records exactly when each event happened, how long each function took, and the exact call count. The overhead is proportional to the number of events: code with millions of function calls per second may see 10-100x slowdown, which distorts the profile results and makes some bottlenecks appear worse than they are. A sampling profiler (py-spy, Austin, Google Cloud Profiler) runs alongside the target process and periodically interrupts it to record the current call stack. The sample rate is configurable (typically 100-1000 samples/second). Functions that consume more CPU time appear in more samples. Functions that run in under one sampling interval may not appear at all. Overhead is near-zero because the target process runs normally between samples. The tradeoff is statistical noise: sampling profilers give good estimates of hot functions but miss rare events. Tracing profilers give exact measurements but may have enough overhead to change what you are measuring. For production profiling, sampling profilers win on overhead. For development profiling of correctness (counting exact call frequencies), tracing profilers give precise answers.
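
To make the sampling mechanism concrete, here is a toy in-process sampler. It is only an illustration of the idea: real tools like py-spy read the call stack from outside the process, whereas this sketch uses Python's own sys._current_frames() from a background thread; the workload function and sampling interval are arbitrary.

# toy_sampler.py - illustration of how a sampling profiler accumulates stacks
import collections
import sys
import threading
import time

def sample_stacks(target_thread_id, counts, stop_event, interval=0.01):
    """Every `interval` seconds, record the target thread's current call stack."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(target_thread_id)
        stack = []
        while frame is not None:
            stack.append(frame.f_code.co_name)
            frame = frame.f_back
        counts[tuple(reversed(stack))] += 1  # root -> leaf
        time.sleep(interval)

def busy_function():
    return sum(i * i for i in range(200_000))

def workload():
    for _ in range(50):
        busy_function()

if __name__ == "__main__":
    counts = collections.Counter()
    stop = threading.Event()
    sampler = threading.Thread(
        target=sample_stacks,
        args=(threading.main_thread().ident, counts, stop),
    )
    sampler.start()
    workload()
    stop.set()
    sampler.join()
    # Stacks containing busy_function should dominate the sample counts
    for stack, n in counts.most_common(5):
        print(n, " -> ".join(stack))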


Summary

Profiling is not a step you add after you are done writing code. It is the measurement infrastructure that makes every performance decision evidence-based rather than intuition-based.

The tool hierarchy is practical: start with GPU utilization (nvidia-smi) to identify whether you are GPU-bound. Use cProfile or py-spy to find the expensive function. Use line_profiler to find the expensive line inside that function. Use torch.profiler and CUDA events for GPU-specific timing. Use Scalene when you need to distinguish Python overhead from native code time, which determines whether Cython or algorithmic improvement is the right fix.

The engineer who spent three weeks optimizing the wrong function was not incompetent. They optimized what looked expensive. The engineer who fixed the real bottleneck in four hours measured first. The profiler is what made the difference. It always is.
