cProfile and pstats - Function-Level Profiling
Before reading further, predict the output of this profiling session:
import cProfile
def outer():
for _ in range(1000):
inner()
def inner():
return sum(range(100))
cProfile.run('outer()')
How many function calls will cProfile report? What will have the highest tottime? What will have the highest cumtime?
The answer: cProfile reports approximately 4,005 function calls:
outer()called onceinner()called 1,000 timessum()called 1,000 times (insideinner)range()called 1,000 times (insideinner)- Plus the top-level
<module>call and a few internal calls
4005 function calls in 0.012 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 0.012 0.012 <string>:1(<module>)
1 0.002 0.002 0.011 0.011 example.py:3(outer)
1000 0.003 0.000 0.009 0.000 example.py:7(inner)
1000 0.004 0.000 0.004 0.000 {built-in method builtins.sum}
1000 0.002 0.000 0.002 0.000 {built-in method builtins.range}
1 0.000 0.000 0.012 0.012 {built-in method builtins.exec}
Notice: inner has high cumtime (includes sum + range time) but outer has low tottime (its own code just loops and calls). This distinction between tottime and cumtime is the single most important concept in this lesson.
What You Will Learn
- How cProfile works internally (deterministic tracing via
PyEval_SetProfile) - The precise meaning of every column in cProfile output
- How to use pstats to sort, filter, and drill into profiles
- How to visualize profiles with snakeviz
- The difference between deterministic and statistical profiling
- How to measure and account for profiling overhead
- How to profile a real FastAPI endpoint end-to-end
Prerequisites
- Completed Lesson 1 (Profiling Strategy - Amdahl's Law, benchmarking methodology)
- Understanding of the CPython call stack (from Intermediate course)
- Basic familiarity with function call overhead in Python
Part 1 - How cProfile Works
cProfile is a deterministic profiler. It instruments every function call and return by registering a C-level callback via PyEval_SetProfile. Every time CPython enters or exits a function, cProfile's callback fires and records a timestamp.
Because it traces every call, cProfile's overhead is proportional to the number of function calls, not the total execution time. A tight loop calling a trivial function millions of times will show significant profiling overhead.
cProfile vs profile
Python ships two profiling modules:
| Feature | cProfile | profile |
|---|---|---|
| Implementation | C extension | Pure Python |
| Overhead | ~2-5x slowdown | ~10-50x slowdown |
| Use case | Production profiling | When you need to subclass the profiler |
| Availability | CPython only | All Python implementations |
Always use cProfile unless you have a specific reason to customize the profiler class.
Part 2 - Reading Profile Output
Every column in cProfile output has a precise meaning. Getting these wrong leads to incorrect optimization decisions.
import cProfile
import pstats
import io
def simulate_workload():
"""A function with multiple call patterns to illustrate profile columns."""
def parse_records(data):
"""Called many times, each call is cheap."""
return [item.strip() for item in data]
def validate_batch(records):
"""Called once per batch, does moderate work."""
for record in records:
if not record:
raise ValueError("Empty record")
return True
def expensive_transform(records):
"""Called once, does heavy computation."""
result = []
for record in records:
transformed = record.upper()
for _ in range(100):
transformed = transformed.swapcase()
result.append(transformed)
return result
data = [f" record_{i} " for i in range(500)]
for _ in range(200):
parsed = parse_records(data)
validate_batch(parsed)
expensive_transform(parsed)
# Profile it
profiler = cProfile.Profile()
profiler.enable()
simulate_workload()
profiler.disable()
# Print stats
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats('cumulative')
stats.print_stats(10)
print(stream.getvalue())
Column Reference
| Column | Meaning | What it tells you |
|---|---|---|
ncalls | Number of times this function was called | High ncalls with high tottime = optimization target |
tottime | Total time spent in this function only (excludes sub-calls) | Where CPU cycles actually burn |
percall (1st) | tottime / ncalls | Cost per invocation (excluding sub-calls) |
cumtime | Cumulative time in this function including all sub-calls | Total wall-clock cost of calling this function |
percall (2nd) | cumtime / ncalls | Total cost per invocation (including sub-calls) |
filename:lineno(function) | Where the function is defined | Navigation to source |
:::danger tottime vs cumtime - The Critical Distinction
tottimeanswers: "How much time did this function's own code consume?"cumtimeanswers: "How much time passed between entering and leaving this function?"
If a function has high cumtime but low tottime, it is not slow itself - it calls something slow. Follow the call chain down until you find the function where tottime is high. That is your real bottleneck.
:::
The ncalls Column: Recursive Calls
When a function is recursive, ncalls shows two numbers separated by a slash:
ncalls tottime ... filename:lineno(function)
120/1 0.003 ... example.py:10(recursive_func)
This means: 120 total calls, but only 1 primitive (non-recursive) call. The function entered itself 119 times from that single entry point.
Interpreting a Real Profile
# A realistic profile might look like:
#
# ncalls tottime percall cumtime percall filename:lineno(function)
# 1 0.000 0.000 3.450 3.450 views.py:15(course_list)
# 500 0.010 0.000 3.200 0.006 serializers.py:30(serialize)
# 500 0.005 0.000 2.800 0.006 models.py:50(to_dict)
# 10000 2.700 0.000 2.700 0.000 {method 'execute' of 'cursor'}
# 500 0.080 0.000 0.080 0.000 {method 'fetchall' of 'cursor'}
#
# Reading this profile:
# 1. course_list has 3.45s cumtime but 0.000s tottime → it's a dispatcher
# 2. serialize is called 500 times → one per course
# 3. The real time sink: cursor.execute at 2.7s tottime for 10,000 calls
# 4. 10,000 queries for 500 courses = 20 queries per course = N+1 problem
Part 3 - Three Ways to Use cProfile
Method 1: Command Line
The simplest way to profile any Python script:
# Profile a script, output sorted by cumulative time
python -m cProfile -s cumulative my_script.py
# Sort by total time (where CPU actually burns)
python -m cProfile -s tottime my_script.py
# Save profile data to a file for later analysis
python -m cProfile -o profile_output.prof my_script.py
Method 2: Programmatic Wrapping
For profiling specific functions within a larger application:
import cProfile
def profile_function(func, *args, **kwargs):
"""Profile a single function call and return results + profiler."""
profiler = cProfile.Profile()
profiler.enable()
result = func(*args, **kwargs)
profiler.disable()
return result, profiler
# Usage
result, prof = profile_function(process_dataset, raw_data)
# Analyze
import pstats
stats = pstats.Stats(prof)
stats.strip_dirs()
stats.sort_stats('tottime')
stats.print_stats(20)
Method 3: Context Manager
For profiling arbitrary code blocks:
import cProfile
import pstats
from contextlib import contextmanager
@contextmanager
def profiled(sort_by='cumulative', lines=20):
"""Context manager for profiling a block of code."""
profiler = cProfile.Profile()
profiler.enable()
try:
yield profiler
finally:
profiler.disable()
stats = pstats.Stats(profiler)
stats.strip_dirs()
stats.sort_stats(sort_by)
stats.print_stats(lines)
# Usage
with profiled(sort_by='tottime', lines=10):
result = process_large_dataset(data)
# Or capture the profiler for later analysis
with profiled() as prof:
train_model(X, y)
# prof is available here for further analysis
Method 4: Decorator
import cProfile
import pstats
import functools
def profile_decorator(output_file=None, sort_by='cumulative', lines=20):
"""Decorator that profiles the wrapped function."""
def decorator(func):
@functools.wraps(func)
def wrapper(*args, **kwargs):
profiler = cProfile.Profile()
profiler.enable()
try:
result = func(*args, **kwargs)
finally:
profiler.disable()
stats = pstats.Stats(profiler)
stats.strip_dirs()
stats.sort_stats(sort_by)
stats.print_stats(lines)
if output_file:
profiler.dump_stats(output_file)
return result
return wrapper
return decorator
# Usage
@profile_decorator(output_file='train.prof', sort_by='tottime')
def train_model(X, y, epochs=100):
# ... training logic ...
pass
Part 4 - pstats for Deep Analysis
Raw cProfile output is overwhelming for real applications with hundreds of functions. pstats is the analysis layer that lets you slice and dice profile data.
Sorting
import cProfile
import pstats
# Assume we have a profiler with data
profiler = cProfile.Profile()
profiler.enable()
# ... workload ...
profiler.disable()
stats = pstats.Stats(profiler)
stats.strip_dirs() # Remove long path prefixes for readability
# Sort by different criteria
stats.sort_stats('tottime') # Where CPU actually burns
stats.print_stats(10)
stats.sort_stats('cumulative') # Total cost including sub-calls
stats.print_stats(10)
stats.sort_stats('calls') # Most frequently called
stats.print_stats(10)
Valid sort keys:
| Sort Key | Meaning |
|---|---|
'calls' or 'ncalls' | Number of calls |
'cumulative' or 'cumtime' | Cumulative time |
'tottime' | Total time in function |
'filename' | File name |
'line' | Line number |
'name' | Function name |
'nfl' | Name/file/line |
'stdname' | Standard name (default) |
Filtering
# Show only functions matching a string pattern
stats.sort_stats('cumulative')
stats.print_stats('database') # Only functions with 'database' in the name
stats.print_stats('models.py') # Only functions in models.py
# Limit output count
stats.print_stats(20) # Top 20 functions
stats.print_stats(0.1) # Top 10% of functions
# Combine: top 10 database-related functions by cumulative time
stats.sort_stats('cumulative')
stats.print_stats('database', 10)
Caller/Callee Analysis
This is where pstats becomes truly powerful. You can trace the call graph:
# Who calls the slow function?
stats.print_callers('execute')
# Output:
# Function was called by...
# {cursor.execute} <- models.py:50(to_dict) 10000 times
# <- views.py:20(get_stats) 5 times
# What does the slow function call?
stats.print_callees('course_list')
# Output:
# Function called...
# views.py:15(course_list) -> serializers.py:30(serialize) 500 times
# -> models.py:10(get_all) 1 time
Practical pstats Workflow
import cProfile
import pstats
def analyze_profile(prof_file: str):
"""
A systematic workflow for analyzing a saved profile.
Step 1: Get the big picture
Step 2: Find the hotspot
Step 3: Trace the call chain
Step 4: Identify the root cause
"""
stats = pstats.Stats(prof_file)
stats.strip_dirs()
# Step 1: Big picture - how many calls, total time
print("=" * 60)
print("STEP 1: Big Picture")
print(f"Total calls: {stats.total_calls}")
print(f"Total time: {stats.total_tt:.4f}s")
print()
# Step 2: Top functions by tottime (where CPU burns)
print("=" * 60)
print("STEP 2: Top 10 by tottime (CPU hotspots)")
stats.sort_stats('tottime')
stats.print_stats(10)
# Step 3: Top functions by cumtime (most expensive call trees)
print("=" * 60)
print("STEP 3: Top 10 by cumtime (expensive call trees)")
stats.sort_stats('cumulative')
stats.print_stats(10)
# Step 4: Most-called functions (potential for caching)
print("=" * 60)
print("STEP 4: Top 10 by call count (caching opportunities)")
stats.sort_stats('calls')
stats.print_stats(10)
return stats # Return for interactive exploration
# Usage:
# stats = analyze_profile('my_app.prof')
# stats.print_callers('the_slow_function')
# stats.print_callees('the_slow_function')
Saving and Merging Profiles
# Save profile for later analysis
profiler.dump_stats('my_profile.prof')
# Load and analyze later
stats = pstats.Stats('my_profile.prof')
stats.strip_dirs()
stats.sort_stats('cumulative')
stats.print_stats(20)
# Merge multiple profiles (e.g., from different test runs)
stats = pstats.Stats('run1.prof')
stats.add('run2.prof')
stats.add('run3.prof')
# Now stats contains aggregated data from all three runs
stats.sort_stats('cumulative')
stats.print_stats(20)
Part 5 - Visualization with snakeviz
Text-based profile output works for targeted analysis, but understanding the full call hierarchy of a complex application requires visualization.
# Install
pip install snakeviz
# Generate a profile
python -m cProfile -o output.prof my_script.py
# Visualize - opens in your browser
snakeviz output.prof
Programmatic Visualization
import cProfile
import subprocess
import sys
def profile_and_visualize(func, *args, output_file='profile.prof', **kwargs):
"""Profile a function and open snakeviz automatically."""
profiler = cProfile.Profile()
profiler.enable()
result = func(*args, **kwargs)
profiler.disable()
profiler.dump_stats(output_file)
print(f"Profile saved to {output_file}")
print("Opening snakeviz...")
subprocess.Popen([sys.executable, '-m', 'snakeviz', output_file])
return result
:::tip Reading snakeviz Icicle Charts In the icicle view (recommended for beginners), the topmost bar is the entry point. Each row below shows functions called by the row above. The width of each rectangle is proportional to cumulative time. Look for wide rectangles deep in the stack - those are your hotspots. Narrow rectangles at any depth are not worth optimizing. :::
Alternative Visualization: gprof2dot
For static call-graph images (useful in documentation or CI reports):
pip install gprof2dot graphviz
# Generate a dot graph from profile data
gprof2dot -f pstats profile.prof | dot -Tpng -o call_graph.png
# Or SVG for interactive viewing
gprof2dot -f pstats profile.prof | dot -Tsvg -o call_graph.svg
Part 6 - Deterministic vs Statistical Profiling
cProfile is not the only profiling approach. Understanding the trade-offs helps you choose the right tool.
| Aspect | Deterministic (cProfile) | Statistical (py-spy, pyinstrument) |
|---|---|---|
| Mechanism | Callback on every call/return | Periodic stack sampling |
| Overhead | 2-5x (proportional to call count) | 1-5% (constant, independent of calls) |
| Accuracy | Exact call counts and times | Approximate (sampling error) |
| Best for | Development profiling, exact counts | Production profiling, always-on |
| Blind spots | Distorts timing of call-heavy code | Misses very short functions |
| GIL | Must hold the GIL | py-spy samples from outside the process |
When cProfile Lies
cProfile adds a fixed cost per function call. This disproportionately inflates functions that make many cheap calls:
import cProfile
import time
def many_small_calls():
"""1 million tiny function calls."""
def tiny():
pass
for _ in range(1_000_000):
tiny()
def few_large_calls():
"""10 calls that each do real work."""
def heavy():
total = 0
for i in range(100_000):
total += i * i
return total
for _ in range(10):
heavy()
# Without profiling
start = time.perf_counter()
many_small_calls()
t1 = time.perf_counter() - start
start = time.perf_counter()
few_large_calls()
t2 = time.perf_counter() - start
print(f"Without profiling:")
print(f" many_small_calls: {t1:.3f}s")
print(f" few_large_calls: {t2:.3f}s")
# With profiling - many_small_calls gets disproportionately slower
print(f"\nWith cProfile:")
cProfile.run('many_small_calls()')
cProfile.run('few_large_calls()')
:::danger Profiling Overhead Can Change Your Conclusions
If function A makes 1 million calls to tiny helpers and function B makes 10 calls to heavy workers, cProfile may report A as slower even if B dominates real wall-clock time. Always verify critical findings with a plain time.perf_counter() measurement.
:::
Measuring Profiling Overhead
import cProfile
import time
import statistics
def workload():
"""A realistic workload to measure overhead against."""
data = list(range(10_000))
for _ in range(100):
sorted_data = sorted(data)
filtered = [x for x in sorted_data if x % 2 == 0]
total = sum(filtered)
return total
# Measure without profiling
times_bare = []
for _ in range(10):
start = time.perf_counter()
workload()
times_bare.append(time.perf_counter() - start)
# Measure with profiling
times_profiled = []
for _ in range(10):
profiler = cProfile.Profile()
start = time.perf_counter()
profiler.enable()
workload()
profiler.disable()
times_profiled.append(time.perf_counter() - start)
bare_median = statistics.median(times_bare)
prof_median = statistics.median(times_profiled)
overhead = (prof_median - bare_median) / bare_median * 100
print(f"Bare median: {bare_median:.4f}s")
print(f"Profiled median: {prof_median:.4f}s")
print(f"Overhead: {overhead:.1f}%")
# Typical output: Overhead: 15-40% depending on call density
Part 7 - Real-World: Profiling a FastAPI Endpoint
Profiling web endpoints requires special patterns because the framework handles routing, middleware, and async dispatch.
Profiling Middleware
import cProfile
import pstats
import io
from fastapi import FastAPI, Request
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.responses import Response
app = FastAPI()
class ProfilingMiddleware(BaseHTTPMiddleware):
"""
Add ?profile=1 to any request to get cProfile output in server logs.
WARNING: Never enable in production without authentication.
"""
async def dispatch(self, request: Request, call_next):
if request.query_params.get("profile") != "1":
return await call_next(request)
profiler = cProfile.Profile()
profiler.enable()
response = await call_next(request)
profiler.disable()
# Format and log the profile
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.strip_dirs()
stats.sort_stats('cumulative')
stats.print_stats(30)
print(stream.getvalue())
# Add summary to response headers
response.headers["X-Profile-Calls"] = str(stats.total_calls)
response.headers["X-Profile-Time"] = f"{stats.total_tt:.4f}s"
return response
# Only add in development
import os
if os.getenv("ENABLE_PROFILING") == "1":
app.add_middleware(ProfilingMiddleware)
Endpoint-Level Profiling Decorator
import cProfile
import pstats
import io
import functools
import logging
logger = logging.getLogger(__name__)
def profile_endpoint(sort_by='cumulative', lines=20):
"""Decorator to profile a specific async endpoint handler."""
def decorator(func):
@functools.wraps(func)
async def wrapper(*args, **kwargs):
profiler = cProfile.Profile()
profiler.enable()
try:
result = await func(*args, **kwargs)
finally:
profiler.disable()
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.strip_dirs()
stats.sort_stats(sort_by)
stats.print_stats(lines)
logger.info(f"Profile for {func.__name__}:\n{stream.getvalue()}")
return result
return wrapper
return decorator
# Usage - add temporarily to investigate a slow endpoint
@app.get("/api/courses")
@profile_endpoint(sort_by='tottime', lines=15)
async def list_courses():
courses = await db.fetch_all(select(Course))
return [course.to_dict() for course in courses]
Diagnosing the N+1 Query Problem
The most common performance bug in web applications, and cProfile reveals it clearly:
# Profile output for a slow /api/courses endpoint:
#
# ncalls tottime cumtime filename:lineno(function)
# 1 0.000 2.550 views.py:10(list_courses)
# 500 0.010 2.500 serializers.py:5(serialize_course)
# 500 0.005 2.300 models.py:20(get_instructor)
# 501 2.200 2.200 asyncpg/connection.py:100(execute)
# 500 0.080 0.080 asyncpg/connection.py:150(fetchrow)
#
# Diagnosis:
# - 501 execute calls = 1 query for courses + 500 queries for instructors
# - Each serialize_course calls get_instructor which hits the DB
# - This is the classic N+1 problem
# Fix: eager load the relationship
# Before: SELECT * FROM courses; then 500x SELECT * FROM instructors WHERE id=?
# After: SELECT * FROM courses JOIN instructors ON courses.instructor_id = instructors.id
#
# Post-fix profile:
# ncalls tottime cumtime filename:lineno(function)
# 1 0.000 0.065 views.py:10(list_courses)
# 500 0.008 0.010 serializers.py:5(serialize_course)
# 1 0.050 0.050 asyncpg/connection.py:100(execute)
#
# 501 queries → 1 query. 2.55s → 0.065s. A 39x speedup.
Profiling in Automated Tests
import cProfile
import pstats
import pytest
from pathlib import Path
PROFILE_DIR = Path("profiles")
@pytest.fixture
def profile_test(request):
"""pytest fixture that profiles the test and saves results."""
PROFILE_DIR.mkdir(exist_ok=True)
profiler = cProfile.Profile()
profiler.enable()
yield profiler
profiler.disable()
test_name = request.node.name
prof_file = PROFILE_DIR / f"{test_name}.prof"
profiler.dump_stats(str(prof_file))
# Print summary
stats = pstats.Stats(profiler)
stats.strip_dirs()
stats.sort_stats('cumulative')
stats.print_stats(10)
# Fail test if too slow
if stats.total_tt > 5.0:
pytest.fail(f"{test_name} took {stats.total_tt:.2f}s (budget: 5.0s)")
# Usage
def test_course_listing(client, profile_test):
response = client.get("/api/courses")
assert response.status_code == 200
assert len(response.json()) > 0
Part 8 - Building a Reusable Profiling Toolkit
Here is a production-quality toolkit that wraps the patterns from this lesson:
import cProfile
import pstats
import io
import time
from dataclasses import dataclass, field
from typing import Optional, Callable, Any
from pathlib import Path
@dataclass
class ProfileReport:
"""Structured profiling report for programmatic analysis."""
function_name: str
total_calls: int
total_time: float
wall_time: float
overhead_pct: float
top_tottime: str
top_cumtime: str
def summary(self) -> str:
return (
f"Profile: {self.function_name}\n"
f" Total calls: {self.total_calls:,}\n"
f" CPU time: {self.total_time:.4f}s\n"
f" Wall time: {self.wall_time:.4f}s\n"
f" Overhead: ~{self.overhead_pct:.1f}%\n"
)
def profile(func: Callable, *args: Any,
save_to: Optional[str] = None,
top_n: int = 15,
**kwargs: Any) -> tuple[Any, ProfileReport]:
"""
Profile a function call and return (result, ProfileReport).
Usage:
result, report = profile(my_function, arg1, arg2)
print(report.summary())
"""
# Measure wall time for overhead estimation
wall_start = time.perf_counter()
profiler = cProfile.Profile()
profiler.enable()
result = func(*args, **kwargs)
profiler.disable()
wall_time = time.perf_counter() - wall_start
if save_to:
profiler.dump_stats(save_to)
# Extract tottime-sorted output
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.strip_dirs()
stats.sort_stats('tottime')
stats.print_stats(top_n)
top_tottime = stream.getvalue()
# Extract cumtime-sorted output
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.strip_dirs()
stats.sort_stats('cumulative')
stats.print_stats(top_n)
top_cumtime = stream.getvalue()
# Estimate overhead
overhead = max(0, (wall_time - stats.total_tt) / stats.total_tt * 100)
report = ProfileReport(
function_name=func.__name__,
total_calls=stats.total_calls,
total_time=stats.total_tt,
wall_time=wall_time,
overhead_pct=overhead,
top_tottime=top_tottime,
top_cumtime=top_cumtime,
)
return result, report
def compare_profiles(baseline_file: str, current_file: str,
threshold: float = 0.10) -> dict:
"""
Compare two saved profiles and report regressions.
Returns a dict with comparison metrics and any violations.
"""
baseline = pstats.Stats(baseline_file)
current = pstats.Stats(current_file)
time_change = (current.total_tt - baseline.total_tt) / baseline.total_tt
call_change = (current.total_calls - baseline.total_calls) / baseline.total_calls
result = {
"baseline_time": baseline.total_tt,
"current_time": current.total_tt,
"time_change_pct": time_change * 100,
"baseline_calls": baseline.total_calls,
"current_calls": current.total_calls,
"call_change_pct": call_change * 100,
"violations": [],
}
if time_change > threshold:
result["violations"].append(
f"Time increased by {time_change:.1%} "
f"({baseline.total_tt:.3f}s -> {current.total_tt:.3f}s)"
)
if call_change > threshold:
result["violations"].append(
f"Call count increased by {call_change:.1%} "
f"({baseline.total_calls} -> {current.total_calls})"
)
return result
Key Takeaways
- cProfile is deterministic: it traces every function call and return via
PyEval_SetProfile. This gives exact call counts but adds per-call overhead. tottimeis where CPU burns;cumtimeis total cost: whencumtimeis high buttottimeis low, the function itself is fast - it calls something slow. Follow the chain down.- pstats is the analysis layer: use
sort_stats,print_stats,print_callers, andprint_calleesto navigate from symptom to root cause. - snakeviz makes profiles visual: the icicle view shows time proportions at a glance. Wide rectangles deep in the stack are your optimization targets.
- Profiling overhead distorts timing: functions with millions of tiny calls appear disproportionately slow under cProfile. Always cross-check with wall-clock timing.
- Profile in tests: automate profiling in your test suite to catch regressions before they reach production.
- Never expose profiling in production without authentication: it leaks internal architecture and can be used as a DoS vector.
Graded Practice Challenges
Level 1 - Predict the Output
Question 1: Given this profile output, which function should you optimize first and why?
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 5.230 5.230 main.py:1(main)
100 0.002 0.000 5.100 0.051 main.py:10(process_batch)
10000 0.050 0.000 5.000 0.001 main.py:20(process_item)
10000 4.900 0.000 4.900 0.000 main.py:30(compute_hash)
10000 0.040 0.000 0.040 0.000 main.py:40(validate)
Answer
Optimize compute_hash (line 30). It has the highest tottime at 4.900s out of 5.230s total - it consumes 93.7% of total execution time. By Amdahl's Law, this is the only function where optimization will have meaningful impact.
process_batch has high cumtime (5.100s) but near-zero tottime (0.002s) - it is just a loop that delegates to process_item. Optimizing process_batch itself would save 0.002s.
Question 2: What does ncalls = 120/3 mean in cProfile output?
Answer
It means the function was called 120 times total, of which 3 were primitive (non-recursive) calls. The function is recursive: it was entered from external code 3 times, and within those 3 invocations it called itself 117 additional times.
Question 3: You profile a function and see it takes 2.0s under cProfile. You time it with time.perf_counter() and it takes 0.8s. What explains the 1.2s discrepancy?
Answer
The 1.2s difference is profiling overhead. cProfile adds a fixed cost to every function call and return. If the function makes many internal function calls (e.g., calling helpers in a tight loop), each call incurs a profiling callback cost that accumulates.
The true execution time is closer to 0.8s. This is a fundamental limitation of deterministic profiling. For call-heavy code, consider a statistical profiler like py-spy or pyinstrument.
Level 2 - Debug Challenge
This profiling code is supposed to find the slowest endpoint, but it produces incorrect results. Find and fix all the bugs:
import cProfile
import pstats
def profile_endpoints(app, endpoints):
"""Profile each endpoint and rank by total time."""
results = {}
profiler = cProfile.Profile()
for endpoint in endpoints:
profiler.enable()
app.test_client().get(endpoint)
profiler.disable()
stats = pstats.Stats(profiler)
results[endpoint] = stats.total_tt
ranked = sorted(results.items(), key=lambda x: x[1])
for endpoint, t in ranked:
print(f"{endpoint}: {t:.4f}s")
Answer
Three bugs:
-
Reused profiler: The same
cProfile.Profile()instance accumulates stats across all endpoints. Each endpoint'stotal_ttincludes all previous endpoints. Create a new profiler for each endpoint. -
Sorted ascending:
sorteddefaults to ascending order, showing the fastest first. For finding the slowest, addreverse=True. -
Missing
strip_dirs: Minor readability issue - function names include full file paths.
Fixed version:
def profile_endpoints(app, endpoints):
"""Profile each endpoint and rank by total time."""
results = {}
for endpoint in endpoints:
profiler = cProfile.Profile() # Fresh profiler per endpoint
profiler.enable()
app.test_client().get(endpoint)
profiler.disable()
stats = pstats.Stats(profiler)
results[endpoint] = stats.total_tt
ranked = sorted(results.items(), key=lambda x: x[1], reverse=True)
for endpoint, t in ranked:
print(f"{endpoint}: {t:.4f}s")
Level 3 - Design Challenge
You are on-call and a production FastAPI service has degraded from p50=50ms to p50=800ms. You cannot attach a debugger, cannot add significant overhead, and cannot deploy new code immediately. Design a profiling strategy that:
- Identifies the bottleneck without taking down the service
- Works with async endpoints
- Produces actionable data within 30 minutes
- Does not require a code deployment
Describe your approach, tools, and the first three commands you would run.
Solution Sketch
Step 1: Triage (5 minutes)
- Check APM dashboards for which endpoint(s) regressed
- Check database slow query logs
- Check resource metrics: CPU, memory, connection pool, event loop lag
Step 2: Statistical profiling with py-spy (10 minutes) py-spy attaches to a running process without code changes:
# See live top-like view of where time is spent
py-spy top --pid $(pgrep -f uvicorn)
# Record a flame graph over 60 seconds
py-spy record -o profile.svg --pid $(pgrep -f uvicorn) --duration 60
# Sample at higher frequency for more accuracy
py-spy record -o profile.svg --pid $(pgrep -f uvicorn) --rate 200
Step 3: Interpret the flame graph (10 minutes)
- Look for wide plateaus - those are where time is spent
- If dominated by
socket.recvorasyncpg.execute- the database is the bottleneck - If dominated by CPU-bound Python code - identify the function and trace to source
Step 4: Confirm and fix (5 minutes)
- Reproduce locally with production-like data
- Profile locally with cProfile for exact call counts
- Deploy the fix via normal CI/CD
Why not cProfile?
- Requires code changes (deployment)
- 2-5x overhead could turn p50=800ms into p50=2000ms, making the outage worse
- py-spy operates externally with ~1% overhead
What's Next
cProfile tells you which functions are slow, but not which lines within those functions. In line_profiler and memory_profiler, you will learn to profile at the line level - seeing exactly which statement consumes the most time or allocates the most memory.
