What is python cProfile?

Master deterministic profiling with cProfile and pstats - reading profile output, sorting and filtering results, snakeviz visualization, profiling overhead, and real-world endpoint profiling.

How does python pstats work in practice?

cProfile and pstats - Function-Level Profiling covers python cProfile, python pstats, function level profiling from first principles with code examples. Free lesson at https://engineersofai.com/docs/python/python-advanced/performance-engineering/cprofile-and-pstats

What is the difference between python cProfile and function level profiling?

See the full breakdown at https://engineersofai.com/docs/python/python-advanced/performance-engineering/cprofile-and-pstats

cProfile and pstats - Function-Level Profiling

Before reading further, predict the output of this profiling session:

import cProfile

def outer():
    for _ in range(1000):
        inner()

def inner():
    return sum(range(100))

cProfile.run('outer()')

How many function calls will cProfile report? What will have the highest tottime? What will have the highest cumtime?

The answer: cProfile reports approximately 4,005 function calls:

outer() called once
inner() called 1,000 times
sum() called 1,000 times (inside inner)
range() called 1,000 times (inside inner)
Plus the top-level <module> call and a few internal calls

         4005 function calls in 0.012 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.012    0.012 <string>:1(<module>)
        1    0.002    0.002    0.011    0.011 example.py:3(outer)
     1000    0.003    0.000    0.009    0.000 example.py:7(inner)
     1000    0.004    0.000    0.004    0.000 {built-in method builtins.sum}
     1000    0.002    0.000    0.002    0.000 {built-in method builtins.range}
        1    0.000    0.000    0.012    0.012 {built-in method builtins.exec}

Notice: inner has high cumtime (includes sum + range time) but outer has low tottime (its own code just loops and calls). This distinction between tottime and cumtime is the single most important concept in this lesson.

What You Will Learn

How cProfile works internally (deterministic tracing via PyEval_SetProfile)
The precise meaning of every column in cProfile output
How to use pstats to sort, filter, and drill into profiles
How to visualize profiles with snakeviz
The difference between deterministic and statistical profiling
How to measure and account for profiling overhead
How to profile a real FastAPI endpoint end-to-end

Prerequisites

Completed Lesson 1 (Profiling Strategy - Amdahl's Law, benchmarking methodology)
Understanding of the CPython call stack (from Intermediate course)
Basic familiarity with function call overhead in Python

Part 1 - How cProfile Works

cProfile is a deterministic profiler. It instruments every function call and return by registering a C-level callback via PyEval_SetProfile. Every time CPython enters or exits a function, cProfile's callback fires and records a timestamp.

Because it traces every call, cProfile's overhead is proportional to the number of function calls, not the total execution time. A tight loop calling a trivial function millions of times will show significant profiling overhead.

cProfile vs profile

Python ships two profiling modules:

Feature	`cProfile`	`profile`
Implementation	C extension	Pure Python
Overhead	~2-5x slowdown	~10-50x slowdown
Use case	Production profiling	When you need to subclass the profiler
Availability	CPython only	All Python implementations

Always use cProfile unless you have a specific reason to customize the profiler class.

Part 2 - Reading Profile Output

Every column in cProfile output has a precise meaning. Getting these wrong leads to incorrect optimization decisions.

import cProfile
import pstats
import io

def simulate_workload():
    """A function with multiple call patterns to illustrate profile columns."""

    def parse_records(data):
        """Called many times, each call is cheap."""
        return [item.strip() for item in data]

    def validate_batch(records):
        """Called once per batch, does moderate work."""
        for record in records:
            if not record:
                raise ValueError("Empty record")
        return True

    def expensive_transform(records):
        """Called once, does heavy computation."""
        result = []
        for record in records:
            transformed = record.upper()
            for _ in range(100):
                transformed = transformed.swapcase()
            result.append(transformed)
        return result

    data = [f"  record_{i}  " for i in range(500)]
    for _ in range(200):
        parsed = parse_records(data)
    validate_batch(parsed)
    expensive_transform(parsed)

# Profile it
profiler = cProfile.Profile()
profiler.enable()
simulate_workload()
profiler.disable()

# Print stats
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats('cumulative')
stats.print_stats(10)
print(stream.getvalue())

Column Reference

Column	Meaning	What it tells you
`ncalls`	Number of times this function was called	High ncalls with high tottime = optimization target
`tottime`	Total time spent in this function only (excludes sub-calls)	Where CPU cycles actually burn
`percall` (1st)	`tottime / ncalls`	Cost per invocation (excluding sub-calls)
`cumtime`	Cumulative time in this function including all sub-calls	Total wall-clock cost of calling this function
`percall` (2nd)	`cumtime / ncalls`	Total cost per invocation (including sub-calls)
`filename:lineno(function)`	Where the function is defined	Navigation to source

:::danger tottime vs cumtime - The Critical Distinction

tottime answers: "How much time did this function's own code consume?"
cumtime answers: "How much time passed between entering and leaving this function?"

If a function has high cumtime but low tottime, it is not slow itself - it calls something slow. Follow the call chain down until you find the function where tottime is high. That is your real bottleneck. :::

The `ncalls` Column: Recursive Calls

When a function is recursive, ncalls shows two numbers separated by a slash:

ncalls  tottime  ...  filename:lineno(function)
 120/1    0.003  ...  example.py:10(recursive_func)

This means: 120 total calls, but only 1 primitive (non-recursive) call. The function entered itself 119 times from that single entry point.

Interpreting a Real Profile

# A realistic profile might look like:
#
#    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
#         1    0.000    0.000    3.450    3.450 views.py:15(course_list)
#       500    0.010    0.000    3.200    0.006 serializers.py:30(serialize)
#       500    0.005    0.000    2.800    0.006 models.py:50(to_dict)
#     10000    2.700    0.000    2.700    0.000 {method 'execute' of 'cursor'}
#       500    0.080    0.000    0.080    0.000 {method 'fetchall' of 'cursor'}
#
# Reading this profile:
# 1. course_list has 3.45s cumtime but 0.000s tottime → it's a dispatcher
# 2. serialize is called 500 times → one per course
# 3. The real time sink: cursor.execute at 2.7s tottime for 10,000 calls
# 4. 10,000 queries for 500 courses = 20 queries per course = N+1 problem

Part 3 - Three Ways to Use cProfile

Method 1: Command Line

The simplest way to profile any Python script:

# Profile a script, output sorted by cumulative time
python -m cProfile -s cumulative my_script.py

# Sort by total time (where CPU actually burns)
python -m cProfile -s tottime my_script.py

# Save profile data to a file for later analysis
python -m cProfile -o profile_output.prof my_script.py

Method 2: Programmatic Wrapping

For profiling specific functions within a larger application:

import cProfile

def profile_function(func, *args, **kwargs):
    """Profile a single function call and return results + profiler."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args, **kwargs)
    profiler.disable()
    return result, profiler

# Usage
result, prof = profile_function(process_dataset, raw_data)

# Analyze
import pstats
stats = pstats.Stats(prof)
stats.strip_dirs()
stats.sort_stats('tottime')
stats.print_stats(20)

Method 3: Context Manager

For profiling arbitrary code blocks:

import cProfile
import pstats
from contextlib import contextmanager

@contextmanager
def profiled(sort_by='cumulative', lines=20):
    """Context manager for profiling a block of code."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        yield profiler
    finally:
        profiler.disable()
        stats = pstats.Stats(profiler)
        stats.strip_dirs()
        stats.sort_stats(sort_by)
        stats.print_stats(lines)

# Usage
with profiled(sort_by='tottime', lines=10):
    result = process_large_dataset(data)

# Or capture the profiler for later analysis
with profiled() as prof:
    train_model(X, y)
# prof is available here for further analysis

Method 4: Decorator

import cProfile
import pstats
import functools

def profile_decorator(output_file=None, sort_by='cumulative', lines=20):
    """Decorator that profiles the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            profiler = cProfile.Profile()
            profiler.enable()
            try:
                result = func(*args, **kwargs)
            finally:
                profiler.disable()
                stats = pstats.Stats(profiler)
                stats.strip_dirs()
                stats.sort_stats(sort_by)
                stats.print_stats(lines)
                if output_file:
                    profiler.dump_stats(output_file)
            return result
        return wrapper
    return decorator

# Usage
@profile_decorator(output_file='train.prof', sort_by='tottime')
def train_model(X, y, epochs=100):
    # ... training logic ...
    pass

Part 4 - pstats for Deep Analysis

Raw cProfile output is overwhelming for real applications with hundreds of functions. pstats is the analysis layer that lets you slice and dice profile data.

Sorting

import cProfile
import pstats

# Assume we have a profiler with data
profiler = cProfile.Profile()
profiler.enable()
# ... workload ...
profiler.disable()

stats = pstats.Stats(profiler)
stats.strip_dirs()  # Remove long path prefixes for readability

# Sort by different criteria
stats.sort_stats('tottime')       # Where CPU actually burns
stats.print_stats(10)

stats.sort_stats('cumulative')    # Total cost including sub-calls
stats.print_stats(10)

stats.sort_stats('calls')         # Most frequently called
stats.print_stats(10)

Valid sort keys:

Sort Key	Meaning
`'calls'` or `'ncalls'`	Number of calls
`'cumulative'` or `'cumtime'`	Cumulative time
`'tottime'`	Total time in function
`'filename'`	File name
`'line'`	Line number
`'name'`	Function name
`'nfl'`	Name/file/line
`'stdname'`	Standard name (default)

Filtering

# Show only functions matching a string pattern
stats.sort_stats('cumulative')
stats.print_stats('database')       # Only functions with 'database' in the name
stats.print_stats('models.py')      # Only functions in models.py

# Limit output count
stats.print_stats(20)               # Top 20 functions
stats.print_stats(0.1)              # Top 10% of functions

# Combine: top 10 database-related functions by cumulative time
stats.sort_stats('cumulative')
stats.print_stats('database', 10)

Caller/Callee Analysis

This is where pstats becomes truly powerful. You can trace the call graph:

# Who calls the slow function?
stats.print_callers('execute')
# Output:
#   Function                    was called by...
#   {cursor.execute}            <- models.py:50(to_dict) 10000 times
#                               <- views.py:20(get_stats) 5 times

# What does the slow function call?
stats.print_callees('course_list')
# Output:
#   Function                    called...
#   views.py:15(course_list)    -> serializers.py:30(serialize) 500 times
#                               -> models.py:10(get_all) 1 time

Practical pstats Workflow

import cProfile
import pstats

def analyze_profile(prof_file: str):
    """
    A systematic workflow for analyzing a saved profile.

    Step 1: Get the big picture
    Step 2: Find the hotspot
    Step 3: Trace the call chain
    Step 4: Identify the root cause
    """
    stats = pstats.Stats(prof_file)
    stats.strip_dirs()

    # Step 1: Big picture - how many calls, total time
    print("=" * 60)
    print("STEP 1: Big Picture")
    print(f"Total calls: {stats.total_calls}")
    print(f"Total time:  {stats.total_tt:.4f}s")
    print()

    # Step 2: Top functions by tottime (where CPU burns)
    print("=" * 60)
    print("STEP 2: Top 10 by tottime (CPU hotspots)")
    stats.sort_stats('tottime')
    stats.print_stats(10)

    # Step 3: Top functions by cumtime (most expensive call trees)
    print("=" * 60)
    print("STEP 3: Top 10 by cumtime (expensive call trees)")
    stats.sort_stats('cumulative')
    stats.print_stats(10)

    # Step 4: Most-called functions (potential for caching)
    print("=" * 60)
    print("STEP 4: Top 10 by call count (caching opportunities)")
    stats.sort_stats('calls')
    stats.print_stats(10)

    return stats  # Return for interactive exploration

# Usage:
# stats = analyze_profile('my_app.prof')
# stats.print_callers('the_slow_function')
# stats.print_callees('the_slow_function')

Saving and Merging Profiles

# Save profile for later analysis
profiler.dump_stats('my_profile.prof')

# Load and analyze later
stats = pstats.Stats('my_profile.prof')
stats.strip_dirs()
stats.sort_stats('cumulative')
stats.print_stats(20)

# Merge multiple profiles (e.g., from different test runs)
stats = pstats.Stats('run1.prof')
stats.add('run2.prof')
stats.add('run3.prof')
# Now stats contains aggregated data from all three runs
stats.sort_stats('cumulative')
stats.print_stats(20)

Part 5 - Visualization with snakeviz

Text-based profile output works for targeted analysis, but understanding the full call hierarchy of a complex application requires visualization.

# Install
pip install snakeviz

# Generate a profile
python -m cProfile -o output.prof my_script.py

# Visualize - opens in your browser
snakeviz output.prof

Programmatic Visualization

import cProfile
import subprocess
import sys

def profile_and_visualize(func, *args, output_file='profile.prof', **kwargs):
    """Profile a function and open snakeviz automatically."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args, **kwargs)
    profiler.disable()
    profiler.dump_stats(output_file)

    print(f"Profile saved to {output_file}")
    print("Opening snakeviz...")
    subprocess.Popen([sys.executable, '-m', 'snakeviz', output_file])

    return result

:::tip Reading snakeviz Icicle Charts In the icicle view (recommended for beginners), the topmost bar is the entry point. Each row below shows functions called by the row above. The width of each rectangle is proportional to cumulative time. Look for wide rectangles deep in the stack - those are your hotspots. Narrow rectangles at any depth are not worth optimizing. :::

Alternative Visualization: gprof2dot

For static call-graph images (useful in documentation or CI reports):

pip install gprof2dot graphviz

# Generate a dot graph from profile data
gprof2dot -f pstats profile.prof | dot -Tpng -o call_graph.png

# Or SVG for interactive viewing
gprof2dot -f pstats profile.prof | dot -Tsvg -o call_graph.svg

Part 6 - Deterministic vs Statistical Profiling

cProfile is not the only profiling approach. Understanding the trade-offs helps you choose the right tool.

Aspect	Deterministic (cProfile)	Statistical (py-spy, pyinstrument)
Mechanism	Callback on every call/return	Periodic stack sampling
Overhead	2-5x (proportional to call count)	1-5% (constant, independent of calls)
Accuracy	Exact call counts and times	Approximate (sampling error)
Best for	Development profiling, exact counts	Production profiling, always-on
Blind spots	Distorts timing of call-heavy code	Misses very short functions
GIL	Must hold the GIL	py-spy samples from outside the process

When cProfile Lies

cProfile adds a fixed cost per function call. This disproportionately inflates functions that make many cheap calls:

import cProfile
import time

def many_small_calls():
    """1 million tiny function calls."""
    def tiny():
        pass
    for _ in range(1_000_000):
        tiny()

def few_large_calls():
    """10 calls that each do real work."""
    def heavy():
        total = 0
        for i in range(100_000):
            total += i * i
        return total
    for _ in range(10):
        heavy()

# Without profiling
start = time.perf_counter()
many_small_calls()
t1 = time.perf_counter() - start

start = time.perf_counter()
few_large_calls()
t2 = time.perf_counter() - start

print(f"Without profiling:")
print(f"  many_small_calls: {t1:.3f}s")
print(f"  few_large_calls:  {t2:.3f}s")

# With profiling - many_small_calls gets disproportionately slower
print(f"\nWith cProfile:")
cProfile.run('many_small_calls()')
cProfile.run('few_large_calls()')

:::danger Profiling Overhead Can Change Your Conclusions If function A makes 1 million calls to tiny helpers and function B makes 10 calls to heavy workers, cProfile may report A as slower even if B dominates real wall-clock time. Always verify critical findings with a plain time.perf_counter() measurement. :::

Measuring Profiling Overhead

import cProfile
import time
import statistics

def workload():
    """A realistic workload to measure overhead against."""
    data = list(range(10_000))
    for _ in range(100):
        sorted_data = sorted(data)
        filtered = [x for x in sorted_data if x % 2 == 0]
        total = sum(filtered)
    return total

# Measure without profiling
times_bare = []
for _ in range(10):
    start = time.perf_counter()
    workload()
    times_bare.append(time.perf_counter() - start)

# Measure with profiling
times_profiled = []
for _ in range(10):
    profiler = cProfile.Profile()
    start = time.perf_counter()
    profiler.enable()
    workload()
    profiler.disable()
    times_profiled.append(time.perf_counter() - start)

bare_median = statistics.median(times_bare)
prof_median = statistics.median(times_profiled)
overhead = (prof_median - bare_median) / bare_median * 100

print(f"Bare median:     {bare_median:.4f}s")
print(f"Profiled median: {prof_median:.4f}s")
print(f"Overhead:        {overhead:.1f}%")
# Typical output: Overhead: 15-40% depending on call density

Part 7 - Real-World: Profiling a FastAPI Endpoint

Profiling web endpoints requires special patterns because the framework handles routing, middleware, and async dispatch.

Profiling Middleware

import cProfile
import pstats
import io
from fastapi import FastAPI, Request
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.responses import Response

app = FastAPI()

class ProfilingMiddleware(BaseHTTPMiddleware):
    """
    Add ?profile=1 to any request to get cProfile output in server logs.

    WARNING: Never enable in production without authentication.
    """

    async def dispatch(self, request: Request, call_next):
        if request.query_params.get("profile") != "1":
            return await call_next(request)

        profiler = cProfile.Profile()
        profiler.enable()
        response = await call_next(request)
        profiler.disable()

        # Format and log the profile
        stream = io.StringIO()
        stats = pstats.Stats(profiler, stream=stream)
        stats.strip_dirs()
        stats.sort_stats('cumulative')
        stats.print_stats(30)
        print(stream.getvalue())

        # Add summary to response headers
        response.headers["X-Profile-Calls"] = str(stats.total_calls)
        response.headers["X-Profile-Time"] = f"{stats.total_tt:.4f}s"

        return response

# Only add in development
import os
if os.getenv("ENABLE_PROFILING") == "1":
    app.add_middleware(ProfilingMiddleware)

Endpoint-Level Profiling Decorator

import cProfile
import pstats
import io
import functools
import logging

logger = logging.getLogger(__name__)

def profile_endpoint(sort_by='cumulative', lines=20):
    """Decorator to profile a specific async endpoint handler."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            profiler = cProfile.Profile()
            profiler.enable()
            try:
                result = await func(*args, **kwargs)
            finally:
                profiler.disable()
                stream = io.StringIO()
                stats = pstats.Stats(profiler, stream=stream)
                stats.strip_dirs()
                stats.sort_stats(sort_by)
                stats.print_stats(lines)
                logger.info(f"Profile for {func.__name__}:\n{stream.getvalue()}")
            return result
        return wrapper
    return decorator

# Usage - add temporarily to investigate a slow endpoint
@app.get("/api/courses")
@profile_endpoint(sort_by='tottime', lines=15)
async def list_courses():
    courses = await db.fetch_all(select(Course))
    return [course.to_dict() for course in courses]

Diagnosing the N+1 Query Problem

The most common performance bug in web applications, and cProfile reveals it clearly:

# Profile output for a slow /api/courses endpoint:
#
# ncalls  tottime  cumtime  filename:lineno(function)
#      1    0.000    2.550  views.py:10(list_courses)
#    500    0.010    2.500  serializers.py:5(serialize_course)
#    500    0.005    2.300  models.py:20(get_instructor)
#    501    2.200    2.200  asyncpg/connection.py:100(execute)
#    500    0.080    0.080  asyncpg/connection.py:150(fetchrow)
#
# Diagnosis:
# - 501 execute calls = 1 query for courses + 500 queries for instructors
# - Each serialize_course calls get_instructor which hits the DB
# - This is the classic N+1 problem

# Fix: eager load the relationship
# Before: SELECT * FROM courses; then 500x SELECT * FROM instructors WHERE id=?
# After:  SELECT * FROM courses JOIN instructors ON courses.instructor_id = instructors.id
#
# Post-fix profile:
# ncalls  tottime  cumtime  filename:lineno(function)
#      1    0.000    0.065  views.py:10(list_courses)
#    500    0.008    0.010  serializers.py:5(serialize_course)
#      1    0.050    0.050  asyncpg/connection.py:100(execute)
#
# 501 queries → 1 query. 2.55s → 0.065s. A 39x speedup.

Profiling in Automated Tests

import cProfile
import pstats
import pytest
from pathlib import Path

PROFILE_DIR = Path("profiles")

@pytest.fixture
def profile_test(request):
    """pytest fixture that profiles the test and saves results."""
    PROFILE_DIR.mkdir(exist_ok=True)
    profiler = cProfile.Profile()
    profiler.enable()

    yield profiler

    profiler.disable()
    test_name = request.node.name
    prof_file = PROFILE_DIR / f"{test_name}.prof"
    profiler.dump_stats(str(prof_file))

    # Print summary
    stats = pstats.Stats(profiler)
    stats.strip_dirs()
    stats.sort_stats('cumulative')
    stats.print_stats(10)

    # Fail test if too slow
    if stats.total_tt > 5.0:
        pytest.fail(f"{test_name} took {stats.total_tt:.2f}s (budget: 5.0s)")

# Usage
def test_course_listing(client, profile_test):
    response = client.get("/api/courses")
    assert response.status_code == 200
    assert len(response.json()) > 0

Part 8 - Building a Reusable Profiling Toolkit

Here is a production-quality toolkit that wraps the patterns from this lesson:

import cProfile
import pstats
import io
import time
from dataclasses import dataclass, field
from typing import Optional, Callable, Any
from pathlib import Path

@dataclass
class ProfileReport:
    """Structured profiling report for programmatic analysis."""
    function_name: str
    total_calls: int
    total_time: float
    wall_time: float
    overhead_pct: float
    top_tottime: str
    top_cumtime: str

    def summary(self) -> str:
        return (
            f"Profile: {self.function_name}\n"
            f"  Total calls: {self.total_calls:,}\n"
            f"  CPU time:    {self.total_time:.4f}s\n"
            f"  Wall time:   {self.wall_time:.4f}s\n"
            f"  Overhead:    ~{self.overhead_pct:.1f}%\n"
        )


def profile(func: Callable, *args: Any,
            save_to: Optional[str] = None,
            top_n: int = 15,
            **kwargs: Any) -> tuple[Any, ProfileReport]:
    """
    Profile a function call and return (result, ProfileReport).

    Usage:
        result, report = profile(my_function, arg1, arg2)
        print(report.summary())
    """
    # Measure wall time for overhead estimation
    wall_start = time.perf_counter()
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args, **kwargs)
    profiler.disable()
    wall_time = time.perf_counter() - wall_start

    if save_to:
        profiler.dump_stats(save_to)

    # Extract tottime-sorted output
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.strip_dirs()
    stats.sort_stats('tottime')
    stats.print_stats(top_n)
    top_tottime = stream.getvalue()

    # Extract cumtime-sorted output
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.strip_dirs()
    stats.sort_stats('cumulative')
    stats.print_stats(top_n)
    top_cumtime = stream.getvalue()

    # Estimate overhead
    overhead = max(0, (wall_time - stats.total_tt) / stats.total_tt * 100)

    report = ProfileReport(
        function_name=func.__name__,
        total_calls=stats.total_calls,
        total_time=stats.total_tt,
        wall_time=wall_time,
        overhead_pct=overhead,
        top_tottime=top_tottime,
        top_cumtime=top_cumtime,
    )

    return result, report


def compare_profiles(baseline_file: str, current_file: str,
                     threshold: float = 0.10) -> dict:
    """
    Compare two saved profiles and report regressions.

    Returns a dict with comparison metrics and any violations.
    """
    baseline = pstats.Stats(baseline_file)
    current = pstats.Stats(current_file)

    time_change = (current.total_tt - baseline.total_tt) / baseline.total_tt
    call_change = (current.total_calls - baseline.total_calls) / baseline.total_calls

    result = {
        "baseline_time": baseline.total_tt,
        "current_time": current.total_tt,
        "time_change_pct": time_change * 100,
        "baseline_calls": baseline.total_calls,
        "current_calls": current.total_calls,
        "call_change_pct": call_change * 100,
        "violations": [],
    }

    if time_change > threshold:
        result["violations"].append(
            f"Time increased by {time_change:.1%} "
            f"({baseline.total_tt:.3f}s -> {current.total_tt:.3f}s)"
        )
    if call_change > threshold:
        result["violations"].append(
            f"Call count increased by {call_change:.1%} "
            f"({baseline.total_calls} -> {current.total_calls})"
        )

    return result

Key Takeaways

cProfile is deterministic: it traces every function call and return via PyEval_SetProfile. This gives exact call counts but adds per-call overhead.
tottime is where CPU burns; cumtime is total cost: when cumtime is high but tottime is low, the function itself is fast - it calls something slow. Follow the chain down.
pstats is the analysis layer: use sort_stats, print_stats, print_callers, and print_callees to navigate from symptom to root cause.
snakeviz makes profiles visual: the icicle view shows time proportions at a glance. Wide rectangles deep in the stack are your optimization targets.
Profiling overhead distorts timing: functions with millions of tiny calls appear disproportionately slow under cProfile. Always cross-check with wall-clock timing.
Profile in tests: automate profiling in your test suite to catch regressions before they reach production.
Never expose profiling in production without authentication: it leaks internal architecture and can be used as a DoS vector.

Graded Practice Challenges

Level 1 - Predict the Output

Question 1: Given this profile output, which function should you optimize first and why?

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  0.001    0.001    5.230    5.230 main.py:1(main)
  0.002    0.000    5.100    0.051 main.py:10(process_batch)
  0.050    0.000    5.000    0.001 main.py:20(process_item)
  4.900    0.000    4.900    0.000 main.py:30(compute_hash)
  0.040    0.000    0.040    0.000 main.py:40(validate)

Answer

Optimize compute_hash (line 30). It has the highest tottime at 4.900s out of 5.230s total - it consumes 93.7% of total execution time. By Amdahl's Law, this is the only function where optimization will have meaningful impact.

process_batch has high cumtime (5.100s) but near-zero tottime (0.002s) - it is just a loop that delegates to process_item. Optimizing process_batch itself would save 0.002s.

Question 2: What does ncalls = 120/3 mean in cProfile output?

Answer

It means the function was called 120 times total, of which 3 were primitive (non-recursive) calls. The function is recursive: it was entered from external code 3 times, and within those 3 invocations it called itself 117 additional times.

Question 3: You profile a function and see it takes 2.0s under cProfile. You time it with time.perf_counter() and it takes 0.8s. What explains the 1.2s discrepancy?

Answer

The 1.2s difference is profiling overhead. cProfile adds a fixed cost to every function call and return. If the function makes many internal function calls (e.g., calling helpers in a tight loop), each call incurs a profiling callback cost that accumulates.

The true execution time is closer to 0.8s. This is a fundamental limitation of deterministic profiling. For call-heavy code, consider a statistical profiler like py-spy or pyinstrument.

Level 2 - Debug Challenge

This profiling code is supposed to find the slowest endpoint, but it produces incorrect results. Find and fix all the bugs:

import cProfile
import pstats

def profile_endpoints(app, endpoints):
    """Profile each endpoint and rank by total time."""
    results = {}
    profiler = cProfile.Profile()

    for endpoint in endpoints:
        profiler.enable()
        app.test_client().get(endpoint)
        profiler.disable()

        stats = pstats.Stats(profiler)
        results[endpoint] = stats.total_tt

    ranked = sorted(results.items(), key=lambda x: x[1])
    for endpoint, t in ranked:
        print(f"{endpoint}: {t:.4f}s")

Answer

Three bugs:

Reused profiler: The same cProfile.Profile() instance accumulates stats across all endpoints. Each endpoint's total_tt includes all previous endpoints. Create a new profiler for each endpoint.
Sorted ascending: sorted defaults to ascending order, showing the fastest first. For finding the slowest, add reverse=True.
Missing strip_dirs: Minor readability issue - function names include full file paths.

Fixed version:

def profile_endpoints(app, endpoints):
    """Profile each endpoint and rank by total time."""
    results = {}

    for endpoint in endpoints:
        profiler = cProfile.Profile()  # Fresh profiler per endpoint
        profiler.enable()
        app.test_client().get(endpoint)
        profiler.disable()

        stats = pstats.Stats(profiler)
        results[endpoint] = stats.total_tt

    ranked = sorted(results.items(), key=lambda x: x[1], reverse=True)
    for endpoint, t in ranked:
        print(f"{endpoint}: {t:.4f}s")

Level 3 - Design Challenge

You are on-call and a production FastAPI service has degraded from p50=50ms to p50=800ms. You cannot attach a debugger, cannot add significant overhead, and cannot deploy new code immediately. Design a profiling strategy that:

Identifies the bottleneck without taking down the service
Works with async endpoints
Produces actionable data within 30 minutes
Does not require a code deployment

Describe your approach, tools, and the first three commands you would run.

Solution Sketch

Step 1: Triage (5 minutes)

Check APM dashboards for which endpoint(s) regressed
Check database slow query logs
Check resource metrics: CPU, memory, connection pool, event loop lag

Step 2: Statistical profiling with py-spy (10 minutes) py-spy attaches to a running process without code changes:

# See live top-like view of where time is spent
py-spy top --pid $(pgrep -f uvicorn)

# Record a flame graph over 60 seconds
py-spy record -o profile.svg --pid $(pgrep -f uvicorn) --duration 60

# Sample at higher frequency for more accuracy
py-spy record -o profile.svg --pid $(pgrep -f uvicorn) --rate 200

Step 3: Interpret the flame graph (10 minutes)

Look for wide plateaus - those are where time is spent
If dominated by socket.recv or asyncpg.execute - the database is the bottleneck
If dominated by CPU-bound Python code - identify the function and trace to source

Step 4: Confirm and fix (5 minutes)

Reproduce locally with production-like data
Profile locally with cProfile for exact call counts
Deploy the fix via normal CI/CD

Why not cProfile?

Requires code changes (deployment)
2-5x overhead could turn p50=800ms into p50=2000ms, making the outage worse
py-spy operates externally with ~1% overhead

What's Next

cProfile tells you which functions are slow, but not which lines within those functions. In line_profiler and memory_profiler, you will learn to profile at the line level - seeing exactly which statement consumes the most time or allocates the most memory.

What You Will Learn​

Prerequisites​

Part 1 - How cProfile Works​

cProfile vs profile​

Part 2 - Reading Profile Output​

Column Reference​

The ncalls Column: Recursive Calls​

Interpreting a Real Profile​

Part 3 - Three Ways to Use cProfile​

Method 1: Command Line​

Method 2: Programmatic Wrapping​

Method 3: Context Manager​

Method 4: Decorator​

Part 4 - pstats for Deep Analysis​

Sorting​

Filtering​

Caller/Callee Analysis​

Practical pstats Workflow​

Saving and Merging Profiles​

Part 5 - Visualization with snakeviz​

Programmatic Visualization​

Alternative Visualization: gprof2dot​

Part 6 - Deterministic vs Statistical Profiling​

When cProfile Lies​

Measuring Profiling Overhead​

Part 7 - Real-World: Profiling a FastAPI Endpoint​

Profiling Middleware​

Endpoint-Level Profiling Decorator​

Diagnosing the N+1 Query Problem​

Profiling in Automated Tests​

Part 8 - Building a Reusable Profiling Toolkit​

Key Takeaways​

Graded Practice Challenges​

Level 1 - Predict the Output​

Level 2 - Debug Challenge​

Level 3 - Design Challenge​

What's Next​

What You Will Learn

Prerequisites

Part 1 - How cProfile Works

cProfile vs profile

Part 2 - Reading Profile Output

Column Reference

The `ncalls` Column: Recursive Calls

Interpreting a Real Profile

Part 3 - Three Ways to Use cProfile

Method 1: Command Line

Method 2: Programmatic Wrapping

Method 3: Context Manager

Method 4: Decorator

Part 4 - pstats for Deep Analysis

Sorting

Filtering

Caller/Callee Analysis

Practical pstats Workflow

Saving and Merging Profiles

Part 5 - Visualization with snakeviz

Programmatic Visualization

Alternative Visualization: gprof2dot

Part 6 - Deterministic vs Statistical Profiling

When cProfile Lies

Measuring Profiling Overhead

Part 7 - Real-World: Profiling a FastAPI Endpoint

Profiling Middleware

Endpoint-Level Profiling Decorator

Diagnosing the N+1 Query Problem

Profiling in Automated Tests

Part 8 - Building a Reusable Profiling Toolkit

Key Takeaways

Graded Practice Challenges

Level 1 - Predict the Output

Level 2 - Debug Challenge

Level 3 - Design Challenge

What's Next