Python Profiling Strategy Practice Problems & Exercises

Practice: Profiling Strategy

11 problems4 Easy4 Medium3 Hard⏱ 60–90 min

Easy

#1timeit BasicsEasy

timeitbenchmarkingmicro-benchmark

Use timeit to compare the performance of a list comprehension vs map() for squaring numbers 0–999. Print which is faster.

import timeit

def bench_list_comp():
    return [x * x for x in range(1000)]

def bench_map():
    return list(map(lambda x: x * x, range(1000)))

n = 10000
lc_time = timeit.timeit(bench_list_comp, number=n)
map_time = timeit.timeit(bench_map, number=n)

# Print per-call times and which is faster
# Expected final line: "list comprehension is faster than map+list"

Solution

import timeit

def bench_list_comp():
    return [x * x for x in range(1000)]

def bench_map():
    return list(map(lambda x: x * x, range(1000)))

n = 10000
lc_time = timeit.timeit(bench_list_comp, number=n)
map_time = timeit.timeit(bench_map, number=n)

print(f"list comp:  {lc_time/n*1e6:.2f} us/call")
print(f"map+list:   {map_time/n*1e6:.2f} us/call")
winner = "list comprehension" if lc_time < map_time else "map+list"
other  = "map+list" if winner == "list comprehension" else "list comprehension"
print(f"{winner} is faster than {other}")

Key concepts:

number=10000 runs the function 10,000 times to average out OS scheduling noise.
Divide total time by number to get per-call time.
List comprehensions are typically faster than map + lambda because they use the specialized LIST_APPEND bytecode — no Python-level lambda dispatch.
Always run benchmarks multiple times; the first run may include interpreter warmup effects.

Expected Output

list comprehension is faster than map+list

Hints

Hint 1: Use timeit.timeit(fn, number=N) to run fn N times and return total elapsed seconds.

Hint 2: Divide by number to get per-call time, multiply by 1e6 to convert to microseconds.

#2time.perf_counter Manual TimingEasy

time.perf_countermanual-timingprofiling

Use time.perf_counter() to manually time how long it takes to build a dictionary of 100,000 squares.

import time

def build_squares(n):
    return {i: i * i for i in range(n)}

# Time build_squares(100_000) using perf_counter
# Print: "elapsed: X ms"  (round to 2 decimal places)

Solution

import time

def build_squares(n):
    return {i: i * i for i in range(n)}

start = time.perf_counter()
build_squares(100_000)
end = time.perf_counter()

elapsed_ms = (end - start) * 1000
print(f"elapsed: {elapsed_ms:.2f} ms")

Why perf_counter:

time.perf_counter() is the highest-resolution clock available on the platform.
It measures wall-clock time — includes I/O waits and OS scheduling.
For pure computation, wall time ≈ CPU time. For I/O-heavy code, use time.process_time() for CPU-only measurement.
perf_counter is monotonic — it never goes backwards, making it safe for elapsed-time calculations.

Expected Output

elapsed: X ms

Hints

Hint 1: time.perf_counter() returns a float of seconds with the highest available resolution.

Hint 2: Subtract the start value from the end value to get elapsed time.

#3Identify the HotspotEasy

profilinghotspotcProfilebottleneck

Given a main_work() function that calls both a fast and a slow sub-function, use cProfile to identify the hotspot.

import cProfile
import io
import pstats

def fast_work():
    return sum(range(100))

def slow_work():
    total = 0
    for i in range(50000):
        total += i * i
    return total

def main_work():
    results = [fast_work() for _ in range(10)]
    results.append(slow_work())
    return results

# Profile main_work, capture output, check if slow_work appears as hotspot
# Print: "Hotspot identified: slow_work accounts for most of the time"

Solution

import cProfile
import io
import pstats

def fast_work():
    return sum(range(100))

def slow_work():
    total = 0
    for i in range(50000):
        total += i * i
    return total

def main_work():
    results = [fast_work() for _ in range(10)]
    results.append(slow_work())
    return results

pr = cProfile.Profile()
pr.enable()
main_work()
pr.disable()

buf = io.StringIO()
ps = pstats.Stats(pr, stream=buf).sort_stats("tottime")
ps.print_stats(5)
output = buf.getvalue()

if "slow_work" in output:
    print("Hotspot identified: slow_work accounts for most of the time")
print(output)

Hotspot identification:

tottime = time spent exclusively in the function body (not counting callees).
cumtime = total time including all nested calls — useful for finding the "root" of slowness.
fast_work runs 10× but each call is trivial. slow_work runs once and dominates tottime.
Rule: optimize the function with the highest tottime first — not the most frequently called one.

Expected Output

Hotspot identified: slow_work accounts for most of the time

Hints

Hint 1: Use cProfile.Profile() as a context manager or call pr.enable()/pr.disable() around the target code.

Hint 2: Sort by tottime to see which function body (excluding callees) consumes the most CPU.

#4Comparing Algorithmic Complexity EmpiricallyEasy

complexityempiricaltimeitO-nO-n2

Empirically confirm the complexity of a linear vs quadratic function by timing them at increasing input sizes.

import timeit

def linear_sum(n):
    return sum(range(n))

def quadratic_pairs(n):
    count = 0
    for i in range(n):
        for j in range(n):
            count += 1
    return count

# Time both at n=100, 500, 2500 and print scaling ratios
# Print two lines:
# "O(n) scales linearly"
# "O(n^2) scales quadratically"

Solution

import timeit

def linear_sum(n):
    return sum(range(n))

def quadratic_pairs(n):
    count = 0
    for i in range(n):
        for j in range(n):
            count += 1
    return count

sizes = [100, 500, 2500]

linear_times = [
    timeit.timeit(lambda n=n: linear_sum(n), number=200) / 200 * 1e6
    for n in sizes
]
quad_times = [
    timeit.timeit(lambda n=n: quadratic_pairs(n), number=10) / 10 * 1e6
    for n in sizes
]

linear_ratio = linear_times[-1] / linear_times[0]
quad_ratio   = quad_times[-1]   / quad_times[0]

expected_linear = sizes[-1] / sizes[0]          # ~25x
expected_quad   = (sizes[-1] / sizes[0]) ** 2   # ~625x

print(f"linear: {linear_times[0]:.1f}us -> {linear_times[-1]:.1f}us  ratio={linear_ratio:.1f}x (expected ~{expected_linear}x)")
print(f"quad:   {quad_times[0]:.1f}us -> {quad_times[-1]:.1f}us  ratio={quad_ratio:.0f}x (expected ~{expected_quad:.0f}x)")
print("O(n) scales linearly")
print("O(n^2) scales quadratically")

Empirical complexity measurement:

Increase input size by 5× (100 → 500 → 2500). O(n) time grows ~5×; O(n^2) grows ~25×.
Real measurements are noisy — use timeit.repeat and take the minimum for cleaner ratios.
Always measure empirically even when you think you know the complexity — hidden constant factors and cache effects can surprise you.

Expected Output

O(n) scales linearly\nO(n^2) scales quadratically

Hints

Hint 1: Time both functions at multiple input sizes: n=100, 1000, 10000.

Hint 2: If doubling n doubles time, it is O(n). If doubling n quadruples time, it is O(n^2).

Medium

#5Profiling Hypothesis TestingMedium

profilinghypothesisoptimizationbefore-after

Apply the profiling methodology — measure, hypothesize, optimize, verify — on a string concatenation function.

import timeit

def original(words):
    result = ""
    for word in words:
        result += word + " "
    return result.strip()

def optimized(words):
    # Implement a faster version using str.join
    pass

words = ["python"] * 1000

# Verify both produce the same output, then time both
# Print: "Original: Xus  Optimized: Yus  Speedup: Zx"

Solution

import timeit

def original(words):
    result = ""
    for word in words:
        result += word + " "
    return result.strip()

def optimized(words):
    return " ".join(words)

words = ["python"] * 1000

assert original(words) == optimized(words), "correctness check failed"

n = 5000
orig_us = timeit.timeit(lambda: original(words),  number=n) / n * 1e6
opt_us  = timeit.timeit(lambda: optimized(words), number=n) / n * 1e6
speedup = orig_us / opt_us

print(f"Original:  {orig_us:.1f}us")
print(f"Optimized: {opt_us:.1f}us")
print(f"Speedup: {speedup:.1f}x")

Hypothesis-driven optimization:

String += in a loop is O(n^2) — each concatenation allocates a brand-new string object.
str.join() is O(n) — it measures total length once, allocates one buffer, then fills it.
Typical speedup for 1000-word list: 10–50× depending on string length.
Workflow: measure → identify hotspot → form specific hypothesis → implement → measure again → confirm.

Expected Output

Original: Xus  Optimized: Yus  Speedup: Zx

Hints

Hint 1: Profile first — confirm where time is spent. Only then form a hypothesis and optimize.

Hint 2: Verify correctness before measuring speedup. An incorrect optimization is not an optimization.

#6Benchmark Design: Avoiding JIT Warmup BiasMedium

benchmarkingwarmupJITpypytimeit

Build a benchmark that separates warmup runs from measurement runs and quantifies the difference.

import timeit
import time

def compute(n):
    """Some representative work."""
    result = 0
    for i in range(n):
        result += i % 7
    return result

def benchmark_cold(fn, n, number=1000):
    """No warmup — first runs included."""
    times = timeit.repeat(lambda: fn(n), repeat=5, number=number)
    return min(t / number * 1e6 for t in times)

def benchmark_warm(fn, n, warmup=5, number=1000):
    """Warmup runs discarded before measurement."""
    for _ in range(warmup):
        fn(n)
    times = timeit.repeat(lambda: fn(n), repeat=5, number=number)
    return min(t / number * 1e6 for t in times)

# Compare cold vs warm and print results
# Final line format: "warmup matters: True/False"

Solution

import timeit

def compute(n):
    result = 0
    for i in range(n):
        result += i % 7
    return result

def benchmark_cold(fn, n, number=1000):
    times = timeit.repeat(lambda: fn(n), repeat=5, number=number)
    return min(t / number * 1e6 for t in times)

def benchmark_warm(fn, n, warmup=5, number=1000):
    for _ in range(warmup):
        fn(n)
    times = timeit.repeat(lambda: fn(n), repeat=5, number=number)
    return min(t / number * 1e6 for t in times)

cold = benchmark_cold(compute, 10000)
warm = benchmark_warm(compute, 10000)

diff_pct = abs(cold - warm) / warm * 100
matters = diff_pct > 5.0

print(f"cold run: {cold:.2f}us  warm run: {warm:.2f}us  diff: {diff_pct:.1f}%")
print(f"warmup matters: {matters}")

Warmup effects in Python:

CPython caches .pyc bytecode on disk — second import is faster.
OS page cache: first access brings pages into RAM; subsequent accesses hit cache.
On PyPy/Graal, JIT compilation happens during warmup — cold runs are 10–100× slower.
Always warm up before measuring latency-sensitive code. Discard the first 3–5 runs.
Use timeit.repeat(repeat=5) and take min() — not mean() — as your performance target.

Expected Output

cold run: Xus  warm run: Yus  warmup matters: True/False

Hints

Hint 1: Run warmup iterations before your timed measurements and discard them.

Hint 2: First-call overhead includes bytecode compilation cache misses and interpreter setup.

#7Measuring Memory AllocationMedium

tracemallocmemory-profilingallocationsys.getsizeof

Use tracemalloc to compare the peak memory allocated by a list of 10,000 integers vs a dict mapping each integer to its square.

import tracemalloc

def build_list(n):
    return [i for i in range(n)]

def build_dict(n):
    return {i: i * i for i in range(n)}

N = 10000

# Measure peak allocation for build_list(N) and build_dict(N)
# Print:
# "list allocated: X KB"
# "dict allocated: Y KB"
# "larger structure: list" or "larger structure: dict"

Solution

import tracemalloc

def build_list(n):
    return [i for i in range(n)]

def build_dict(n):
    return {i: i * i for i in range(n)}

N = 10000

tracemalloc.start()
build_list(N)
_, list_peak = tracemalloc.get_traced_memory()
tracemalloc.clear_traces()

build_dict(N)
_, dict_peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

list_kb = list_peak / 1024
dict_kb  = dict_peak / 1024
larger   = "dict" if dict_peak > list_peak else "list"

print(f"list allocated: {list_kb:.1f} KB")
print(f"dict allocated: {dict_kb:.1f} KB")
print(f"larger structure: {larger}")

tracemalloc usage:

tracemalloc.get_traced_memory() returns (current, peak) in bytes since tracemalloc.start().
tracemalloc.clear_traces() resets counters — use between measurements.
A list of N ints uses ~8 bytes per pointer + object overhead; a dict uses ~50–70 bytes per entry (hash table overhead).
Dicts are typically 3–5× larger than lists for the same number of integers.

Expected Output

list allocated: X KB\ndict allocated: Y KB\nlarger structure: list/dict

Hints

Hint 1: Use tracemalloc.start() / tracemalloc.stop() and tracemalloc.get_traced_memory() to measure peak allocation.

Hint 2: tracemalloc.clear_traces() between measurements resets the counters.

#8Amdahl's Law — Theoretical Speedup CalculatorMedium

Amdahl's lawspeedupparallelismperformance-model

Implement amdahls_speedup(serial_fraction, num_processors) and use it to show why small serial fractions are critical.

def amdahls_speedup(serial_fraction, num_processors):
    """Return theoretical speedup per Amdahl's law."""
    pass

# Test with serial_fraction=0.2 at N=1,2,4,8,16,32,infinity
# Print: "Serial fraction 0.2: max speedup = 5.0x"
# Test with serial_fraction=0.05
# Print: "Serial fraction 0.05: max speedup = 20.0x"

Solution

def amdahls_speedup(serial_fraction, num_processors):
    if num_processors == float("inf"):
        return 1.0 / serial_fraction
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_processors)

for serial in [0.2, 0.05]:
    processors = [1, 2, 4, 8, 16, 32]
    speedups = [amdahls_speedup(serial, n) for n in processors]
    max_speedup = amdahls_speedup(serial, float("inf"))
    print(f"\nSerial fraction {serial}:")
    for n, s in zip(processors, speedups):
        print(f"  N={n:2d}: {s:.2f}x")
    print(f"Serial fraction {serial}: max speedup = {max_speedup:.1f}x")

Amdahl's Law implications:

If 20% of a program is serial (cannot be parallelized), the maximum possible speedup is 5× — even with infinite cores.
Reducing the serial fraction from 20% to 5% raises the ceiling from 5× to 20×.
Profiling's job is to find and reduce the serial bottleneck, not just add more cores.
Corollary: adding hardware is often cheaper than hiring engineers — but Amdahl sets the hard ceiling.

Expected Output

Serial fraction 0.2: max speedup = 5.0x\nSerial fraction 0.05: max speedup = 20.0x

Hints

Hint 1: Amdahl's law: S = 1 / (F_serial + (1 - F_serial) / N) where F_serial is the fraction that cannot be parallelized and N is the number of processors.

Hint 2: As N approaches infinity, max speedup = 1 / F_serial.

Hard

#9Benchmark Harness with GC Control and Statistical ReportingHard

benchmarkinggcstatisticsproduction-benchmark

Build a production-quality benchmark harness: warmup, GC disabled during measurement, statistical reporting.

import timeit
import gc
import statistics

def benchmark(fn, warmup=3, repeat=7, number=1000):
    """
    Production benchmark:
    1. Warmup (not measured)
    2. GC disabled during timed section
    3. Return dict with min_us, mean_us, stdev_us, samples
    """
    pass

def target():
    return sorted([3, 1, 4, 1, 5, 9, 2, 6] * 100)

result = benchmark(target)
print(f"min={result['min_us']:.2f}us  mean={result['mean_us']:.2f}us  stdev={result['stdev_us']:.3f}us  n={result['samples']}")

Solution

import timeit
import gc
import statistics

def benchmark(fn, warmup=3, repeat=7, number=1000):
    for _ in range(warmup):
        fn()

    gc.disable()
    try:
        times = timeit.repeat(fn, repeat=repeat, number=number)
    finally:
        gc.enable()

    per_call_us = [t / number * 1e6 for t in times]
    return {
        "min_us":   min(per_call_us),
        "mean_us":  statistics.mean(per_call_us),
        "stdev_us": statistics.stdev(per_call_us),
        "samples":  len(per_call_us),
    }

def target():
    return sorted([3, 1, 4, 1, 5, 9, 2, 6] * 100)

result = benchmark(target)
print(f"min={result['min_us']:.2f}us  mean={result['mean_us']:.2f}us  stdev={result['stdev_us']:.3f}us  n={result['samples']}")

Production benchmark best practices:

GC disabled during measurement prevents unpredictable GC pauses from inflating individual samples. Always re-enable in finally.
Warmup: populates OS page cache, Python bytecode cache, and JIT (on PyPy/GraalPy). Warmup samples are discarded.
Report min as your performance target — it represents the cleanest execution with least interference.
stdev reveals consistency: high stdev means results are noisy (OS scheduling, thermal throttling).
Additional best practices for critical benchmarks: pin the process to a CPU core (taskset -c 0 on Linux), disable CPU frequency scaling.

Expected Output

min=Xus  mean=Yus  stdev=Zus  n=7

Hints

Hint 1: Disable GC during measurement with gc.disable() and always re-enable in a finally block.

Hint 2: Use timeit.repeat(repeat=7) for 7 independent samples. Report min (best case) and stdev (noise level).

#10Profiling a Real Algorithm — O(n^2) vs O(n)Hard

profilingalgorithmset-lookupoptimizationbefore-after

Profile find_duplicates_naive (O(n^2) list search) vs find_duplicates_fast (O(n) set lookup). Quantify the speedup and explain the profiling evidence.

import timeit
import cProfile
import io
import pstats

def find_duplicates_naive(items):
    seen = []
    duplicates = []
    for item in items:
        if item in seen:
            duplicates.append(item)
        else:
            seen.append(item)
    return duplicates

def find_duplicates_fast(items):
    seen = set()
    duplicates = []
    for item in items:
        if item in seen:
            duplicates.append(item)
        else:
            seen.add(item)
    return duplicates

data = list(range(2000)) + list(range(1000))

# 1. Assert both produce the same result
# 2. Time both at number=200, report ms per call
# 3. Print speedup
# Output format:
# "O(n^2): Xms"
# "O(n): Yms"
# "Speedup: Zx"

Solution

import timeit

def find_duplicates_naive(items):
    seen = []
    duplicates = []
    for item in items:
        if item in seen:
            duplicates.append(item)
        else:
            seen.append(item)
    return duplicates

def find_duplicates_fast(items):
    seen = set()
    duplicates = []
    for item in items:
        if item in seen:
            duplicates.append(item)
        else:
            seen.add(item)
    return duplicates

data = list(range(2000)) + list(range(1000))

assert sorted(find_duplicates_naive(data)) == sorted(find_duplicates_fast(data))

n = 200
naive_ms = timeit.timeit(lambda: find_duplicates_naive(data), number=n) / n * 1000
fast_ms  = timeit.timeit(lambda: find_duplicates_fast(data),  number=n) / n * 1000

print(f"O(n^2): {naive_ms:.2f}ms")
print(f"O(n):   {fast_ms:.3f}ms")
print(f"Speedup: {naive_ms/fast_ms:.0f}x")

Data structure impact on algorithmic complexity:

list.__contains__ is O(n): scans from the start on every check.
set.__contains__ is O(1) average: hash-based lookup.
For 3000 items, naive does ~4.5M comparisons; fast does ~3000 hash lookups.
Profiling find_duplicates_naive with cProfile shows list.__contains__ dominating tottime.
Key lesson: data structure choice (list vs set) has more impact than any micro-optimization.

Expected Output

O(n^2): Xms\nO(n): Yms\nSpeedup: Zx

Hints

Hint 1: Profile both versions with cProfile to see which internal operation dominates tottime.

Hint 2: list.__contains__ is O(n); set.__contains__ is O(1). For 3000 items, the difference is 3000 comparisons vs 1 hash lookup.

#11Performance Regression Test with Budget AssertionHard

regression-testperformance-budgetCIbenchmarking

Write a PerformanceBudget class that asserts a function finishes within a time budget and raises AssertionError with a clear message when it does not.

import timeit

class PerformanceBudget:
    def __init__(self, max_us, name=""):
        self.max_us = max_us
        self.name   = name

    def assert_within_budget(self, fn, warmup=2, repeat=5, number=500):
        """
        Warm up fn, then measure best time.
        Raise AssertionError if best_us > self.max_us.
        Return best_us if within budget.
        """
        pass

def efficient_sum(n):
    return n * (n - 1) // 2

budget = PerformanceBudget(max_us=10.0, name="efficient_sum")
result = budget.assert_within_budget(lambda: efficient_sum(10000))
print(f"Performance within budget: {result:.2f}us")

tight = PerformanceBudget(max_us=0.001, name="artificially_tight")
try:
    tight.assert_within_budget(lambda: efficient_sum(10000))
except AssertionError:
    print("Regression detected when threshold exceeded")

Solution

import timeit

class PerformanceBudget:
    def __init__(self, max_us, name=""):
        self.max_us = max_us
        self.name   = name

    def assert_within_budget(self, fn, warmup=2, repeat=5, number=500):
        for _ in range(warmup):
            fn()
        times = timeit.repeat(fn, repeat=repeat, number=number)
        best_us = min(t / number * 1e6 for t in times)
        if best_us > self.max_us:
            raise AssertionError(
                f"PERFORMANCE REGRESSION: {self.name} took {best_us:.2f}us "
                f"(budget: {self.max_us}us)"
            )
        return best_us

def efficient_sum(n):
    return n * (n - 1) // 2

budget = PerformanceBudget(max_us=10.0, name="efficient_sum")
result = budget.assert_within_budget(lambda: efficient_sum(10000))
print(f"Performance within budget: {result:.2f}us")

tight = PerformanceBudget(max_us=0.001, name="artificially_tight")
try:
    tight.assert_within_budget(lambda: efficient_sum(10000))
except AssertionError as e:
    print("Regression detected when threshold exceeded")

Performance regression testing in CI:

Use min() over repeated samples to get the best-achievable time — minimizes false positives from CI load.
Set budget conservatively: if the function typically takes 2us, set the budget to 10us for headroom.
Integrate with pytest: pytest-benchmark provides rich CI reporting and baseline tracking.
Store baselines in CI artifacts and compare on each PR to detect regressions automatically.
Budget tightness should match criticality: hot paths need tighter budgets than infrequent operations.

Expected Output

Performance within budget: Xus\nRegression detected when threshold exceeded

Hints

Hint 1: Define a maximum acceptable time (budget). Raise AssertionError if the function exceeds it.

Hint 2: Use timeit.repeat and take the minimum to avoid flaky failures from OS scheduling noise.

Practice: Profiling Strategy

Easy​

Medium​

Hard​

Easy

Medium

Hard