Skip to main content

Python Profiling Strategy Practice Problems & Exercises

Practice: Profiling Strategy

11 problems4 Easy4 Medium3 Hard60–90 min
← Back to lesson

Easy

#1timeit BasicsEasy
timeitbenchmarkingmicro-benchmark

Use timeit to compare the performance of a list comprehension vs map() for squaring numbers 0–999. Print which is faster.

import timeit

def bench_list_comp():
return [x * x for x in range(1000)]

def bench_map():
return list(map(lambda x: x * x, range(1000)))

n = 10000
lc_time = timeit.timeit(bench_list_comp, number=n)
map_time = timeit.timeit(bench_map, number=n)

# Print per-call times and which is faster
# Expected final line: "list comprehension is faster than map+list"
Solution
import timeit

def bench_list_comp():
return [x * x for x in range(1000)]

def bench_map():
return list(map(lambda x: x * x, range(1000)))

n = 10000
lc_time = timeit.timeit(bench_list_comp, number=n)
map_time = timeit.timeit(bench_map, number=n)

print(f"list comp: {lc_time/n*1e6:.2f} us/call")
print(f"map+list: {map_time/n*1e6:.2f} us/call")
winner = "list comprehension" if lc_time < map_time else "map+list"
other = "map+list" if winner == "list comprehension" else "list comprehension"
print(f"{winner} is faster than {other}")

Key concepts:

  • number=10000 runs the function 10,000 times to average out OS scheduling noise.
  • Divide total time by number to get per-call time.
  • List comprehensions are typically faster than map + lambda because they use the specialized LIST_APPEND bytecode — no Python-level lambda dispatch.
  • Always run benchmarks multiple times; the first run may include interpreter warmup effects.
Expected Output
list comprehension is faster than map+list
Hints

Hint 1: Use timeit.timeit(fn, number=N) to run fn N times and return total elapsed seconds.

Hint 2: Divide by number to get per-call time, multiply by 1e6 to convert to microseconds.

#2time.perf_counter Manual TimingEasy
time.perf_countermanual-timingprofiling

Use time.perf_counter() to manually time how long it takes to build a dictionary of 100,000 squares.

import time

def build_squares(n):
return {i: i * i for i in range(n)}

# Time build_squares(100_000) using perf_counter
# Print: "elapsed: X ms" (round to 2 decimal places)
Solution
import time

def build_squares(n):
return {i: i * i for i in range(n)}

start = time.perf_counter()
build_squares(100_000)
end = time.perf_counter()

elapsed_ms = (end - start) * 1000
print(f"elapsed: {elapsed_ms:.2f} ms")

Why perf_counter:

  • time.perf_counter() is the highest-resolution clock available on the platform.
  • It measures wall-clock time — includes I/O waits and OS scheduling.
  • For pure computation, wall time ≈ CPU time. For I/O-heavy code, use time.process_time() for CPU-only measurement.
  • perf_counter is monotonic — it never goes backwards, making it safe for elapsed-time calculations.
Expected Output
elapsed: X ms
Hints

Hint 1: time.perf_counter() returns a float of seconds with the highest available resolution.

Hint 2: Subtract the start value from the end value to get elapsed time.

#3Identify the HotspotEasy
profilinghotspotcProfilebottleneck

Given a main_work() function that calls both a fast and a slow sub-function, use cProfile to identify the hotspot.

import cProfile
import io
import pstats

def fast_work():
return sum(range(100))

def slow_work():
total = 0
for i in range(50000):
total += i * i
return total

def main_work():
results = [fast_work() for _ in range(10)]
results.append(slow_work())
return results

# Profile main_work, capture output, check if slow_work appears as hotspot
# Print: "Hotspot identified: slow_work accounts for most of the time"
Solution
import cProfile
import io
import pstats

def fast_work():
return sum(range(100))

def slow_work():
total = 0
for i in range(50000):
total += i * i
return total

def main_work():
results = [fast_work() for _ in range(10)]
results.append(slow_work())
return results

pr = cProfile.Profile()
pr.enable()
main_work()
pr.disable()

buf = io.StringIO()
ps = pstats.Stats(pr, stream=buf).sort_stats("tottime")
ps.print_stats(5)
output = buf.getvalue()

if "slow_work" in output:
print("Hotspot identified: slow_work accounts for most of the time")
print(output)

Hotspot identification:

  • tottime = time spent exclusively in the function body (not counting callees).
  • cumtime = total time including all nested calls — useful for finding the "root" of slowness.
  • fast_work runs 10× but each call is trivial. slow_work runs once and dominates tottime.
  • Rule: optimize the function with the highest tottime first — not the most frequently called one.
Expected Output
Hotspot identified: slow_work accounts for most of the time
Hints

Hint 1: Use cProfile.Profile() as a context manager or call pr.enable()/pr.disable() around the target code.

Hint 2: Sort by tottime to see which function body (excluding callees) consumes the most CPU.

#4Comparing Algorithmic Complexity EmpiricallyEasy
complexityempiricaltimeitO-nO-n2

Empirically confirm the complexity of a linear vs quadratic function by timing them at increasing input sizes.

import timeit

def linear_sum(n):
return sum(range(n))

def quadratic_pairs(n):
count = 0
for i in range(n):
for j in range(n):
count += 1
return count

# Time both at n=100, 500, 2500 and print scaling ratios
# Print two lines:
# "O(n) scales linearly"
# "O(n^2) scales quadratically"
Solution
import timeit

def linear_sum(n):
return sum(range(n))

def quadratic_pairs(n):
count = 0
for i in range(n):
for j in range(n):
count += 1
return count

sizes = [100, 500, 2500]

linear_times = [
timeit.timeit(lambda n=n: linear_sum(n), number=200) / 200 * 1e6
for n in sizes
]
quad_times = [
timeit.timeit(lambda n=n: quadratic_pairs(n), number=10) / 10 * 1e6
for n in sizes
]

linear_ratio = linear_times[-1] / linear_times[0]
quad_ratio = quad_times[-1] / quad_times[0]

expected_linear = sizes[-1] / sizes[0] # ~25x
expected_quad = (sizes[-1] / sizes[0]) ** 2 # ~625x

print(f"linear: {linear_times[0]:.1f}us -> {linear_times[-1]:.1f}us ratio={linear_ratio:.1f}x (expected ~{expected_linear}x)")
print(f"quad: {quad_times[0]:.1f}us -> {quad_times[-1]:.1f}us ratio={quad_ratio:.0f}x (expected ~{expected_quad:.0f}x)")
print("O(n) scales linearly")
print("O(n^2) scales quadratically")

Empirical complexity measurement:

  • Increase input size by 5× (100 → 500 → 2500). O(n) time grows ~5×; O(n^2) grows ~25×.
  • Real measurements are noisy — use timeit.repeat and take the minimum for cleaner ratios.
  • Always measure empirically even when you think you know the complexity — hidden constant factors and cache effects can surprise you.
Expected Output
O(n) scales linearly\nO(n^2) scales quadratically
Hints

Hint 1: Time both functions at multiple input sizes: n=100, 1000, 10000.

Hint 2: If doubling n doubles time, it is O(n). If doubling n quadruples time, it is O(n^2).


Medium

#5Profiling Hypothesis TestingMedium
profilinghypothesisoptimizationbefore-after

Apply the profiling methodology — measure, hypothesize, optimize, verify — on a string concatenation function.

import timeit

def original(words):
result = ""
for word in words:
result += word + " "
return result.strip()

def optimized(words):
# Implement a faster version using str.join
pass

words = ["python"] * 1000

# Verify both produce the same output, then time both
# Print: "Original: Xus Optimized: Yus Speedup: Zx"
Solution
import timeit

def original(words):
result = ""
for word in words:
result += word + " "
return result.strip()

def optimized(words):
return " ".join(words)

words = ["python"] * 1000

assert original(words) == optimized(words), "correctness check failed"

n = 5000
orig_us = timeit.timeit(lambda: original(words), number=n) / n * 1e6
opt_us = timeit.timeit(lambda: optimized(words), number=n) / n * 1e6
speedup = orig_us / opt_us

print(f"Original: {orig_us:.1f}us")
print(f"Optimized: {opt_us:.1f}us")
print(f"Speedup: {speedup:.1f}x")

Hypothesis-driven optimization:

  • String += in a loop is O(n^2) — each concatenation allocates a brand-new string object.
  • str.join() is O(n) — it measures total length once, allocates one buffer, then fills it.
  • Typical speedup for 1000-word list: 10–50× depending on string length.
  • Workflow: measure → identify hotspot → form specific hypothesis → implement → measure again → confirm.
Expected Output
Original: Xus  Optimized: Yus  Speedup: Zx
Hints

Hint 1: Profile first — confirm where time is spent. Only then form a hypothesis and optimize.

Hint 2: Verify correctness before measuring speedup. An incorrect optimization is not an optimization.

#6Benchmark Design: Avoiding JIT Warmup BiasMedium
benchmarkingwarmupJITpypytimeit

Build a benchmark that separates warmup runs from measurement runs and quantifies the difference.

import timeit
import time

def compute(n):
"""Some representative work."""
result = 0
for i in range(n):
result += i % 7
return result

def benchmark_cold(fn, n, number=1000):
"""No warmup — first runs included."""
times = timeit.repeat(lambda: fn(n), repeat=5, number=number)
return min(t / number * 1e6 for t in times)

def benchmark_warm(fn, n, warmup=5, number=1000):
"""Warmup runs discarded before measurement."""
for _ in range(warmup):
fn(n)
times = timeit.repeat(lambda: fn(n), repeat=5, number=number)
return min(t / number * 1e6 for t in times)

# Compare cold vs warm and print results
# Final line format: "warmup matters: True/False"
Solution
import timeit

def compute(n):
result = 0
for i in range(n):
result += i % 7
return result

def benchmark_cold(fn, n, number=1000):
times = timeit.repeat(lambda: fn(n), repeat=5, number=number)
return min(t / number * 1e6 for t in times)

def benchmark_warm(fn, n, warmup=5, number=1000):
for _ in range(warmup):
fn(n)
times = timeit.repeat(lambda: fn(n), repeat=5, number=number)
return min(t / number * 1e6 for t in times)

cold = benchmark_cold(compute, 10000)
warm = benchmark_warm(compute, 10000)

diff_pct = abs(cold - warm) / warm * 100
matters = diff_pct > 5.0

print(f"cold run: {cold:.2f}us warm run: {warm:.2f}us diff: {diff_pct:.1f}%")
print(f"warmup matters: {matters}")

Warmup effects in Python:

  • CPython caches .pyc bytecode on disk — second import is faster.
  • OS page cache: first access brings pages into RAM; subsequent accesses hit cache.
  • On PyPy/Graal, JIT compilation happens during warmup — cold runs are 10–100× slower.
  • Always warm up before measuring latency-sensitive code. Discard the first 3–5 runs.
  • Use timeit.repeat(repeat=5) and take min() — not mean() — as your performance target.
Expected Output
cold run: Xus  warm run: Yus  warmup matters: True/False
Hints

Hint 1: Run warmup iterations before your timed measurements and discard them.

Hint 2: First-call overhead includes bytecode compilation cache misses and interpreter setup.

#7Measuring Memory AllocationMedium
tracemallocmemory-profilingallocationsys.getsizeof

Use tracemalloc to compare the peak memory allocated by a list of 10,000 integers vs a dict mapping each integer to its square.

import tracemalloc

def build_list(n):
return [i for i in range(n)]

def build_dict(n):
return {i: i * i for i in range(n)}

N = 10000

# Measure peak allocation for build_list(N) and build_dict(N)
# Print:
# "list allocated: X KB"
# "dict allocated: Y KB"
# "larger structure: list" or "larger structure: dict"
Solution
import tracemalloc

def build_list(n):
return [i for i in range(n)]

def build_dict(n):
return {i: i * i for i in range(n)}

N = 10000

tracemalloc.start()
build_list(N)
_, list_peak = tracemalloc.get_traced_memory()
tracemalloc.clear_traces()

build_dict(N)
_, dict_peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

list_kb = list_peak / 1024
dict_kb = dict_peak / 1024
larger = "dict" if dict_peak > list_peak else "list"

print(f"list allocated: {list_kb:.1f} KB")
print(f"dict allocated: {dict_kb:.1f} KB")
print(f"larger structure: {larger}")

tracemalloc usage:

  • tracemalloc.get_traced_memory() returns (current, peak) in bytes since tracemalloc.start().
  • tracemalloc.clear_traces() resets counters — use between measurements.
  • A list of N ints uses ~8 bytes per pointer + object overhead; a dict uses ~50–70 bytes per entry (hash table overhead).
  • Dicts are typically 3–5× larger than lists for the same number of integers.
Expected Output
list allocated: X KB\ndict allocated: Y KB\nlarger structure: list/dict
Hints

Hint 1: Use tracemalloc.start() / tracemalloc.stop() and tracemalloc.get_traced_memory() to measure peak allocation.

Hint 2: tracemalloc.clear_traces() between measurements resets the counters.

#8Amdahl's Law — Theoretical Speedup CalculatorMedium
Amdahl's lawspeedupparallelismperformance-model

Implement amdahls_speedup(serial_fraction, num_processors) and use it to show why small serial fractions are critical.

def amdahls_speedup(serial_fraction, num_processors):
"""Return theoretical speedup per Amdahl's law."""
pass

# Test with serial_fraction=0.2 at N=1,2,4,8,16,32,infinity
# Print: "Serial fraction 0.2: max speedup = 5.0x"
# Test with serial_fraction=0.05
# Print: "Serial fraction 0.05: max speedup = 20.0x"
Solution
def amdahls_speedup(serial_fraction, num_processors):
if num_processors == float("inf"):
return 1.0 / serial_fraction
parallel_fraction = 1.0 - serial_fraction
return 1.0 / (serial_fraction + parallel_fraction / num_processors)

for serial in [0.2, 0.05]:
processors = [1, 2, 4, 8, 16, 32]
speedups = [amdahls_speedup(serial, n) for n in processors]
max_speedup = amdahls_speedup(serial, float("inf"))
print(f"\nSerial fraction {serial}:")
for n, s in zip(processors, speedups):
print(f" N={n:2d}: {s:.2f}x")
print(f"Serial fraction {serial}: max speedup = {max_speedup:.1f}x")

Amdahl's Law implications:

  • If 20% of a program is serial (cannot be parallelized), the maximum possible speedup is 5× — even with infinite cores.
  • Reducing the serial fraction from 20% to 5% raises the ceiling from 5× to 20×.
  • Profiling's job is to find and reduce the serial bottleneck, not just add more cores.
  • Corollary: adding hardware is often cheaper than hiring engineers — but Amdahl sets the hard ceiling.
Expected Output
Serial fraction 0.2: max speedup = 5.0x\nSerial fraction 0.05: max speedup = 20.0x
Hints

Hint 1: Amdahl's law: S = 1 / (F_serial + (1 - F_serial) / N) where F_serial is the fraction that cannot be parallelized and N is the number of processors.

Hint 2: As N approaches infinity, max speedup = 1 / F_serial.


Hard

#9Benchmark Harness with GC Control and Statistical ReportingHard
benchmarkinggcstatisticsproduction-benchmark

Build a production-quality benchmark harness: warmup, GC disabled during measurement, statistical reporting.

import timeit
import gc
import statistics

def benchmark(fn, warmup=3, repeat=7, number=1000):
"""
Production benchmark:
1. Warmup (not measured)
2. GC disabled during timed section
3. Return dict with min_us, mean_us, stdev_us, samples
"""
pass

def target():
return sorted([3, 1, 4, 1, 5, 9, 2, 6] * 100)

result = benchmark(target)
print(f"min={result['min_us']:.2f}us mean={result['mean_us']:.2f}us stdev={result['stdev_us']:.3f}us n={result['samples']}")
Solution
import timeit
import gc
import statistics

def benchmark(fn, warmup=3, repeat=7, number=1000):
for _ in range(warmup):
fn()

gc.disable()
try:
times = timeit.repeat(fn, repeat=repeat, number=number)
finally:
gc.enable()

per_call_us = [t / number * 1e6 for t in times]
return {
"min_us": min(per_call_us),
"mean_us": statistics.mean(per_call_us),
"stdev_us": statistics.stdev(per_call_us),
"samples": len(per_call_us),
}

def target():
return sorted([3, 1, 4, 1, 5, 9, 2, 6] * 100)

result = benchmark(target)
print(f"min={result['min_us']:.2f}us mean={result['mean_us']:.2f}us stdev={result['stdev_us']:.3f}us n={result['samples']}")

Production benchmark best practices:

  • GC disabled during measurement prevents unpredictable GC pauses from inflating individual samples. Always re-enable in finally.
  • Warmup: populates OS page cache, Python bytecode cache, and JIT (on PyPy/GraalPy). Warmup samples are discarded.
  • Report min as your performance target — it represents the cleanest execution with least interference.
  • stdev reveals consistency: high stdev means results are noisy (OS scheduling, thermal throttling).
  • Additional best practices for critical benchmarks: pin the process to a CPU core (taskset -c 0 on Linux), disable CPU frequency scaling.
Expected Output
min=Xus  mean=Yus  stdev=Zus  n=7
Hints

Hint 1: Disable GC during measurement with gc.disable() and always re-enable in a finally block.

Hint 2: Use timeit.repeat(repeat=7) for 7 independent samples. Report min (best case) and stdev (noise level).

#10Profiling a Real Algorithm — O(n^2) vs O(n)Hard
profilingalgorithmset-lookupoptimizationbefore-after

Profile find_duplicates_naive (O(n^2) list search) vs find_duplicates_fast (O(n) set lookup). Quantify the speedup and explain the profiling evidence.

import timeit
import cProfile
import io
import pstats

def find_duplicates_naive(items):
seen = []
duplicates = []
for item in items:
if item in seen:
duplicates.append(item)
else:
seen.append(item)
return duplicates

def find_duplicates_fast(items):
seen = set()
duplicates = []
for item in items:
if item in seen:
duplicates.append(item)
else:
seen.add(item)
return duplicates

data = list(range(2000)) + list(range(1000))

# 1. Assert both produce the same result
# 2. Time both at number=200, report ms per call
# 3. Print speedup
# Output format:
# "O(n^2): Xms"
# "O(n): Yms"
# "Speedup: Zx"
Solution
import timeit

def find_duplicates_naive(items):
seen = []
duplicates = []
for item in items:
if item in seen:
duplicates.append(item)
else:
seen.append(item)
return duplicates

def find_duplicates_fast(items):
seen = set()
duplicates = []
for item in items:
if item in seen:
duplicates.append(item)
else:
seen.add(item)
return duplicates

data = list(range(2000)) + list(range(1000))

assert sorted(find_duplicates_naive(data)) == sorted(find_duplicates_fast(data))

n = 200
naive_ms = timeit.timeit(lambda: find_duplicates_naive(data), number=n) / n * 1000
fast_ms = timeit.timeit(lambda: find_duplicates_fast(data), number=n) / n * 1000

print(f"O(n^2): {naive_ms:.2f}ms")
print(f"O(n): {fast_ms:.3f}ms")
print(f"Speedup: {naive_ms/fast_ms:.0f}x")

Data structure impact on algorithmic complexity:

  • list.__contains__ is O(n): scans from the start on every check.
  • set.__contains__ is O(1) average: hash-based lookup.
  • For 3000 items, naive does ~4.5M comparisons; fast does ~3000 hash lookups.
  • Profiling find_duplicates_naive with cProfile shows list.__contains__ dominating tottime.
  • Key lesson: data structure choice (list vs set) has more impact than any micro-optimization.
Expected Output
O(n^2): Xms\nO(n): Yms\nSpeedup: Zx
Hints

Hint 1: Profile both versions with cProfile to see which internal operation dominates tottime.

Hint 2: list.__contains__ is O(n); set.__contains__ is O(1). For 3000 items, the difference is 3000 comparisons vs 1 hash lookup.

#11Performance Regression Test with Budget AssertionHard
regression-testperformance-budgetCIbenchmarking

Write a PerformanceBudget class that asserts a function finishes within a time budget and raises AssertionError with a clear message when it does not.

import timeit

class PerformanceBudget:
def __init__(self, max_us, name=""):
self.max_us = max_us
self.name = name

def assert_within_budget(self, fn, warmup=2, repeat=5, number=500):
"""
Warm up fn, then measure best time.
Raise AssertionError if best_us > self.max_us.
Return best_us if within budget.
"""
pass

def efficient_sum(n):
return n * (n - 1) // 2

budget = PerformanceBudget(max_us=10.0, name="efficient_sum")
result = budget.assert_within_budget(lambda: efficient_sum(10000))
print(f"Performance within budget: {result:.2f}us")

tight = PerformanceBudget(max_us=0.001, name="artificially_tight")
try:
tight.assert_within_budget(lambda: efficient_sum(10000))
except AssertionError:
print("Regression detected when threshold exceeded")
Solution
import timeit

class PerformanceBudget:
def __init__(self, max_us, name=""):
self.max_us = max_us
self.name = name

def assert_within_budget(self, fn, warmup=2, repeat=5, number=500):
for _ in range(warmup):
fn()
times = timeit.repeat(fn, repeat=repeat, number=number)
best_us = min(t / number * 1e6 for t in times)
if best_us > self.max_us:
raise AssertionError(
f"PERFORMANCE REGRESSION: {self.name} took {best_us:.2f}us "
f"(budget: {self.max_us}us)"
)
return best_us

def efficient_sum(n):
return n * (n - 1) // 2

budget = PerformanceBudget(max_us=10.0, name="efficient_sum")
result = budget.assert_within_budget(lambda: efficient_sum(10000))
print(f"Performance within budget: {result:.2f}us")

tight = PerformanceBudget(max_us=0.001, name="artificially_tight")
try:
tight.assert_within_budget(lambda: efficient_sum(10000))
except AssertionError as e:
print("Regression detected when threshold exceeded")

Performance regression testing in CI:

  • Use min() over repeated samples to get the best-achievable time — minimizes false positives from CI load.
  • Set budget conservatively: if the function typically takes 2us, set the budget to 10us for headroom.
  • Integrate with pytest: pytest-benchmark provides rich CI reporting and baseline tracking.
  • Store baselines in CI artifacts and compare on each PR to detect regressions automatically.
  • Budget tightness should match criticality: hot paths need tighter budgets than infrequent operations.
Expected Output
Performance within budget: Xus\nRegression detected when threshold exceeded
Hints

Hint 1: Define a maximum acceptable time (budget). Raise AssertionError if the function exceeds it.

Hint 2: Use timeit.repeat and take the minimum to avoid flaky failures from OS scheduling noise.

© 2026 EngineersOfAI. All rights reserved.