Python Profiling Strategy Practice Problems & Exercises
Practice: Profiling Strategy
← Back to lessonEasy
Use timeit to compare the performance of a list comprehension vs map() for squaring numbers 0–999. Print which is faster.
import timeit
def bench_list_comp():
return [x * x for x in range(1000)]
def bench_map():
return list(map(lambda x: x * x, range(1000)))
n = 10000
lc_time = timeit.timeit(bench_list_comp, number=n)
map_time = timeit.timeit(bench_map, number=n)
# Print per-call times and which is faster
# Expected final line: "list comprehension is faster than map+list"
Solution
import timeit
def bench_list_comp():
return [x * x for x in range(1000)]
def bench_map():
return list(map(lambda x: x * x, range(1000)))
n = 10000
lc_time = timeit.timeit(bench_list_comp, number=n)
map_time = timeit.timeit(bench_map, number=n)
print(f"list comp: {lc_time/n*1e6:.2f} us/call")
print(f"map+list: {map_time/n*1e6:.2f} us/call")
winner = "list comprehension" if lc_time < map_time else "map+list"
other = "map+list" if winner == "list comprehension" else "list comprehension"
print(f"{winner} is faster than {other}")
Key concepts:
number=10000runs the function 10,000 times to average out OS scheduling noise.- Divide total time by
numberto get per-call time. - List comprehensions are typically faster than
map+lambdabecause they use the specializedLIST_APPENDbytecode — no Python-level lambda dispatch. - Always run benchmarks multiple times; the first run may include interpreter warmup effects.
Expected Output
list comprehension is faster than map+listHints
Hint 1: Use timeit.timeit(fn, number=N) to run fn N times and return total elapsed seconds.
Hint 2: Divide by number to get per-call time, multiply by 1e6 to convert to microseconds.
Use time.perf_counter() to manually time how long it takes to build a dictionary of 100,000 squares.
import time
def build_squares(n):
return {i: i * i for i in range(n)}
# Time build_squares(100_000) using perf_counter
# Print: "elapsed: X ms" (round to 2 decimal places)
Solution
import time
def build_squares(n):
return {i: i * i for i in range(n)}
start = time.perf_counter()
build_squares(100_000)
end = time.perf_counter()
elapsed_ms = (end - start) * 1000
print(f"elapsed: {elapsed_ms:.2f} ms")
Why perf_counter:
time.perf_counter()is the highest-resolution clock available on the platform.- It measures wall-clock time — includes I/O waits and OS scheduling.
- For pure computation, wall time ≈ CPU time. For I/O-heavy code, use
time.process_time()for CPU-only measurement. perf_counteris monotonic — it never goes backwards, making it safe for elapsed-time calculations.
Expected Output
elapsed: X msHints
Hint 1: time.perf_counter() returns a float of seconds with the highest available resolution.
Hint 2: Subtract the start value from the end value to get elapsed time.
Given a main_work() function that calls both a fast and a slow sub-function, use cProfile to identify the hotspot.
import cProfile
import io
import pstats
def fast_work():
return sum(range(100))
def slow_work():
total = 0
for i in range(50000):
total += i * i
return total
def main_work():
results = [fast_work() for _ in range(10)]
results.append(slow_work())
return results
# Profile main_work, capture output, check if slow_work appears as hotspot
# Print: "Hotspot identified: slow_work accounts for most of the time"
Solution
import cProfile
import io
import pstats
def fast_work():
return sum(range(100))
def slow_work():
total = 0
for i in range(50000):
total += i * i
return total
def main_work():
results = [fast_work() for _ in range(10)]
results.append(slow_work())
return results
pr = cProfile.Profile()
pr.enable()
main_work()
pr.disable()
buf = io.StringIO()
ps = pstats.Stats(pr, stream=buf).sort_stats("tottime")
ps.print_stats(5)
output = buf.getvalue()
if "slow_work" in output:
print("Hotspot identified: slow_work accounts for most of the time")
print(output)
Hotspot identification:
tottime= time spent exclusively in the function body (not counting callees).cumtime= total time including all nested calls — useful for finding the "root" of slowness.fast_workruns 10× but each call is trivial.slow_workruns once and dominatestottime.- Rule: optimize the function with the highest
tottimefirst — not the most frequently called one.
Expected Output
Hotspot identified: slow_work accounts for most of the timeHints
Hint 1: Use cProfile.Profile() as a context manager or call pr.enable()/pr.disable() around the target code.
Hint 2: Sort by tottime to see which function body (excluding callees) consumes the most CPU.
Empirically confirm the complexity of a linear vs quadratic function by timing them at increasing input sizes.
import timeit
def linear_sum(n):
return sum(range(n))
def quadratic_pairs(n):
count = 0
for i in range(n):
for j in range(n):
count += 1
return count
# Time both at n=100, 500, 2500 and print scaling ratios
# Print two lines:
# "O(n) scales linearly"
# "O(n^2) scales quadratically"
Solution
import timeit
def linear_sum(n):
return sum(range(n))
def quadratic_pairs(n):
count = 0
for i in range(n):
for j in range(n):
count += 1
return count
sizes = [100, 500, 2500]
linear_times = [
timeit.timeit(lambda n=n: linear_sum(n), number=200) / 200 * 1e6
for n in sizes
]
quad_times = [
timeit.timeit(lambda n=n: quadratic_pairs(n), number=10) / 10 * 1e6
for n in sizes
]
linear_ratio = linear_times[-1] / linear_times[0]
quad_ratio = quad_times[-1] / quad_times[0]
expected_linear = sizes[-1] / sizes[0] # ~25x
expected_quad = (sizes[-1] / sizes[0]) ** 2 # ~625x
print(f"linear: {linear_times[0]:.1f}us -> {linear_times[-1]:.1f}us ratio={linear_ratio:.1f}x (expected ~{expected_linear}x)")
print(f"quad: {quad_times[0]:.1f}us -> {quad_times[-1]:.1f}us ratio={quad_ratio:.0f}x (expected ~{expected_quad:.0f}x)")
print("O(n) scales linearly")
print("O(n^2) scales quadratically")
Empirical complexity measurement:
- Increase input size by 5× (100 → 500 → 2500). O(n) time grows ~5×; O(n^2) grows ~25×.
- Real measurements are noisy — use
timeit.repeatand take the minimum for cleaner ratios. - Always measure empirically even when you think you know the complexity — hidden constant factors and cache effects can surprise you.
Expected Output
O(n) scales linearly\nO(n^2) scales quadraticallyHints
Hint 1: Time both functions at multiple input sizes: n=100, 1000, 10000.
Hint 2: If doubling n doubles time, it is O(n). If doubling n quadruples time, it is O(n^2).
Medium
Apply the profiling methodology — measure, hypothesize, optimize, verify — on a string concatenation function.
import timeit
def original(words):
result = ""
for word in words:
result += word + " "
return result.strip()
def optimized(words):
# Implement a faster version using str.join
pass
words = ["python"] * 1000
# Verify both produce the same output, then time both
# Print: "Original: Xus Optimized: Yus Speedup: Zx"
Solution
import timeit
def original(words):
result = ""
for word in words:
result += word + " "
return result.strip()
def optimized(words):
return " ".join(words)
words = ["python"] * 1000
assert original(words) == optimized(words), "correctness check failed"
n = 5000
orig_us = timeit.timeit(lambda: original(words), number=n) / n * 1e6
opt_us = timeit.timeit(lambda: optimized(words), number=n) / n * 1e6
speedup = orig_us / opt_us
print(f"Original: {orig_us:.1f}us")
print(f"Optimized: {opt_us:.1f}us")
print(f"Speedup: {speedup:.1f}x")
Hypothesis-driven optimization:
- String
+=in a loop is O(n^2) — each concatenation allocates a brand-new string object. str.join()is O(n) — it measures total length once, allocates one buffer, then fills it.- Typical speedup for 1000-word list: 10–50× depending on string length.
- Workflow: measure → identify hotspot → form specific hypothesis → implement → measure again → confirm.
Expected Output
Original: Xus Optimized: Yus Speedup: ZxHints
Hint 1: Profile first — confirm where time is spent. Only then form a hypothesis and optimize.
Hint 2: Verify correctness before measuring speedup. An incorrect optimization is not an optimization.
Build a benchmark that separates warmup runs from measurement runs and quantifies the difference.
import timeit
import time
def compute(n):
"""Some representative work."""
result = 0
for i in range(n):
result += i % 7
return result
def benchmark_cold(fn, n, number=1000):
"""No warmup — first runs included."""
times = timeit.repeat(lambda: fn(n), repeat=5, number=number)
return min(t / number * 1e6 for t in times)
def benchmark_warm(fn, n, warmup=5, number=1000):
"""Warmup runs discarded before measurement."""
for _ in range(warmup):
fn(n)
times = timeit.repeat(lambda: fn(n), repeat=5, number=number)
return min(t / number * 1e6 for t in times)
# Compare cold vs warm and print results
# Final line format: "warmup matters: True/False"
Solution
import timeit
def compute(n):
result = 0
for i in range(n):
result += i % 7
return result
def benchmark_cold(fn, n, number=1000):
times = timeit.repeat(lambda: fn(n), repeat=5, number=number)
return min(t / number * 1e6 for t in times)
def benchmark_warm(fn, n, warmup=5, number=1000):
for _ in range(warmup):
fn(n)
times = timeit.repeat(lambda: fn(n), repeat=5, number=number)
return min(t / number * 1e6 for t in times)
cold = benchmark_cold(compute, 10000)
warm = benchmark_warm(compute, 10000)
diff_pct = abs(cold - warm) / warm * 100
matters = diff_pct > 5.0
print(f"cold run: {cold:.2f}us warm run: {warm:.2f}us diff: {diff_pct:.1f}%")
print(f"warmup matters: {matters}")
Warmup effects in Python:
- CPython caches
.pycbytecode on disk — second import is faster. - OS page cache: first access brings pages into RAM; subsequent accesses hit cache.
- On PyPy/Graal, JIT compilation happens during warmup — cold runs are 10–100× slower.
- Always warm up before measuring latency-sensitive code. Discard the first 3–5 runs.
- Use
timeit.repeat(repeat=5)and takemin()— notmean()— as your performance target.
Expected Output
cold run: Xus warm run: Yus warmup matters: True/FalseHints
Hint 1: Run warmup iterations before your timed measurements and discard them.
Hint 2: First-call overhead includes bytecode compilation cache misses and interpreter setup.
Use tracemalloc to compare the peak memory allocated by a list of 10,000 integers vs a dict mapping each integer to its square.
import tracemalloc
def build_list(n):
return [i for i in range(n)]
def build_dict(n):
return {i: i * i for i in range(n)}
N = 10000
# Measure peak allocation for build_list(N) and build_dict(N)
# Print:
# "list allocated: X KB"
# "dict allocated: Y KB"
# "larger structure: list" or "larger structure: dict"
Solution
import tracemalloc
def build_list(n):
return [i for i in range(n)]
def build_dict(n):
return {i: i * i for i in range(n)}
N = 10000
tracemalloc.start()
build_list(N)
_, list_peak = tracemalloc.get_traced_memory()
tracemalloc.clear_traces()
build_dict(N)
_, dict_peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
list_kb = list_peak / 1024
dict_kb = dict_peak / 1024
larger = "dict" if dict_peak > list_peak else "list"
print(f"list allocated: {list_kb:.1f} KB")
print(f"dict allocated: {dict_kb:.1f} KB")
print(f"larger structure: {larger}")
tracemalloc usage:
tracemalloc.get_traced_memory()returns(current, peak)in bytes sincetracemalloc.start().tracemalloc.clear_traces()resets counters — use between measurements.- A
listof N ints uses ~8 bytes per pointer + object overhead; adictuses ~50–70 bytes per entry (hash table overhead). - Dicts are typically 3–5× larger than lists for the same number of integers.
Expected Output
list allocated: X KB\ndict allocated: Y KB\nlarger structure: list/dictHints
Hint 1: Use tracemalloc.start() / tracemalloc.stop() and tracemalloc.get_traced_memory() to measure peak allocation.
Hint 2: tracemalloc.clear_traces() between measurements resets the counters.
Implement amdahls_speedup(serial_fraction, num_processors) and use it to show why small serial fractions are critical.
def amdahls_speedup(serial_fraction, num_processors):
"""Return theoretical speedup per Amdahl's law."""
pass
# Test with serial_fraction=0.2 at N=1,2,4,8,16,32,infinity
# Print: "Serial fraction 0.2: max speedup = 5.0x"
# Test with serial_fraction=0.05
# Print: "Serial fraction 0.05: max speedup = 20.0x"
Solution
def amdahls_speedup(serial_fraction, num_processors):
if num_processors == float("inf"):
return 1.0 / serial_fraction
parallel_fraction = 1.0 - serial_fraction
return 1.0 / (serial_fraction + parallel_fraction / num_processors)
for serial in [0.2, 0.05]:
processors = [1, 2, 4, 8, 16, 32]
speedups = [amdahls_speedup(serial, n) for n in processors]
max_speedup = amdahls_speedup(serial, float("inf"))
print(f"\nSerial fraction {serial}:")
for n, s in zip(processors, speedups):
print(f" N={n:2d}: {s:.2f}x")
print(f"Serial fraction {serial}: max speedup = {max_speedup:.1f}x")
Amdahl's Law implications:
- If 20% of a program is serial (cannot be parallelized), the maximum possible speedup is 5× — even with infinite cores.
- Reducing the serial fraction from 20% to 5% raises the ceiling from 5× to 20×.
- Profiling's job is to find and reduce the serial bottleneck, not just add more cores.
- Corollary: adding hardware is often cheaper than hiring engineers — but Amdahl sets the hard ceiling.
Expected Output
Serial fraction 0.2: max speedup = 5.0x\nSerial fraction 0.05: max speedup = 20.0xHints
Hint 1: Amdahl's law: S = 1 / (F_serial + (1 - F_serial) / N) where F_serial is the fraction that cannot be parallelized and N is the number of processors.
Hint 2: As N approaches infinity, max speedup = 1 / F_serial.
Hard
Build a production-quality benchmark harness: warmup, GC disabled during measurement, statistical reporting.
import timeit
import gc
import statistics
def benchmark(fn, warmup=3, repeat=7, number=1000):
"""
Production benchmark:
1. Warmup (not measured)
2. GC disabled during timed section
3. Return dict with min_us, mean_us, stdev_us, samples
"""
pass
def target():
return sorted([3, 1, 4, 1, 5, 9, 2, 6] * 100)
result = benchmark(target)
print(f"min={result['min_us']:.2f}us mean={result['mean_us']:.2f}us stdev={result['stdev_us']:.3f}us n={result['samples']}")
Solution
import timeit
import gc
import statistics
def benchmark(fn, warmup=3, repeat=7, number=1000):
for _ in range(warmup):
fn()
gc.disable()
try:
times = timeit.repeat(fn, repeat=repeat, number=number)
finally:
gc.enable()
per_call_us = [t / number * 1e6 for t in times]
return {
"min_us": min(per_call_us),
"mean_us": statistics.mean(per_call_us),
"stdev_us": statistics.stdev(per_call_us),
"samples": len(per_call_us),
}
def target():
return sorted([3, 1, 4, 1, 5, 9, 2, 6] * 100)
result = benchmark(target)
print(f"min={result['min_us']:.2f}us mean={result['mean_us']:.2f}us stdev={result['stdev_us']:.3f}us n={result['samples']}")
Production benchmark best practices:
- GC disabled during measurement prevents unpredictable GC pauses from inflating individual samples. Always re-enable in
finally. - Warmup: populates OS page cache, Python bytecode cache, and JIT (on PyPy/GraalPy). Warmup samples are discarded.
- Report
minas your performance target — it represents the cleanest execution with least interference. stdevreveals consistency: high stdev means results are noisy (OS scheduling, thermal throttling).- Additional best practices for critical benchmarks: pin the process to a CPU core (
taskset -c 0on Linux), disable CPU frequency scaling.
Expected Output
min=Xus mean=Yus stdev=Zus n=7Hints
Hint 1: Disable GC during measurement with gc.disable() and always re-enable in a finally block.
Hint 2: Use timeit.repeat(repeat=7) for 7 independent samples. Report min (best case) and stdev (noise level).
Profile find_duplicates_naive (O(n^2) list search) vs find_duplicates_fast (O(n) set lookup). Quantify the speedup and explain the profiling evidence.
import timeit
import cProfile
import io
import pstats
def find_duplicates_naive(items):
seen = []
duplicates = []
for item in items:
if item in seen:
duplicates.append(item)
else:
seen.append(item)
return duplicates
def find_duplicates_fast(items):
seen = set()
duplicates = []
for item in items:
if item in seen:
duplicates.append(item)
else:
seen.add(item)
return duplicates
data = list(range(2000)) + list(range(1000))
# 1. Assert both produce the same result
# 2. Time both at number=200, report ms per call
# 3. Print speedup
# Output format:
# "O(n^2): Xms"
# "O(n): Yms"
# "Speedup: Zx"
Solution
import timeit
def find_duplicates_naive(items):
seen = []
duplicates = []
for item in items:
if item in seen:
duplicates.append(item)
else:
seen.append(item)
return duplicates
def find_duplicates_fast(items):
seen = set()
duplicates = []
for item in items:
if item in seen:
duplicates.append(item)
else:
seen.add(item)
return duplicates
data = list(range(2000)) + list(range(1000))
assert sorted(find_duplicates_naive(data)) == sorted(find_duplicates_fast(data))
n = 200
naive_ms = timeit.timeit(lambda: find_duplicates_naive(data), number=n) / n * 1000
fast_ms = timeit.timeit(lambda: find_duplicates_fast(data), number=n) / n * 1000
print(f"O(n^2): {naive_ms:.2f}ms")
print(f"O(n): {fast_ms:.3f}ms")
print(f"Speedup: {naive_ms/fast_ms:.0f}x")
Data structure impact on algorithmic complexity:
list.__contains__is O(n): scans from the start on every check.set.__contains__is O(1) average: hash-based lookup.- For 3000 items, naive does ~4.5M comparisons; fast does ~3000 hash lookups.
- Profiling
find_duplicates_naivewith cProfile showslist.__contains__dominatingtottime. - Key lesson: data structure choice (list vs set) has more impact than any micro-optimization.
Expected Output
O(n^2): Xms\nO(n): Yms\nSpeedup: ZxHints
Hint 1: Profile both versions with cProfile to see which internal operation dominates tottime.
Hint 2: list.__contains__ is O(n); set.__contains__ is O(1). For 3000 items, the difference is 3000 comparisons vs 1 hash lookup.
Write a PerformanceBudget class that asserts a function finishes within a time budget and raises AssertionError with a clear message when it does not.
import timeit
class PerformanceBudget:
def __init__(self, max_us, name=""):
self.max_us = max_us
self.name = name
def assert_within_budget(self, fn, warmup=2, repeat=5, number=500):
"""
Warm up fn, then measure best time.
Raise AssertionError if best_us > self.max_us.
Return best_us if within budget.
"""
pass
def efficient_sum(n):
return n * (n - 1) // 2
budget = PerformanceBudget(max_us=10.0, name="efficient_sum")
result = budget.assert_within_budget(lambda: efficient_sum(10000))
print(f"Performance within budget: {result:.2f}us")
tight = PerformanceBudget(max_us=0.001, name="artificially_tight")
try:
tight.assert_within_budget(lambda: efficient_sum(10000))
except AssertionError:
print("Regression detected when threshold exceeded")
Solution
import timeit
class PerformanceBudget:
def __init__(self, max_us, name=""):
self.max_us = max_us
self.name = name
def assert_within_budget(self, fn, warmup=2, repeat=5, number=500):
for _ in range(warmup):
fn()
times = timeit.repeat(fn, repeat=repeat, number=number)
best_us = min(t / number * 1e6 for t in times)
if best_us > self.max_us:
raise AssertionError(
f"PERFORMANCE REGRESSION: {self.name} took {best_us:.2f}us "
f"(budget: {self.max_us}us)"
)
return best_us
def efficient_sum(n):
return n * (n - 1) // 2
budget = PerformanceBudget(max_us=10.0, name="efficient_sum")
result = budget.assert_within_budget(lambda: efficient_sum(10000))
print(f"Performance within budget: {result:.2f}us")
tight = PerformanceBudget(max_us=0.001, name="artificially_tight")
try:
tight.assert_within_budget(lambda: efficient_sum(10000))
except AssertionError as e:
print("Regression detected when threshold exceeded")
Performance regression testing in CI:
- Use
min()over repeated samples to get the best-achievable time — minimizes false positives from CI load. - Set budget conservatively: if the function typically takes 2us, set the budget to 10us for headroom.
- Integrate with pytest:
pytest-benchmarkprovides rich CI reporting and baseline tracking. - Store baselines in CI artifacts and compare on each PR to detect regressions automatically.
- Budget tightness should match criticality: hot paths need tighter budgets than infrequent operations.
Expected Output
Performance within budget: Xus\nRegression detected when threshold exceededHints
Hint 1: Define a maximum acceptable time (budget). Raise AssertionError if the function exceeds it.
Hint 2: Use timeit.repeat and take the minimum to avoid flaky failures from OS scheduling noise.
