Skip to main content

Profiling Strategy - Measure Before You Optimize

Before you read another word, predict what happens here:

import time

def process_records(records):
results = []
for record in records:
cleaned = record.strip().lower()
if cleaned not in results:
results.append(cleaned)
return results

data = [f" Record_{i % 500} " for i in range(50_000)]

start = time.perf_counter()
process_records(data)
elapsed = time.perf_counter() - start
print(f"Elapsed: {elapsed:.3f}s")

Most engineers look at this and think "string operations are the bottleneck." They might reach for re.sub or try to optimize the strip().lower() chain. They would be wrong.

The real bottleneck is cleaned not in results - a linear scan through a list that grows to 500 elements, executed 50,000 times. That is O(n * m) where m is the number of unique elements. Replacing results with a set drops execution time by over 95%.

The lesson: your intuition about where time is spent is almost always wrong. This module teaches you to stop guessing.

# Fix: use a set for O(1) membership testing
def process_records_fast(records):
seen = set()
results = []
for record in records:
cleaned = record.strip().lower()
if cleaned not in seen:
seen.add(cleaned)
results.append(cleaned)
return results

# ~0.008s vs ~1.2s - a 150x improvement

What You Will Learn

  • Why Amdahl's Law constrains the maximum possible speedup from any optimization
  • The four-phase profiling workflow: measure, identify, optimize, verify
  • How to use timeit correctly for microbenchmarks (and how to use it incorrectly)
  • When optimization is a waste of engineering time
  • How to set and enforce performance budgets in production systems
  • The statistical traps in benchmarking that produce misleading results

Prerequisites

  • Completed the Python Intermediate course (CPython internals, GIL, reference counting)
  • Familiarity with Big-O complexity analysis
  • Basic understanding of how CPython executes bytecode

Part 1 - Amdahl's Law: The Ceiling on Your Optimization

Amdahl's Law is the single most important concept in performance engineering. It tells you the maximum theoretical speedup you can achieve by optimizing a portion of your program:

S = 1 / ((1 - P) + (P / N))

Where:

  • S = overall speedup
  • P = proportion of execution time that can be improved
  • N = speedup factor for the improved portion
def amdahl_speedup(parallel_fraction: float, speedup_factor: float) -> float:
"""Calculate maximum overall speedup via Amdahl's Law."""
return 1 / ((1 - parallel_fraction) + (parallel_fraction / speedup_factor))

# Scenario: 30% of your program is in a hot function.
# You make that function 10x faster.
print(amdahl_speedup(0.30, 10)) # 1.37 - only 37% overall speedup

# Scenario: 90% of your program is in a hot function.
# You make that function 10x faster.
print(amdahl_speedup(0.90, 10)) # 5.26 - significant overall speedup

# Scenario: 90% of your program is in a hot function.
# You make that function infinitely fast.
print(amdahl_speedup(0.90, float('inf'))) # 10.0 - the theoretical ceiling

The implications are brutal:

Hot path %Speedup factorOverall speedup
10%10x1.10x
30%10x1.37x
50%10x1.82x
80%10x4.00x
90%10x5.26x
95%10x6.90x

:::danger Critical Insight If a function consumes 10% of total execution time, making it infinitely fast gives you at most a 1.11x overall speedup. Before you spend three days rewriting it in C, ask: "Is this where the time actually goes?" :::

Corollary: Optimize the Right Thing

Amdahl's Law has a corollary that engineers often miss: the first optimization changes the profile. After you speed up the 90%-hot-path by 10x, it now consumes only ~47% of total execution time. The "other code" that was 10% is now ~53%. Your next optimization target has shifted.

def show_profile_shift():
"""Demonstrate how optimization shifts the profile."""
total_before = 100 # arbitrary units

hot_path = 90 # 90% of total
other = 10 # 10% of total

# After 10x speedup on hot path
hot_path_after = hot_path / 10 # 9 units
total_after = hot_path_after + other # 19 units

print(f"Before: hot_path={hot_path}%, other={other}%")
print(f"After: hot_path={hot_path_after/total_after*100:.1f}%, "
f"other={other/total_after*100:.1f}%")
# Before: hot_path=90%, other=10%
# After: hot_path=47.4%, other=52.6%

show_profile_shift()

This means you must re-profile after every optimization. The profile you measured before the change is now invalid.

Part 2 - The Profiling Workflow

Performance engineering is not ad-hoc. It follows a disciplined, repeatable workflow.

Step 1: Establish a Baseline

You cannot improve what you have not measured. A baseline must be:

  • Reproducible: same machine, same data, same conditions
  • Representative: uses realistic workloads, not toy inputs
  • Recorded: written down with timestamp, hardware specs, and commit hash
import time
import statistics
import platform
import sys

def benchmark_baseline(func, args, n_runs=10):
"""Establish a reproducible performance baseline."""
times = []
for _ in range(n_runs):
start = time.perf_counter()
func(*args)
elapsed = time.perf_counter() - start
times.append(elapsed)

result = {
"function": func.__name__,
"n_runs": n_runs,
"mean_s": statistics.mean(times),
"median_s": statistics.median(times),
"stdev_s": statistics.stdev(times) if len(times) > 1 else 0,
"min_s": min(times),
"max_s": max(times),
"python": sys.version,
"platform": platform.platform(),
}

print(f"Baseline for {func.__name__}:")
print(f" Mean: {result['mean_s']:.6f}s")
print(f" Median: {result['median_s']:.6f}s")
print(f" Stdev: {result['stdev_s']:.6f}s")
print(f" Range: [{result['min_s']:.6f}s, {result['max_s']:.6f}s]")
return result

:::tip Always Use Median, Not Mean A single GC pause or OS scheduler interruption can inflate the mean by 10x. The median is robust to outliers. Report both, but make decisions based on the median. :::

Step 2: Identify Hotspots

Use profiling tools (covered in the next two lessons) to find where time is actually spent. The key question is not "what is slow?" but "what consumes the most cumulative time?"

Step 3: Form a Hypothesis

Before writing any code, articulate why the hotspot is slow. Common hypotheses:

  • "This function is called 10 million times with the same arguments - we can cache results"
  • "This loop does a linear scan through a list - we can use a set or dict"
  • "This allocates a new list on every call - we can reuse a buffer"
  • "This does I/O synchronously - we can batch or parallelize"

Step 4: Apply a Targeted Fix

Change one thing at a time. If you change the algorithm and the data structure simultaneously, you cannot attribute the improvement.

Step 5: Verify Against Baseline

Re-run the exact same benchmark. Compare median times. Calculate the actual speedup factor.

def compare_to_baseline(baseline, new_result):
"""Compare optimized result to baseline."""
speedup = baseline["median_s"] / new_result["median_s"]
print(f"\nSpeedup: {speedup:.2f}x")
print(f" Baseline median: {baseline['median_s']:.6f}s")
print(f" Optimized median: {new_result['median_s']:.6f}s")
if speedup < 1.05:
print(" WARNING: Improvement is within noise margin")
return speedup

Part 3 - Benchmarking with timeit

The timeit module is Python's built-in microbenchmarking tool. It handles the tricky parts: disabling garbage collection during timing, running multiple iterations, and providing statistical summaries.

Basic Usage

import timeit

# Timing a simple expression
t = timeit.timeit('"-".join(str(n) for n in range(100))', number=10_000)
print(f"Total: {t:.4f}s, Per call: {t/10_000*1e6:.2f}us")
# Total: 0.2814s, Per call: 28.14us

# Timing with setup code
t = timeit.timeit(
'sorted(data)',
setup='import random; data = [random.randint(0, 1000) for _ in range(1000)]',
number=1_000,
)
print(f"Per call: {t/1_000*1e6:.2f}us")

The Right Way: timeit.repeat

A single timeit run can be misleading. Use repeat and take the minimum:

times = timeit.repeat(
'sum(range(1000))',
number=10_000,
repeat=5,
)
print(f"Best of 5: {min(times)/10_000*1e6:.2f}us")
# Best of 5: 12.34us

:::note Why Minimum, Not Mean? For microbenchmarks, the minimum time is the most reproducible measurement. Higher times reflect noise (OS scheduling, GC, cache misses from other processes). The minimum represents the "true" cost of the operation when everything goes right. For macro-benchmarks of full systems, use the median instead. :::

Common timeit Mistakes

Mistake 1: Benchmarking code that gets optimized away

# BAD: Python might optimize this
timeit.timeit('1 + 1', number=1_000_000) # Measures nothing useful

# BETTER: Force a side effect
timeit.timeit('x = 1 + 1', number=1_000_000)

Mistake 2: Not accounting for setup cost

# BAD: data creation is included in timing
timeit.timeit('[x**2 for x in range(10000)]', number=100)

# GOOD: separate setup from the operation being measured
timeit.timeit(
'[x**2 for x in data]',
setup='data = list(range(10000))',
number=100,
)

Mistake 3: Too few iterations for fast operations

# BAD: 100 iterations of a nanosecond operation
t = timeit.timeit('len([])', number=100)
print(f"Per call: {t/100*1e9:.1f}ns")
# Heavily influenced by timer overhead

# GOOD: enough iterations to dwarf timer overhead
t = timeit.timeit('len([])', number=10_000_000)
print(f"Per call: {t/10_000_000*1e9:.1f}ns")

Comparing Alternatives Properly

def benchmark_comparison():
"""Compare list comprehension vs map vs generator for squaring."""
setup = 'data = list(range(1000))'
approaches = {
'list_comp': '[x**2 for x in data]',
'map_list': 'list(map(lambda x: x**2, data))',
'for_loop': '''
result = []
for x in data:
result.append(x**2)
''',
}

for name, stmt in approaches.items():
times = timeit.repeat(stmt, setup=setup, number=5_000, repeat=5)
best = min(times) / 5_000 * 1e6
print(f"{name:15s}: {best:8.2f} us")

benchmark_comparison()
# list_comp : 41.23 us
# map_list : 62.87 us
# for_loop : 73.45 us

Part 4 - When NOT to Optimize

This is the most important section in this lesson. Knowing when to optimize is valuable; knowing when not to optimize is priceless.

Knuth's Rule

"Premature optimization is the root of all evil" - Donald Knuth, 1974

The full quote is more nuanced:

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

The key insight: 97% of optimizations are not worth the added complexity.

The Optimization Decision Framework

Real Examples of When NOT to Optimize

Example 1: Startup code that runs once

# This runs once at import time. Who cares if it takes 5ms vs 0.5ms?
import json
from pathlib import Path

config = json.loads(Path("config.json").read_text())

Example 2: Code dominated by I/O

import time

def fetch_and_process(urls):
"""Network I/O dominates - optimizing processing is pointless."""
results = []
for url in urls:
# This takes ~200ms per request
response = make_http_request(url) # I/O bound

# This takes ~0.1ms - optimizing it saves nothing
processed = parse_response(response)
results.append(processed)
return results

# The right optimization: use asyncio or concurrent.futures, not faster parsing

Example 3: Code that trades readability for marginal speed

# READABLE - keep this
users = [user for user in all_users if user.is_active and user.age >= 18]

# "OPTIMIZED" - 3% faster, impossible to maintain
users = list(filter(lambda u: u.is_active.__and__(u.age.__ge__(18)), all_users))
# Don't do this.

:::danger The Real Cost of Optimization Every optimization adds complexity. Complexity causes bugs. Bugs cause downtime. Downtime costs far more than the CPU cycles you saved. Optimize only when measurement proves you must. :::

Part 5 - Performance Budgets

A performance budget is a quantitative threshold that your system must meet. It transforms vague goals ("make it faster") into testable requirements.

Defining Performance Budgets

from dataclasses import dataclass
from enum import Enum

class BudgetSeverity(Enum):
WARNING = "warning" # Log and alert
ERROR = "error" # Fail the test / deploy gate
CRITICAL = "critical" # Page on-call

@dataclass
class PerformanceBudget:
name: str
p50_ms: float # Median response time
p99_ms: float # 99th percentile
max_memory_mb: float # Peak memory usage
severity: BudgetSeverity

def check(self, actual_p50: float, actual_p99: float, actual_memory: float):
violations = []
if actual_p50 > self.p50_ms:
violations.append(
f"p50 {actual_p50:.1f}ms > budget {self.p50_ms:.1f}ms"
)
if actual_p99 > self.p99_ms:
violations.append(
f"p99 {actual_p99:.1f}ms > budget {self.p99_ms:.1f}ms"
)
if actual_memory > self.max_memory_mb:
violations.append(
f"memory {actual_memory:.1f}MB > budget {self.max_memory_mb:.1f}MB"
)
return violations

# Define budgets for your API endpoints
api_budgets = {
"/api/courses": PerformanceBudget(
name="List Courses",
p50_ms=50,
p99_ms=200,
max_memory_mb=100,
severity=BudgetSeverity.ERROR,
),
"/api/certificates/generate": PerformanceBudget(
name="Generate Certificate",
p50_ms=500,
p99_ms=2000,
max_memory_mb=256,
severity=BudgetSeverity.WARNING,
),
}

Enforcing Budgets in CI

import json
import sys

def check_budgets(benchmark_results_file: str, budgets: dict) -> bool:
"""CI gate: fail the build if any ERROR-level budget is violated."""
with open(benchmark_results_file) as f:
results = json.load(f)

any_failures = False
for endpoint, budget in budgets.items():
actual = results.get(endpoint, {})
violations = budget.check(
actual.get("p50_ms", 0),
actual.get("p99_ms", 0),
actual.get("memory_mb", 0),
)
if violations:
severity = budget.severity.value.upper()
print(f"[{severity}] {budget.name}:")
for v in violations:
print(f" - {v}")
if budget.severity == BudgetSeverity.ERROR:
any_failures = True

if any_failures:
print("\nBuild FAILED: performance budget violations detected.")
sys.exit(1)
else:
print("\nAll performance budgets met.")

# In CI: python check_perf.py benchmark_results.json

Tracking Budgets Over Time

The most valuable performance data is trend data. A single benchmark tells you nothing; a graph of benchmarks across 100 commits tells you exactly when performance regressed and which commit caused it.

import sqlite3
from datetime import datetime

def record_benchmark(db_path: str, commit: str, endpoint: str,
p50: float, p99: float, memory: float):
"""Record benchmark results for trend analysis."""
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS benchmarks (
timestamp TEXT, commit_hash TEXT, endpoint TEXT,
p50_ms REAL, p99_ms REAL, memory_mb REAL
)
""")
conn.execute(
"INSERT INTO benchmarks VALUES (?, ?, ?, ?, ?, ?)",
(datetime.utcnow().isoformat(), commit, endpoint, p50, p99, memory),
)
conn.commit()
conn.close()

def detect_regression(db_path: str, endpoint: str,
current_p50: float, threshold: float = 0.10):
"""Alert if p50 regressed more than threshold vs. last 10 runs."""
conn = sqlite3.connect(db_path)
rows = conn.execute(
"SELECT p50_ms FROM benchmarks WHERE endpoint = ? "
"ORDER BY timestamp DESC LIMIT 10",
(endpoint,),
).fetchall()
conn.close()

if not rows:
return False

historical_median = sorted(r[0] for r in rows)[len(rows) // 2]
regression = (current_p50 - historical_median) / historical_median

if regression > threshold:
print(f"REGRESSION: {endpoint} p50 increased by {regression:.1%}")
print(f" Historical median: {historical_median:.1f}ms")
print(f" Current: {current_p50:.1f}ms")
return True
return False

Part 6 - Statistical Pitfalls in Benchmarking

Benchmarking is experimental science. If you do not control for variables, your results are noise.

Pitfall 1: JIT Warmup and Caching

import timeit

def demonstrate_warmup():
"""First calls are slower due to cache misses and lazy initialization."""
setup = '''
import json
data = '{"users": [' + ','.join(f'{{"id": {i}}}' for i in range(1000)) + ']}'
'''
# Cold run
cold = timeit.timeit('json.loads(data)', setup=setup, number=1)

# Warm run (caches are hot, branches predicted)
warm = timeit.timeit('json.loads(data)', setup=setup, number=1000) / 1000

print(f"Cold: {cold*1e6:.1f}us")
print(f"Warm: {warm*1e6:.1f}us")
# Cold is typically 2-5x slower than warm

demonstrate_warmup()

Pitfall 2: Garbage Collection Pauses

import gc
import timeit

# timeit disables GC by default - this is intentional for microbenchmarks
# but deceptive for real-world performance expectations

# For realistic benchmarks, re-enable GC:
t_no_gc = timeit.timeit(
'[list(range(100)) for _ in range(1000)]',
number=100,
)

t_with_gc = timeit.timeit(
'[list(range(100)) for _ in range(1000)]',
number=100,
timer=time.perf_counter, # default timer
)
# Note: timeit automatically disables GC. For realistic measurement:
gc.enable()
# ... run your benchmark manually

Pitfall 3: System Noise

def robust_benchmark(func, args=(), n_iterations=1000, n_trials=10):
"""
Robust benchmarking that accounts for system noise.

Returns the minimum of n_trials runs, each consisting of n_iterations.
Also reports coefficient of variation to flag noisy measurements.
"""
import time
import statistics

trial_times = []
for _ in range(n_trials):
start = time.perf_counter()
for _ in range(n_iterations):
func(*args)
elapsed = time.perf_counter() - start
trial_times.append(elapsed / n_iterations)

mean = statistics.mean(trial_times)
stdev = statistics.stdev(trial_times) if len(trial_times) > 1 else 0
cv = stdev / mean if mean > 0 else 0

result = {
"min": min(trial_times),
"median": statistics.median(trial_times),
"mean": mean,
"stdev": stdev,
"cv": cv, # coefficient of variation
}

if cv > 0.05:
print(f"WARNING: CV={cv:.2%} - results may be unreliable. "
"Close other applications and retry.")

return result

:::tip Reducing System Noise

  1. Close all unnecessary applications
  2. Disable Turbo Boost (frequency scaling adds variance)
  3. Pin the process to a single CPU core: taskset -c 0 python bench.py
  4. Run benchmarks on dedicated hardware, not shared CI runners
  5. Take enough samples - at least 10 trials, preferably 30+ :::

Key Takeaways

  • Amdahl's Law is the ceiling: if a function is 10% of execution time, making it infinitely fast yields at most 1.11x overall speedup. Always profile to find the true hotspot.
  • The profiling workflow is non-negotiable: measure, identify, hypothesize, optimize, verify. Skipping any step leads to wasted effort.
  • Use timeit.repeat correctly: take the minimum for microbenchmarks, use median for macro-benchmarks, and always separate setup from the timed code.
  • 97% of optimizations are not worth doing: if there is no measured performance problem, there is no optimization to make.
  • Performance budgets transform goals into tests: define p50, p99, and memory limits per endpoint and enforce them in CI.
  • Re-profile after every change: the profile shifts after each optimization. Yesterday's hotspot may be today's non-issue.
  • Benchmarking is experimental science: control for GC pauses, warmup effects, and system noise, or your results mean nothing.

Graded Practice Challenges

Level 1 - Predict the Output

Question 1: According to Amdahl's Law, a function consumes 60% of your program's execution time. You rewrite it and achieve a 5x speedup on that function. What is the overall speedup?

Answer

S = 1 / ((1 - 0.60) + (0.60 / 5)) = 1 / (0.40 + 0.12) = 1 / 0.52 ≈ 1.92x

The overall speedup is approximately 1.92x - not the 5x you might hope for. 40% of the program is unaffected.

Question 2: What does this timeit call measure?

t = timeit.timeit(
'sorted(data)',
setup='import random; data = random.sample(range(10000), 10000)',
number=5000,
)
Answer

It measures sorting an already-sorted list 4,999 times and sorting a random list once. The setup runs once before all iterations. After the first sorted(data) call, data itself is unchanged (since sorted returns a new list), so this is fine - it sorts the same random list 5,000 times. However, if you used data.sort() instead, the list would be sorted in-place after the first call, and you would be benchmarking sorting an already-sorted list 4,999 times.

Question 3: You have two implementations. Benchmark A reports min=1.2ms, mean=1.8ms, median=1.3ms. Benchmark B reports min=1.4ms, mean=1.5ms, median=1.45ms. Which is faster for a microbenchmark comparison?

Answer

For microbenchmarks, we compare the minimum: A is faster (1.2ms vs 1.4ms). The minimum represents the best-case execution without OS noise.

However, note that A has high variance (mean is 50% above min, indicating noisy measurement). If this were a macro-benchmark of a full system, you would compare medians: A (1.3ms) vs B (1.45ms) - A still wins but by a smaller margin. You should also investigate why A has such high variance.

Level 2 - Debug Challenge

The following benchmark code produces misleading results. Identify and fix all the problems:

import timeit
import time

def benchmark_serializers():
results = {}

# Benchmark JSON
json_time = timeit.timeit(
'''
import json
data = {"users": [{"id": i, "name": f"user_{i}"} for i in range(100)]}
json.dumps(data)
''',
number=100,
)
results["json"] = json_time

# Benchmark pickle
pickle_time = timeit.timeit(
'''
import pickle
data = {"users": [{"id": i, "name": f"user_{i}"} for i in range(100)]}
pickle.dumps(data)
''',
number=100,
)
results["pickle"] = pickle_time

for name, t in results.items():
print(f"{name}: {t:.4f}s")

benchmark_serializers()
Answer

There are four problems:

  1. Import and data creation are inside the timed code. The benchmark measures import time and data construction, not serialization.
  2. Only one trial. Using timeit.timeit instead of timeit.repeat gives a single noisy sample.
  3. Too few iterations. 100 iterations for a microsecond operation means timer overhead dominates.
  4. No per-call normalization. The reported time is total, not per-call.

Fixed version:

import timeit

def benchmark_serializers():
setup = '''
import json
import pickle
data = {"users": [{"id": i, "name": f"user_{i}"} for i in range(100)]}
'''
n = 10_000
repeats = 5

json_times = timeit.repeat('json.dumps(data)', setup=setup,
number=n, repeat=repeats)
pickle_times = timeit.repeat('pickle.dumps(data)', setup=setup,
number=n, repeat=repeats)

print(f"json: {min(json_times)/n*1e6:.2f} us/call (best of {repeats})")
print(f"pickle: {min(pickle_times)/n*1e6:.2f} us/call (best of {repeats})")

benchmark_serializers()

Level 3 - Design Challenge

Design a performance regression detection system for a FastAPI application with 12 endpoints. Requirements:

  1. Benchmarks must run in CI on every PR
  2. Each endpoint has a p50 and p99 latency budget
  3. Regressions larger than 10% from the rolling median of the last 20 runs should block the PR
  4. The system must handle noisy CI runners (shared hardware)

Describe your architecture, data storage, statistical approach, and how you handle flaky detections. Write the core regression detection function.

Solution Sketch

Architecture:

  • Store benchmark results in a SQLite database committed to the repo (or a shared metrics service)
  • Run benchmarks via pytest-benchmark with --benchmark-json=results.json
  • A CI script parses results, compares to historical data, and posts a PR comment

Handling noise on shared CI:

  • Run each benchmark with rounds=50, warmup_rounds=5
  • Use the trimmed mean (discard top and bottom 10% of samples) instead of raw median
  • Require regression to appear in 2 consecutive runs before blocking (flake filter)

Core detection function:

import statistics

def detect_regression(
historical: list[float], # last 20 trimmed means
current: float, # current trimmed mean
threshold: float = 0.10, # 10% regression threshold
min_samples: int = 5, # minimum historical samples needed
) -> dict:
if len(historical) < min_samples:
return {"status": "insufficient_data", "blocked": False}

baseline = statistics.median(historical)
change = (current - baseline) / baseline

# Use IQR to determine if current is an outlier
q1 = statistics.quantiles(historical, n=4)[0]
q3 = statistics.quantiles(historical, n=4)[2]
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

result = {
"baseline_median": baseline,
"current": current,
"change_pct": change * 100,
"is_outlier": current > upper_fence,
"blocked": change > threshold and current > upper_fence,
"status": "regression" if change > threshold else "ok",
}
return result

The key insight is combining percentage-based thresholds with statistical outlier detection (IQR fence). This prevents false positives when all historical values happen to be unusually fast.

What's Next

Now that you understand the discipline of profiling, the next lesson dives into the tools. In cProfile and pstats, you will learn to profile real Python programs at the function level - measuring exactly how many times each function is called and how much cumulative time it consumes.

© 2026 EngineersOfAI. All rights reserved.