Skip to main content

Time Complexity of Python Data Structures - Big-O in Practice

Reading time: ~20 minutes | Level: Foundation → Engineering

Here is a question that catches most Python developers off guard.

import time

n = 100_000

# Method A: insert at front of list
data_a = []
start = time.perf_counter()
for i in range(n):
data_a.insert(0, i)
elapsed_a = time.perf_counter() - start

# Method B: appendleft on deque
from collections import deque
data_b = deque()
start = time.perf_counter()
for i in range(n):
data_b.appendleft(i)
elapsed_b = time.perf_counter() - start

print(f"list.insert(0): {elapsed_a:.3f}s")
print(f"deque.appendleft: {elapsed_b:.4f}s")
print(f"Ratio: {elapsed_a / elapsed_b:.0f}x slower")

Actual output on a modern machine:

list.insert(0): 1.847s
deque.appendleft: 0.0071s
Ratio: 260x slower

Both produce the same result. The performance difference is 260x. The cause is purely data structure choice - and it comes down to Big-O.

What You Will Learn

  • The Big-O complexity classes that matter for engineering: O(1), O(log n), O(n), O(n log n), O(n²)
  • Exact time complexity for every operation on list, dict, set, tuple, string, and deque
  • Why list.insert(0, x) is O(n) - the memory shift required in C
  • Why deque.appendleft() is O(1) - the doubly-linked structure
  • Amortized analysis: why list.append() is O(1) despite occasional O(n) resizes
  • How hash collisions affect dict and set worst-case behavior
  • How to profile code with timeit to empirically verify complexity
  • A decision framework for choosing the right data structure based on workload

Prerequisites

  • Python lists, dicts, sets, tuples (Module 4, Lessons 1–8)
  • Basic familiarity with Python functions and loops
  • Python Lists Internals (Lesson 1 of this module) for memory model background

Part 1 - Big-O Notation: The Engineer's Vocabulary

Big-O notation describes how runtime grows as input size grows. It is not about seconds - it is about scaling behavior.

Complexityn=10n=1,000n=1,000,000n=1,000,000,000
O(1)1111
O(log n)3102030
O(n)101,0001,000,0001,000,000,000
O(n log n)3310,00020,000,00030,000,000,000
O(n²)1001,000,00010¹²10¹⁸

The jump from O(n) to O(n²) at n=1,000,000 is the difference between completing in 1 second and taking 11 days.

:::note What Big-O Ignores Big-O drops constants and lower-order terms. O(100n) and O(n) are both written as O(n). This is intentional - constants depend on hardware, and the growth curve dominates at scale. When constants matter (e.g., n is always small), profiling beats Big-O analysis. :::

The Complexity Classes You Will Encounter Daily

O(1) - Constant: Runtime does not change regardless of input size.

# All O(1)
d = {"key": "value"}
x = d["key"] # Hash lookup - O(1) average
lst = [1, 2, 3, 4, 5]
y = lst[2] # Pointer arithmetic - always O(1)
n = len(lst) # Stored in struct - O(1)

O(log n) - Logarithmic: Runtime grows by 1 for every doubling of input.

import bisect

# Binary search on sorted list - O(log n)
data = list(range(1_000_000))
idx = bisect.bisect_left(data, 500_000) # ~20 comparisons for 1M elements

O(n) - Linear: Runtime grows proportionally with input size.

# All O(n) - must touch every element
total = sum([1, 2, 3, ..., n]) # O(n)
found = 42 in my_list # O(n) - scan until found or end

O(n log n) - Linearithmic: The best achievable for comparison-based sorting.

data = [5, 3, 8, 1, 9, 2]
data.sort() # O(n log n) - Timsort

O(n²) - Quadratic: Every element compared to every other. Dangerous at scale.

# O(n²) - bubble sort
def bubble_sort(arr):
n = len(arr)
for i in range(n): # O(n)
for j in range(n - i): # O(n) inner
if arr[j] > arr[j+1]:
arr[j], arr[j+1] = arr[j+1], arr[j]

Part 2 - List Complexity: The Hidden Traps

The Memory Model Behind List Operations

A Python list stores an array of pointers in contiguous memory. This single fact determines almost all list complexity.

List internal array (contiguous pointers, capacity=8, ob_size=5):

Index: 0 1 2 3 4 [5] [6] [7]
ob_item: ptr ptr ptr ptr ptr - - -
↑ ob_size = 5

O(1) index access: lst[i] = ob_item + i * 8 (pointer arithmetic, no scan needed).

O(n) insert at front: To insert at position 0, every existing element must shift right by one position.

All 4 existing pointers shift right by one position → O(n) cost.

Complete List Complexity Table

OperationComplexityExplanation
lst[i] (index)O(1)Pointer arithmetic: ob_item + i*8
lst[-1] (last)O(1)ob_item + (ob_size-1)*8
lst.append(x)O(1)*Amortized - writes next pointer
lst.pop() (from end)O(1)*Amortized - decrements ob_size
lst.pop(i) (middle)O(n)Shifts elements left from i+1
lst.pop(0) (from front)O(n)Shifts ALL elements left
lst.insert(0, x)O(n)Shifts ALL elements right
lst.insert(i, x)O(n-i)Shifts elements right from i
x in lst (membership)O(n)Linear scan - no hash available
lst.index(x)O(n)Linear scan
lst.remove(x)O(n)Scan to find + shift to fill gap
lst.sort()O(n log n)Timsort - stable
sorted(lst)O(n log n)Timsort - returns new list
lst.reverse()O(n)Swaps elements front-to-back
len(lst)O(1)Stored in ob_size field
min(lst) / max(lst)O(n)Must check every element
lst + other (concat)O(n + m)Creates new list of size n+m
lst * k (repeat)O(n*k)Creates new list of size n*k
lst[i:j] (slice)O(j - i)Creates new list, copies pointers
lst.extend(iterable)O(k)*k = len(iterable), amortized
lst.count(x)O(n)Scans entire list
lst.copy()O(n)Copies all pointers
lst.clear()O(n)Decrements ref count for each element

*Amortized

Amortized Analysis: Why append() Is O(1)

The key word is amortized. Individual appends can cost O(n) when a resize occurs. But resizes happen at exponentially spaced intervals, so the average across all appends is O(1).

Growth pattern (capacity jumps):
n=0 → capacity=0
n=1 → capacity=4 (resize: copy 0 elements, cost=0)
n=5 → capacity=8 (resize: copy 4 elements, cost=4)
n=9 → capacity=16 (resize: copy 8 elements, cost=8)
n=17 → capacity=25 (resize: copy 16 elements, cost=16)
n=26 → capacity=35 (resize: copy 25 elements, cost=25)

Resize costs: 0, 4, 8, 16, 25, ... (geometric-ish sequence)
Total resize cost for n appends ≈ O(n) (geometric series sum)
Average cost per append = O(n) / n = O(1)
import sys

lst = []
prev_size = sys.getsizeof(lst)

for i in range(20):
lst.append(i)
curr = sys.getsizeof(lst)
if curr != prev_size:
print(f" Resize after {len(lst)} elements: {prev_size}{curr} bytes")
prev_size = curr

Output:

Resize after 1 elements: 56 → 88 bytes
Resize after 5 elements: 88 → 120 bytes
Resize after 9 elements: 120 → 184 bytes
Resize after 17 elements: 184 → 248 bytes

:::tip Amortized vs Guaranteed O(1) Amortized O(1) means the average is O(1), but any individual call may take O(n) during a resize. In real-time systems where latency spikes are unacceptable (game loops, audio processing), pre-allocate with [None] * n to avoid mid-operation resizes. :::

Part 3 - Dictionary Complexity: Hash Tables Under the Hood

How dict Achieves O(1) Average

Python dicts are hash tables. A hash function converts a key to an integer index into an array. Lookup, insertion, and deletion follow the same pattern:

For dict["username"]:

  1. hash("username") = 8734928374 (CPython's hash function)
  2. index = 8734928374 % table_size = 42
  3. Check bucket[42] - if key matches, return value
  4. If not (collision), probe to next slot
  5. Return value

This is why average case is O(1): steps 1-3 take constant time.

Hash Collisions: The O(n) Worst Case

If many keys hash to the same index, lookup degrades:

Pathological case - all keys hash to the same slot: bucket[42] → "username" → "password" → "email" → "token" → ... - must scan the entire chain: O(n).

In practice, CPython's hash function distributes keys well, and the load factor (ratio of used/total slots) is kept below 2/3. When it exceeds that, the table is rebuilt (O(n), amortized O(1) per insert).

OperationAverage CaseWorst Case (all keys collide)
d[key] (get)O(1)O(n)
d[key] = val (set/insert)O(1)O(n)
del d[key]O(1)O(n)
key in d (membership)O(1)O(n)
d.get(key)O(1)O(n)
d.keys()O(1)O(1) (returns a view)
d.values()O(1)O(1) (returns a view)
d.items()O(1)O(1) (returns a view)
len(d)O(1)O(1) (stored in struct)
d.copy()O(n)O(n)
d.update(other)O(len(other))O(len(other))
dict(zip(k, v)) (create)O(n)O(n)

:::warning Only hashable keys work in dicts Dict keys must be hashable - they need a stable __hash__(). Immutable types (int, str, tuple, frozenset) are hashable. Lists and dicts are NOT hashable and cannot be dict keys. This is why d[[1,2]] = "val" raises TypeError: unhashable type: 'list'. :::

# Demonstrating O(1) vs O(n) membership lookup
import time

n = 5_000_000
data_list = list(range(n))
data_dict = {i: True for i in range(n)}

target = n - 1 # worst case: last element

start = time.perf_counter()
_ = target in data_list
list_time = time.perf_counter() - start

start = time.perf_counter()
_ = target in data_dict
dict_time = time.perf_counter() - start

print(f"List membership (O(n)): {list_time:.4f}s")
print(f"Dict membership (O(1)): {dict_time:.7f}s")
print(f"Dict is {list_time/dict_time:.0f}x faster")

Output (approximate):

List membership (O(n)): 0.0421s
Dict membership (O(1)): 0.0000071s
Dict is 5929x faster

Part 4 - Set Complexity: Dict Without Values

Sets share the same hash table implementation as dicts. The complexity profile is nearly identical.

OperationAverage CaseNotes
s.add(x)O(1)Hash to slot, write
s.discard(x)O(1)Hash to slot, delete (no KeyError)
s.remove(x)O(1)Hash to slot, delete (raises KeyError)
x in s (membership)O(1)Hash lookup - key strength of sets
len(s)O(1)Stored in struct
s | t (union)O(len(s)+len(t))Must hash all elements from both
s & t (intersection)O(min(len(s), len(t)))Iterates smaller, checks larger
s - t (difference)O(len(s))Iterates s, checks membership in t
s ^ t (symmetric diff)O(len(s)+len(t))Union minus intersection
s <= t (issubset)O(len(s))Every element of s must be in t
s.copy()O(n)Copies all hash buckets
set(iterable) (create)O(n)Hash each element
# Set intersection complexity: O(min(m, n))
# CPython iterates the smaller set and checks membership in the larger

s1 = set(range(1_000_000)) # 1M elements
s2 = set(range(500, 1000)) # 500 elements

# Python automatically iterates s2 (smaller) and checks against s1 (larger)
intersection = s1 & s2
# Cost: 500 hash lookups in s1 - O(500), not O(1,000,000)
print(len(intersection)) # 500

Part 5 - Tuple Complexity: Immutable Lists

Tuples share the same contiguous-pointer-array structure as lists, but are immutable. This means no insert, remove, or sort operations.

OperationComplexityNotes
t[i] (index)O(1)Same pointer arithmetic as lists
x in t (membership)O(n)Linear scan - no hash
t.count(x)O(n)Full scan
t.index(x)O(n)Scan until found
len(t)O(1)Stored in struct
t + other (concat)O(n + m)Creates new tuple
t * k (repeat)O(n * k)Creates new tuple
t[i:j] (slice)O(j - i)Creates new tuple

Tuples are slightly more memory-efficient than lists (no allocated field needed since they don't resize).

import sys

lst = [1, 2, 3, 4, 5]
tpl = (1, 2, 3, 4, 5)

print(sys.getsizeof(lst)) # 104 bytes (includes extra capacity)
print(sys.getsizeof(tpl)) # 80 bytes (exact fit, no slack)

Part 6 - String Complexity: Concatenation is the Hidden Trap

Strings in Python are immutable sequences of Unicode characters stored contiguously.

OperationComplexityNotes
s[i] (index)O(1)Character at index i
len(s)O(1)Stored in struct
sub in s (membership)O(n × m)n=len(s), m=len(sub) - substring search
s.find(sub)O(n × m)Same - returns index or -1
s + other (concat)O(n + m)Creates new string - each time!
"".join(parts)O(n)One allocation, one pass
s.split()O(n)Scans entire string
s.replace(old, new)O(n)Scans entire string
s.strip()O(n)Scans from both ends
s.lower() / s.upper()O(n)Transforms every character
s * k (repeat)O(n × k)Creates new string

The String Concatenation Trap: O(n²) in Disguise

# WRONG: O(n²) - each + creates a new string
def build_string_bad(parts):
result = ""
for part in parts:
result = result + part # New string allocated every iteration!
return result

# RIGHT: O(n) - one allocation, one copy
def build_string_good(parts):
return "".join(parts)

# Verify with timing
import time
parts = ["x"] * 100_000

start = time.perf_counter()
_ = build_string_bad(parts)
bad_time = time.perf_counter() - start

start = time.perf_counter()
_ = build_string_good(parts)
good_time = time.perf_counter() - start

print(f"Concatenation (+): {bad_time:.3f}s")
print(f"join(): {good_time:.5f}s")
print(f"join() is {bad_time / good_time:.0f}x faster")

Output (approximate):

Concatenation (+): 1.234s
join(): 0.00312s
join() is 396x faster

Why? Each s = s + part allocates a new string of size len(s) + len(part), copying all previous content. Total bytes copied: 1 + 2 + 3 + ... + n = O(n²). join() measures total length first, allocates once, and copies everything - O(n).

:::danger Never Build Strings with + in a Loop This is one of the most common Python performance bugs. Always use "".join(list_of_strings) for building strings from multiple pieces. Or use an io.StringIO buffer for more complex construction. :::

Part 7 - deque: O(1) at Both Ends

collections.deque (doubly-ended queue) is the right tool when you need O(1) insertions or removals from both ends.

O(1) at both ends - no shifting required.

OperationComplexityNotes
dq.append(x) (right)O(1)Write to right end of current block
dq.appendleft(x) (left)O(1)Write to left end - no shifting
dq.pop() (right)O(1)Remove from right end
dq.popleft() (left)O(1)Remove from left end - O(1), not O(n)!
dq[0] / dq[-1] (ends)O(1)Direct pointer to head/tail blocks
dq[i] (middle)O(n)Must traverse blocks - no direct index
x in dq (membership)O(n)Linear scan
len(dq)O(1)Stored in struct
dq.rotate(k)O(k)Rotate elements - efficient for circular

:::note deque vs list: The Core Tradeoff Use deque when you need O(1) at both ends (queues, sliding windows, BFS). Use list when you need O(1) random index access (lst[i]). deque middle-element access is O(n) - linked block traversal. list middle-element access is O(1) - pointer arithmetic. :::

Part 8 - Profiling with timeit: Verify Your Assumptions

Never trust intuition alone. Use timeit to verify complexity empirically.

import timeit
import random

# Verify that list membership is O(n) and set membership is O(1)
setup = """
import random
n = {n}
data_list = list(range(n))
data_set = set(range(n))
target = n - 1 # worst case: last element
"""

sizes = [1_000, 10_000, 100_000, 1_000_000]
print(f"{'n':>12} {'list (ms)':>12} {'set (ms)':>12} {'ratio':>8}")
print("-" * 50)

for n in sizes:
list_time = timeit.timeit(
"target in data_list",
setup=setup.format(n=n),
number=100
) / 100 * 1000

set_time = timeit.timeit(
"target in data_set",
setup=setup.format(n=n),
number=100
) / 100 * 1000

print(f"{n:>12,} {list_time:>11.4f}ms {set_time:>11.7f}ms {list_time/set_time:>7.0f}x")

Output (approximate):

n list (ms) set (ms) ratio
--------------------------------------------------
1,000 0.0098ms 0.0001200ms 82x
10,000 0.0912ms 0.0001180ms 773x
100,000 0.9153ms 0.0001190ms 7691x
1,000,000 9.1720ms 0.0001200ms 76433x

The list time grows linearly (10x more elements = 10x more time). Set time stays nearly constant. This is O(n) vs O(1) made visible.

# Verify that list.insert(0) is O(n) by timing at different sizes
import timeit

print(f"{'n':>10} {'insert(0) ms':>15} {'append ms':>12} {'ratio':>8}")
print("-" * 50)

for n in [1_000, 10_000, 100_000]:
insert_time = timeit.timeit(
"lst.insert(0, 0)",
setup=f"lst = list(range({n}))",
number=1000
) / 1000 * 1000

append_time = timeit.timeit(
"lst.append(0); lst.pop()",
setup=f"lst = list(range({n}))",
number=1000
) / 1000 * 1000

print(f"{n:>10,} {insert_time:>14.4f}ms {append_time:>11.5f}ms {insert_time/append_time:>7.0f}x")

Output:

n insert(0) ms append ms ratio
--------------------------------------------------
1,000 0.0041ms 0.00018ms 23x
10,000 0.0347ms 0.00018ms 193x
100,000 0.3511ms 0.00019ms 1848x

Every 10x increase in n causes ~10x increase in insert(0) time - confirming O(n). append() time stays flat - confirming O(1) amortized.

Part 9 - Choosing the Right Data Structure

The right data structure is determined by which operations dominate your workload.

Dominant OperationBest ChoiceWhy
Random access by indexlistO(1) pointer arithmetic
Frequent membership testingset or dictO(1) hash lookup
Key-value mappingdictO(1) hash lookup
Preserve insertion order + dedupdict (Python 3.7+)Ordered + O(1) lookup
Front+back insertion/removaldequeO(1) at both ends
Priority-based retrievalheapqO(log n) push/pop
Sorted order, binary searchsorted list + bisectO(log n) search
Immutable sequence (hashable)tupleSmaller, hashable
Unique elements + set mathsetO(1) membership, O(n) ops

Real-World Example: API Token Validation

# Scenario: validate incoming API tokens against 100,000 active tokens
# 10,000 requests per second

import time
import random
import string

def generate_tokens(n):
return ["".join(random.choices(string.ascii_letters, k=32)) for _ in range(n)]

active_tokens = generate_tokens(100_000)
incoming_token = active_tokens[99_999] # worst case: last in list

# Option A: List - O(n) per check
tokens_list = active_tokens
start = time.perf_counter()
for _ in range(10_000):
_ = incoming_token in tokens_list
list_time = time.perf_counter() - start

# Option B: Set - O(1) per check
tokens_set = set(active_tokens)
start = time.perf_counter()
for _ in range(10_000):
_ = incoming_token in tokens_set
set_time = time.perf_counter() - start

print(f"List (O(n) each): {list_time:.2f}s for 10,000 checks")
print(f"Set (O(1) each): {set_time:.4f}s for 10,000 checks")
print(f"Set is {list_time / set_time:.0f}x faster")

Output:

List (O(n) each): 8.43s for 10,000 checks
Set (O(1) each): 0.0012s for 10,000 checks
Set is 7025x faster

With a list, 10,000 requests/second × O(n) = 10,000 × 100,000 = 1 billion operations per second needed. Impossible. With a set, 10,000 × O(1) = 10,000 operations. Trivial.

Interview Questions

Q1: What is the difference between O(1) and amortized O(1)?

Answer: O(1) means every single call takes constant time with no exceptions. Amortized O(1) means the average cost per call is O(1) when evaluated over a sequence of operations, but any individual call may take more. list.append() is amortized O(1): most calls write one pointer (O(1)), but occasional resize operations copy the entire array (O(n)). Because resizes happen at exponentially spaced intervals, the total cost over n appends is O(n), making the average O(1). In real-time systems requiring bounded latency, amortized O(1) may not be acceptable - prefer deque or pre-allocated arrays.

Q2: Why is list.pop(0) O(n) while deque.popleft() is O(1)?

Answer: A Python list stores elements in a contiguous block of memory as an array of pointers. Removing the first element requires shifting every remaining element one position to the left to maintain contiguity - O(n) shifts for n elements. A deque is implemented as a doubly-linked list of fixed-size blocks. Removing the leftmost element just updates a pointer to the next block - O(1), no shifting required. For any algorithm that requires frequent front-removal (BFS, sliding window, FIFO queue), always use deque, not list.

Q3: When can dict lookup degrade from O(1) to O(n)?

Answer: Dict lookup is O(1) average because it relies on hash functions distributing keys uniformly across buckets. In the worst case, if all keys produce the same hash value (a hash collision), all keys end up in the same bucket, and lookup becomes a linear scan - O(n). In practice, CPython uses a high-quality hash function and maintains a load factor below 2/3 (expanding the table when it fills up), making true worst-case collisions extremely rare for normal keys. However, in adversarial contexts (e.g., user-controlled input in older Python versions), attackers could craft keys to cause all collisions. Python 3.3+ added hash randomization (PYTHONHASHSEED) to mitigate this.

Q4: What is the time complexity of set intersection s1 & s2, and does it matter which set is larger?

Answer: Set intersection is O(min(len(s1), len(s2))). Python iterates the smaller set and checks each element's membership in the larger set. Since membership in a set is O(1), the total cost is O(smaller set size). This means s1 & s2 where len(s1) = 10 and len(s2) = 10,000,000 takes only ~10 hash lookups - not 10 million. Python automatically chooses the optimal iteration direction, so you don't need to order the operands yourself. For union, both sets must be traversed: O(m + n).

Q5: Explain why building a string with += in a loop is O(n²).

Answer: Python strings are immutable. Every s = s + part must allocate a new string object, copy all characters from s (length grows by 1 each iteration), and copy characters from part. After n iterations, total characters copied = 1 + 2 + 3 + ... + n = n(n+1)/2 = O(n²). The fix is "".join(parts): Python first computes the total length (one O(n) pass), allocates exactly the right buffer, then copies each part in (one more O(n) pass). Total: O(n). For 100,000 parts, this is the difference between 100,000² = 10¹⁰ operations and 200,000 operations - roughly a 50,000x speedup.

Q6: How would you deduplicate a list of 10 million items while preserving insertion order, efficiently?

Answer: The naive approach - checking if item not in seen_list - is O(n²) because in on a list is O(n). The correct approach uses a dict (or set) for O(1) membership tracking while appending to a list to preserve order:

def deduplicate_ordered(items):
seen = set()
result = []
for item in items:
if item not in seen: # O(1) average
seen.add(item) # O(1) average
result.append(item) # O(1) amortized
return result

Total: O(n). In Python 3.7+, dict.fromkeys(items) is an even more concise solution: list(dict.fromkeys(items)). This works because dict preserves insertion order and dict.fromkeys silently ignores duplicate keys, maintaining only the first occurrence.

Practice Challenges

Beginner: Predict the Output

What is the time complexity of this code, and approximately how long would it take for n = 1,000,000?

def process(items):
result = []
for item in items:
if item not in result: # <-- this line
result.append(item)
return result
Solution

Complexity: O(n²)

The line if item not in result performs a linear scan of result on every iteration of the outer loop. In the worst case (all unique items), after processing k items, result has k elements, so the membership check costs O(k). Summed over all n items: O(1 + 2 + ... + n) = O(n²/2) = O(n²).

For n = 1,000,000: approximately 500,000,000,000 (500 billion) operations - at ~100M operations/second, this would take ~1.4 hours.

Fix:

def process(items):
seen = set()
result = []
for item in items:
if item not in seen: # O(1) average
seen.add(item)
result.append(item)
return result

Fixed complexity: O(n). For n = 1,000,000: ~1,000,000 operations - completes in milliseconds.

Intermediate: Choose and Justify

You are building a cache for an API that stores up to 10,000 user session tokens. The operations are:

  1. Check if a token is valid (called 50,000 times/second)
  2. Add a new token when a user logs in
  3. Remove a token when a user logs out
  4. Occasionally list all active tokens (called 1 time/second)

Which data structure should you use for the cache, and why?

Solution

Best choice: set

Analysis by operation:

OperationFrequencysetlistdict
Token validation (in)50,000/sO(1)O(n)O(1)
Add tokenrareO(1)O(1)*O(1)
Remove tokenrareO(1)O(n)O(1)
List all tokens1/sO(n)O(n)O(n)

The dominant operation is validation (50,000/s). list would require 50,000 × O(10,000) = 500 million operations per second just for validation - impossible. set and dict both offer O(1) validation.

Choose set over dict because you only need the token itself (no associated value needed). set uses slightly less memory than dict (no value slots).

class SessionCache:
def __init__(self):
self._tokens = set()

def is_valid(self, token: str) -> bool:
return token in self._tokens # O(1)

def add(self, token: str) -> None:
self._tokens.add(token) # O(1)

def remove(self, token: str) -> None:
self._tokens.discard(token) # O(1), no KeyError

def list_all(self):
return list(self._tokens) # O(n), called rarely

If you also need to track expiry timestamps per token (TTL), upgrade to dict[str, datetime] - same O(1) lookup, but with associated values.

Advanced: Design a Complexity-Aware Data Pipeline

You receive a stream of 50 million log events. Each event is a dict with user_id, action, and timestamp. Design a pipeline that:

  1. Deduplicates events (same user_id + action within 1 second)
  2. Counts unique users per action
  3. Returns the top 5 most common actions

Requirements: O(n) time, O(k) space where k is the number of unique user-action pairs seen in any 1-second window (k << n).

Solution
from collections import defaultdict
import heapq

def process_event_stream(events):
"""
O(n) time: one pass through all events
O(k) space: only stores dedup keys for 1-second window + action counts

Strategy:
- Use a set for dedup within each 1-second bucket (O(1) lookup/insert)
- Use a dict to count unique users per action (O(1) updates)
- Use heapq.nlargest for top-5 (O(n log 5) = O(n) since 5 is constant)
"""
# Dedup: track (user_id, action) pairs seen per second
# Key: timestamp (truncated to second), Value: set of (user_id, action)
seen_per_second = defaultdict(set) # O(k) space per second

# Count unique users per action
action_user_counts = defaultdict(set) # action -> set of user_ids

for event in events: # O(n) iterations
user_id = event["user_id"]
action = event["action"]
second = int(event["timestamp"]) # O(1)

dedup_key = (user_id, action)

# O(1) average: check and add to set
if dedup_key not in seen_per_second[second]:
seen_per_second[second].add(dedup_key) # O(1)
action_user_counts[action].add(user_id) # O(1)

# Evict buckets older than 1 second to bound memory
# (In production, you'd track current max second and evict old ones)

# Convert sets to counts: O(number of unique actions)
action_counts = {
action: len(users)
for action, users in action_user_counts.items()
}

# Top 5: O(a log 5) where a = number of unique actions
# Since 5 is constant, this is O(a) ≈ O(1) relative to n
top_5 = heapq.nlargest(5, action_counts.items(), key=lambda x: x[1])

return top_5


# Example usage
sample_events = [
{"user_id": "u1", "action": "click", "timestamp": 1700000000.1},
{"user_id": "u1", "action": "click", "timestamp": 1700000000.5}, # deduped
{"user_id": "u2", "action": "click", "timestamp": 1700000000.3},
{"user_id": "u1", "action": "purchase", "timestamp": 1700000001.2}, # new second
{"user_id": "u3", "action": "view", "timestamp": 1700000000.9},
]

result = process_event_stream(sample_events)
print(result)
# [('click', 2), ('view', 1), ('purchase', 1)]

Why this is O(n):

  • The main loop runs exactly once per event: O(n)
  • All operations inside the loop (set lookup, set add, dict access) are O(1) average
  • The final heapq.nlargest(5, ...) is O(a) where a = unique action count (tiny vs n)
  • Total: O(n + a) ≈ O(n)

Space complexity:

  • seen_per_second: O(k) where k = unique (user, action) pairs per second
  • action_user_counts: O(a × u) where a = unique actions, u = unique users
  • Both are bounded by problem constants, not n

Quick Reference

Data StructureAccessSearchInsert (end)Insert (front)Delete
listO(1)O(n)O(1)*O(n)O(n)
dequeO(n)O(n)O(1)O(1)O(n)
dictO(1)**O(1)**O(1)**N/AO(1)**
setN/AO(1)**O(1)**N/AO(1)**
tupleO(1)O(n)N/AN/AN/A
strO(1)O(n*m)N/A (immutable)N/AN/A

* Amortized | ** Average case

Operation PatternUse ThisNot This
Frequent x in collectionset or dictlist or tuple
Front + back append/popdequelist
Key-value mappingdictlist of tuples
Build string from parts"".join(parts)s += part in loop
Top-k elementsheapq.nlargest(k, data)sorted(data)[-k:]
Deduplicate + preserve orderdict.fromkeys(data)list with in check
Binary searchbisect on sorted listLinear scan on list

Key Takeaways

  • O(1) vs O(n) is the most impactful complexity boundary in Python: choosing set over list for membership testing can be 1000-100,000x faster at scale
  • Amortized O(1) means the average is O(1), but individual operations can spike - for latency-sensitive systems, pre-allocate or use deque
  • String concatenation with + in a loop is O(n²) - always use "".join(parts) for building strings from multiple pieces
  • list.insert(0, x) and list.pop(0) are O(n) because the entire array must shift; use deque.appendleft() / deque.popleft() for O(1) front operations
  • Dict and set operations are O(1) average, O(n) worst case due to hash collisions - CPython's hash randomization makes worst-case rare in practice
  • Set intersection is O(min(m, n)) - Python automatically iterates the smaller set, so operand order doesn't matter
  • Verify with timeit - empirical profiling confirms theoretical complexity and reveals constant-factor differences that Big-O hides
© 2026 EngineersOfAI. All rights reserved.