Time Complexity of Python Data Structures - Big-O in Practice

Reading time: ~20 minutes | Level: Foundation → Engineering

Here is a question that catches most Python developers off guard.

import time

n = 100_000

# Method A: insert at front of list
data_a = []
start = time.perf_counter()
for i in range(n):
    data_a.insert(0, i)
elapsed_a = time.perf_counter() - start

# Method B: appendleft on deque
from collections import deque
data_b = deque()
start = time.perf_counter()
for i in range(n):
    data_b.appendleft(i)
elapsed_b = time.perf_counter() - start

print(f"list.insert(0): {elapsed_a:.3f}s")
print(f"deque.appendleft: {elapsed_b:.4f}s")
print(f"Ratio: {elapsed_a / elapsed_b:.0f}x slower")

Actual output on a modern machine:

list.insert(0): 1.847s
deque.appendleft: 0.0071s
Ratio: 260x slower

Both produce the same result. The performance difference is 260x. The cause is purely data structure choice - and it comes down to Big-O.

What You Will Learn

The Big-O complexity classes that matter for engineering: O(1), O(log n), O(n), O(n log n), O(n²)
Exact time complexity for every operation on list, dict, set, tuple, string, and deque
Why list.insert(0, x) is O(n) - the memory shift required in C
Why deque.appendleft() is O(1) - the doubly-linked structure
Amortized analysis: why list.append() is O(1) despite occasional O(n) resizes
How hash collisions affect dict and set worst-case behavior
How to profile code with timeit to empirically verify complexity
A decision framework for choosing the right data structure based on workload

Prerequisites

Python lists, dicts, sets, tuples (Module 4, Lessons 1–8)
Basic familiarity with Python functions and loops
Python Lists Internals (Lesson 1 of this module) for memory model background

Part 1 - Big-O Notation: The Engineer's Vocabulary

Big-O notation describes how runtime grows as input size grows. It is not about seconds - it is about scaling behavior.

Complexity	n=10	n=1,000	n=1,000,000	n=1,000,000,000
O(1)	1	1	1	1
O(log n)	3	10	20	30
O(n)	10	1,000	1,000,000	1,000,000,000
O(n log n)	33	10,000	20,000,000	30,000,000,000
O(n²)	100	1,000,000	10¹²	10¹⁸

The jump from O(n) to O(n²) at n=1,000,000 is the difference between completing in 1 second and taking 11 days.

:::note What Big-O Ignores Big-O drops constants and lower-order terms. O(100n) and O(n) are both written as O(n). This is intentional - constants depend on hardware, and the growth curve dominates at scale. When constants matter (e.g., n is always small), profiling beats Big-O analysis. :::

The Complexity Classes You Will Encounter Daily

O(1) - Constant: Runtime does not change regardless of input size.

# All O(1)
d = {"key": "value"}
x = d["key"]          # Hash lookup - O(1) average
lst = [1, 2, 3, 4, 5]
y = lst[2]            # Pointer arithmetic - always O(1)
n = len(lst)          # Stored in struct - O(1)

O(log n) - Logarithmic: Runtime grows by 1 for every doubling of input.

import bisect

# Binary search on sorted list - O(log n)
data = list(range(1_000_000))
idx = bisect.bisect_left(data, 500_000)   # ~20 comparisons for 1M elements

O(n) - Linear: Runtime grows proportionally with input size.

# All O(n) - must touch every element
total = sum([1, 2, 3, ..., n])   # O(n)
found = 42 in my_list            # O(n) - scan until found or end

O(n log n) - Linearithmic: The best achievable for comparison-based sorting.

data = [5, 3, 8, 1, 9, 2]
data.sort()   # O(n log n) - Timsort

O(n²) - Quadratic: Every element compared to every other. Dangerous at scale.

# O(n²) - bubble sort
def bubble_sort(arr):
    n = len(arr)
    for i in range(n):           # O(n)
        for j in range(n - i):   # O(n) inner
            if arr[j] > arr[j+1]:
                arr[j], arr[j+1] = arr[j+1], arr[j]

Part 2 - List Complexity: The Hidden Traps

The Memory Model Behind List Operations

A Python list stores an array of pointers in contiguous memory. This single fact determines almost all list complexity.

List internal array (contiguous pointers, capacity=8, ob_size=5):

Index:   0     1     2     3     4    [5]   [6]   [7]
ob_item: ptr   ptr   ptr   ptr   ptr   -     -     -
                                  ↑ ob_size = 5

O(1) index access: lst[i] = ob_item + i * 8 (pointer arithmetic, no scan needed).

O(n) insert at front: To insert at position 0, every existing element must shift right by one position.

All 4 existing pointers shift right by one position → O(n) cost.

Complete List Complexity Table

Operation	Complexity	Explanation
`lst[i]` (index)	O(1)	Pointer arithmetic: `ob_item + i*8`
`lst[-1]` (last)	O(1)	`ob_item + (ob_size-1)*8`
`lst.append(x)`	O(1)*	Amortized - writes next pointer
`lst.pop()` (from end)	O(1)*	Amortized - decrements ob_size
`lst.pop(i)` (middle)	O(n)	Shifts elements left from i+1
`lst.pop(0)` (from front)	O(n)	Shifts ALL elements left
`lst.insert(0, x)`	O(n)	Shifts ALL elements right
`lst.insert(i, x)`	O(n-i)	Shifts elements right from i
`x in lst` (membership)	O(n)	Linear scan - no hash available
`lst.index(x)`	O(n)	Linear scan
`lst.remove(x)`	O(n)	Scan to find + shift to fill gap
`lst.sort()`	O(n log n)	Timsort - stable
`sorted(lst)`	O(n log n)	Timsort - returns new list
`lst.reverse()`	O(n)	Swaps elements front-to-back
`len(lst)`	O(1)	Stored in ob_size field
`min(lst)` / `max(lst)`	O(n)	Must check every element
`lst + other` (concat)	O(n + m)	Creates new list of size n+m
`lst * k` (repeat)	O(n*k)	Creates new list of size n*k
`lst[i:j]` (slice)	O(j - i)	Creates new list, copies pointers
`lst.extend(iterable)`	O(k)*	k = len(iterable), amortized
`lst.count(x)`	O(n)	Scans entire list
`lst.copy()`	O(n)	Copies all pointers
`lst.clear()`	O(n)	Decrements ref count for each element

*Amortized

Amortized Analysis: Why append() Is O(1)

The key word is amortized. Individual appends can cost O(n) when a resize occurs. But resizes happen at exponentially spaced intervals, so the average across all appends is O(1).

Growth pattern (capacity jumps):
n=0  → capacity=0
n=1  → capacity=4   (resize: copy 0 elements, cost=0)
n=5  → capacity=8   (resize: copy 4 elements, cost=4)
n=9  → capacity=16  (resize: copy 8 elements, cost=8)
n=17 → capacity=25  (resize: copy 16 elements, cost=16)
n=26 → capacity=35  (resize: copy 25 elements, cost=25)

Resize costs: 0, 4, 8, 16, 25, ... (geometric-ish sequence)
Total resize cost for n appends ≈ O(n) (geometric series sum)
Average cost per append = O(n) / n = O(1)

import sys

lst = []
prev_size = sys.getsizeof(lst)

for i in range(20):
    lst.append(i)
    curr = sys.getsizeof(lst)
    if curr != prev_size:
        print(f"  Resize after {len(lst)} elements: {prev_size} → {curr} bytes")
        prev_size = curr

Output:

  Resize after 1 elements: 56 → 88 bytes
  Resize after 5 elements: 88 → 120 bytes
  Resize after 9 elements: 120 → 184 bytes
  Resize after 17 elements: 184 → 248 bytes

:::tip Amortized vs Guaranteed O(1) Amortized O(1) means the average is O(1), but any individual call may take O(n) during a resize. In real-time systems where latency spikes are unacceptable (game loops, audio processing), pre-allocate with [None] * n to avoid mid-operation resizes. :::

Part 3 - Dictionary Complexity: Hash Tables Under the Hood

How dict Achieves O(1) Average

Python dicts are hash tables. A hash function converts a key to an integer index into an array. Lookup, insertion, and deletion follow the same pattern:

For dict["username"]:

hash("username") = 8734928374 (CPython's hash function)
index = 8734928374 % table_size = 42
Check bucket[42] - if key matches, return value
If not (collision), probe to next slot
Return value

This is why average case is O(1): steps 1-3 take constant time.

Hash Collisions: The O(n) Worst Case

If many keys hash to the same index, lookup degrades:

Pathological case - all keys hash to the same slot: bucket[42] → "username" → "password" → "email" → "token" → ... - must scan the entire chain: O(n).

In practice, CPython's hash function distributes keys well, and the load factor (ratio of used/total slots) is kept below 2/3. When it exceeds that, the table is rebuilt (O(n), amortized O(1) per insert).

Operation	Average Case	Worst Case (all keys collide)
`d[key]` (get)	O(1)	O(n)
`d[key] = val` (set/insert)	O(1)	O(n)
`del d[key]`	O(1)	O(n)
`key in d` (membership)	O(1)	O(n)
`d.get(key)`	O(1)	O(n)
`d.keys()`	O(1)	O(1) (returns a view)
`d.values()`	O(1)	O(1) (returns a view)
`d.items()`	O(1)	O(1) (returns a view)
`len(d)`	O(1)	O(1) (stored in struct)
`d.copy()`	O(n)	O(n)
`d.update(other)`	O(len(other))	O(len(other))
`dict(zip(k, v))` (create)	O(n)	O(n)

:::warning Only hashable keys work in dicts Dict keys must be hashable - they need a stable __hash__(). Immutable types (int, str, tuple, frozenset) are hashable. Lists and dicts are NOT hashable and cannot be dict keys. This is why d[[1,2]] = "val" raises TypeError: unhashable type: 'list'. :::

# Demonstrating O(1) vs O(n) membership lookup
import time

n = 5_000_000
data_list = list(range(n))
data_dict = {i: True for i in range(n)}

target = n - 1   # worst case: last element

start = time.perf_counter()
_ = target in data_list
list_time = time.perf_counter() - start

start = time.perf_counter()
_ = target in data_dict
dict_time = time.perf_counter() - start

print(f"List membership (O(n)):  {list_time:.4f}s")
print(f"Dict membership (O(1)):  {dict_time:.7f}s")
print(f"Dict is {list_time/dict_time:.0f}x faster")

Output (approximate):

List membership (O(n)):  0.0421s
Dict membership (O(1)):  0.0000071s
Dict is 5929x faster

Part 4 - Set Complexity: Dict Without Values

Sets share the same hash table implementation as dicts. The complexity profile is nearly identical.

Operation	Average Case	Notes
`s.add(x)`	O(1)	Hash to slot, write
`s.discard(x)`	O(1)	Hash to slot, delete (no KeyError)
`s.remove(x)`	O(1)	Hash to slot, delete (raises KeyError)
`x in s` (membership)	O(1)	Hash lookup - key strength of sets
`len(s)`	O(1)	Stored in struct
`s \| t` (union)	O(len(s)+len(t))	Must hash all elements from both
`s & t` (intersection)	O(min(len(s), len(t)))	Iterates smaller, checks larger
`s - t` (difference)	O(len(s))	Iterates s, checks membership in t
`s ^ t` (symmetric diff)	O(len(s)+len(t))	Union minus intersection
`s <= t` (issubset)	O(len(s))	Every element of s must be in t
`s.copy()`	O(n)	Copies all hash buckets
`set(iterable)` (create)	O(n)	Hash each element

# Set intersection complexity: O(min(m, n))
# CPython iterates the smaller set and checks membership in the larger

s1 = set(range(1_000_000))   # 1M elements
s2 = set(range(500, 1000))   # 500 elements

# Python automatically iterates s2 (smaller) and checks against s1 (larger)
intersection = s1 & s2
# Cost: 500 hash lookups in s1 - O(500), not O(1,000,000)
print(len(intersection))   # 500

Part 5 - Tuple Complexity: Immutable Lists

Tuples share the same contiguous-pointer-array structure as lists, but are immutable. This means no insert, remove, or sort operations.

Operation	Complexity	Notes
`t[i]` (index)	O(1)	Same pointer arithmetic as lists
`x in t` (membership)	O(n)	Linear scan - no hash
`t.count(x)`	O(n)	Full scan
`t.index(x)`	O(n)	Scan until found
`len(t)`	O(1)	Stored in struct
`t + other` (concat)	O(n + m)	Creates new tuple
`t * k` (repeat)	O(n * k)	Creates new tuple
`t[i:j]` (slice)	O(j - i)	Creates new tuple

Tuples are slightly more memory-efficient than lists (no allocated field needed since they don't resize).

import sys

lst = [1, 2, 3, 4, 5]
tpl = (1, 2, 3, 4, 5)

print(sys.getsizeof(lst))   # 104 bytes (includes extra capacity)
print(sys.getsizeof(tpl))   # 80 bytes  (exact fit, no slack)

Part 6 - String Complexity: Concatenation is the Hidden Trap

Strings in Python are immutable sequences of Unicode characters stored contiguously.

Operation	Complexity	Notes
`s[i]` (index)	O(1)	Character at index i
`len(s)`	O(1)	Stored in struct
`sub in s` (membership)	O(n × m)	n=len(s), m=len(sub) - substring search
`s.find(sub)`	O(n × m)	Same - returns index or -1
`s + other` (concat)	O(n + m)	Creates new string - each time!
`"".join(parts)`	O(n)	One allocation, one pass
`s.split()`	O(n)	Scans entire string
`s.replace(old, new)`	O(n)	Scans entire string
`s.strip()`	O(n)	Scans from both ends
`s.lower()` / `s.upper()`	O(n)	Transforms every character
`s * k` (repeat)	O(n × k)	Creates new string

The String Concatenation Trap: O(n²) in Disguise

# WRONG: O(n²) - each + creates a new string
def build_string_bad(parts):
    result = ""
    for part in parts:
        result = result + part   # New string allocated every iteration!
    return result

# RIGHT: O(n) - one allocation, one copy
def build_string_good(parts):
    return "".join(parts)

# Verify with timing
import time
parts = ["x"] * 100_000

start = time.perf_counter()
_ = build_string_bad(parts)
bad_time = time.perf_counter() - start

start = time.perf_counter()
_ = build_string_good(parts)
good_time = time.perf_counter() - start

print(f"Concatenation (+):  {bad_time:.3f}s")
print(f"join():             {good_time:.5f}s")
print(f"join() is {bad_time / good_time:.0f}x faster")

Output (approximate):

Concatenation (+):  1.234s
join():             0.00312s
join() is 396x faster

Why? Each s = s + part allocates a new string of size len(s) + len(part), copying all previous content. Total bytes copied: 1 + 2 + 3 + ... + n = O(n²). join() measures total length first, allocates once, and copies everything - O(n).

:::danger Never Build Strings with + in a Loop This is one of the most common Python performance bugs. Always use "".join(list_of_strings) for building strings from multiple pieces. Or use an io.StringIO buffer for more complex construction. :::

Part 7 - deque: O(1) at Both Ends

collections.deque (doubly-ended queue) is the right tool when you need O(1) insertions or removals from both ends.

O(1) at both ends - no shifting required.

Operation	Complexity	Notes
`dq.append(x)` (right)	O(1)	Write to right end of current block
`dq.appendleft(x)` (left)	O(1)	Write to left end - no shifting
`dq.pop()` (right)	O(1)	Remove from right end
`dq.popleft()` (left)	O(1)	Remove from left end - O(1), not O(n)!
`dq[0]` / `dq[-1]` (ends)	O(1)	Direct pointer to head/tail blocks
`dq[i]` (middle)	O(n)	Must traverse blocks - no direct index
`x in dq` (membership)	O(n)	Linear scan
`len(dq)`	O(1)	Stored in struct
`dq.rotate(k)`	O(k)	Rotate elements - efficient for circular

:::note deque vs list: The Core Tradeoff Use deque when you need O(1) at both ends (queues, sliding windows, BFS). Use list when you need O(1) random index access (lst[i]). deque middle-element access is O(n) - linked block traversal. list middle-element access is O(1) - pointer arithmetic. :::

Part 8 - Profiling with timeit: Verify Your Assumptions

Never trust intuition alone. Use timeit to verify complexity empirically.

import timeit
import random

# Verify that list membership is O(n) and set membership is O(1)
setup = """
import random
n = {n}
data_list = list(range(n))
data_set = set(range(n))
target = n - 1  # worst case: last element
"""

sizes = [1_000, 10_000, 100_000, 1_000_000]
print(f"{'n':>12} {'list (ms)':>12} {'set (ms)':>12} {'ratio':>8}")
print("-" * 50)

for n in sizes:
    list_time = timeit.timeit(
        "target in data_list",
        setup=setup.format(n=n),
        number=100
    ) / 100 * 1000

    set_time = timeit.timeit(
        "target in data_set",
        setup=setup.format(n=n),
        number=100
    ) / 100 * 1000

    print(f"{n:>12,} {list_time:>11.4f}ms {set_time:>11.7f}ms {list_time/set_time:>7.0f}x")

Output (approximate):

           n     list (ms)      set (ms)    ratio
--------------------------------------------------
       1,000       0.0098ms    0.0001200ms      82x
      10,000       0.0912ms    0.0001180ms     773x
     100,000       0.9153ms    0.0001190ms    7691x
   1,000,000       9.1720ms    0.0001200ms   76433x

The list time grows linearly (10x more elements = 10x more time). Set time stays nearly constant. This is O(n) vs O(1) made visible.

# Verify that list.insert(0) is O(n) by timing at different sizes
import timeit

print(f"{'n':>10} {'insert(0) ms':>15} {'append ms':>12} {'ratio':>8}")
print("-" * 50)

for n in [1_000, 10_000, 100_000]:
    insert_time = timeit.timeit(
        "lst.insert(0, 0)",
        setup=f"lst = list(range({n}))",
        number=1000
    ) / 1000 * 1000

    append_time = timeit.timeit(
        "lst.append(0); lst.pop()",
        setup=f"lst = list(range({n}))",
        number=1000
    ) / 1000 * 1000

    print(f"{n:>10,} {insert_time:>14.4f}ms {append_time:>11.5f}ms {insert_time/append_time:>7.0f}x")

Output:

         n   insert(0) ms    append ms    ratio
--------------------------------------------------
     1,000         0.0041ms     0.00018ms      23x
    10,000         0.0347ms     0.00018ms     193x
   100,000         0.3511ms     0.00019ms    1848x

Every 10x increase in n causes ~10x increase in insert(0) time - confirming O(n). append() time stays flat - confirming O(1) amortized.

Part 9 - Choosing the Right Data Structure

The right data structure is determined by which operations dominate your workload.

Dominant Operation	Best Choice	Why
Random access by index	`list`	O(1) pointer arithmetic
Frequent membership testing	`set` or `dict`	O(1) hash lookup
Key-value mapping	`dict`	O(1) hash lookup
Preserve insertion order + dedup	`dict` (Python 3.7+)	Ordered + O(1) lookup
Front+back insertion/removal	`deque`	O(1) at both ends
Priority-based retrieval	`heapq`	O(log n) push/pop
Sorted order, binary search	sorted list + `bisect`	O(log n) search
Immutable sequence (hashable)	`tuple`	Smaller, hashable
Unique elements + set math	`set`	O(1) membership, O(n) ops

Real-World Example: API Token Validation

# Scenario: validate incoming API tokens against 100,000 active tokens
# 10,000 requests per second

import time
import random
import string

def generate_tokens(n):
    return ["".join(random.choices(string.ascii_letters, k=32)) for _ in range(n)]

active_tokens = generate_tokens(100_000)
incoming_token = active_tokens[99_999]  # worst case: last in list

# Option A: List - O(n) per check
tokens_list = active_tokens
start = time.perf_counter()
for _ in range(10_000):
    _ = incoming_token in tokens_list
list_time = time.perf_counter() - start

# Option B: Set - O(1) per check
tokens_set = set(active_tokens)
start = time.perf_counter()
for _ in range(10_000):
    _ = incoming_token in tokens_set
set_time = time.perf_counter() - start

print(f"List (O(n) each): {list_time:.2f}s for 10,000 checks")
print(f"Set  (O(1) each): {set_time:.4f}s for 10,000 checks")
print(f"Set is {list_time / set_time:.0f}x faster")

Output:

List (O(n) each): 8.43s for 10,000 checks
Set  (O(1) each): 0.0012s for 10,000 checks
Set is 7025x faster

With a list, 10,000 requests/second × O(n) = 10,000 × 100,000 = 1 billion operations per second needed. Impossible. With a set, 10,000 × O(1) = 10,000 operations. Trivial.

Interview Questions

Q1: What is the difference between O(1) and amortized O(1)?

Answer: O(1) means every single call takes constant time with no exceptions. Amortized O(1) means the average cost per call is O(1) when evaluated over a sequence of operations, but any individual call may take more. list.append() is amortized O(1): most calls write one pointer (O(1)), but occasional resize operations copy the entire array (O(n)). Because resizes happen at exponentially spaced intervals, the total cost over n appends is O(n), making the average O(1). In real-time systems requiring bounded latency, amortized O(1) may not be acceptable - prefer deque or pre-allocated arrays.

Q2: Why is `list.pop(0)` O(n) while `deque.popleft()` is O(1)?

Answer: A Python list stores elements in a contiguous block of memory as an array of pointers. Removing the first element requires shifting every remaining element one position to the left to maintain contiguity - O(n) shifts for n elements. A deque is implemented as a doubly-linked list of fixed-size blocks. Removing the leftmost element just updates a pointer to the next block - O(1), no shifting required. For any algorithm that requires frequent front-removal (BFS, sliding window, FIFO queue), always use deque, not list.

Q3: When can dict lookup degrade from O(1) to O(n)?

Answer: Dict lookup is O(1) average because it relies on hash functions distributing keys uniformly across buckets. In the worst case, if all keys produce the same hash value (a hash collision), all keys end up in the same bucket, and lookup becomes a linear scan - O(n). In practice, CPython uses a high-quality hash function and maintains a load factor below 2/3 (expanding the table when it fills up), making true worst-case collisions extremely rare for normal keys. However, in adversarial contexts (e.g., user-controlled input in older Python versions), attackers could craft keys to cause all collisions. Python 3.3+ added hash randomization (PYTHONHASHSEED) to mitigate this.

Q4: What is the time complexity of set intersection `s1 & s2`, and does it matter which set is larger?

Answer: Set intersection is O(min(len(s1), len(s2))). Python iterates the smaller set and checks each element's membership in the larger set. Since membership in a set is O(1), the total cost is O(smaller set size). This means s1 & s2 where len(s1) = 10 and len(s2) = 10,000,000 takes only ~10 hash lookups - not 10 million. Python automatically chooses the optimal iteration direction, so you don't need to order the operands yourself. For union, both sets must be traversed: O(m + n).

Q5: Explain why building a string with `+=` in a loop is O(n²).

Answer: Python strings are immutable. Every s = s + part must allocate a new string object, copy all characters from s (length grows by 1 each iteration), and copy characters from part. After n iterations, total characters copied = 1 + 2 + 3 + ... + n = n(n+1)/2 = O(n²). The fix is "".join(parts): Python first computes the total length (one O(n) pass), allocates exactly the right buffer, then copies each part in (one more O(n) pass). Total: O(n). For 100,000 parts, this is the difference between 100,000² = 10¹⁰ operations and 200,000 operations - roughly a 50,000x speedup.

Q6: How would you deduplicate a list of 10 million items while preserving insertion order, efficiently?

Answer: The naive approach - checking if item not in seen_list - is O(n²) because in on a list is O(n). The correct approach uses a dict (or set) for O(1) membership tracking while appending to a list to preserve order:

def deduplicate_ordered(items):
    seen = set()
    result = []
    for item in items:
        if item not in seen:    # O(1) average
            seen.add(item)      # O(1) average
            result.append(item) # O(1) amortized
    return result

Total: O(n). In Python 3.7+, dict.fromkeys(items) is an even more concise solution: list(dict.fromkeys(items)). This works because dict preserves insertion order and dict.fromkeys silently ignores duplicate keys, maintaining only the first occurrence.

Practice Challenges

Beginner: Predict the Output

What is the time complexity of this code, and approximately how long would it take for n = 1,000,000?

def process(items):
    result = []
    for item in items:
        if item not in result:   # <-- this line
            result.append(item)
    return result

Solution

Complexity: O(n²)

The line if item not in result performs a linear scan of result on every iteration of the outer loop. In the worst case (all unique items), after processing k items, result has k elements, so the membership check costs O(k). Summed over all n items: O(1 + 2 + ... + n) = O(n²/2) = O(n²).

For n = 1,000,000: approximately 500,000,000,000 (500 billion) operations - at ~100M operations/second, this would take ~1.4 hours.

Fix:

def process(items):
    seen = set()
    result = []
    for item in items:
        if item not in seen:    # O(1) average
            seen.add(item)
            result.append(item)
    return result

Fixed complexity: O(n). For n = 1,000,000: ~1,000,000 operations - completes in milliseconds.

Intermediate: Choose and Justify

You are building a cache for an API that stores up to 10,000 user session tokens. The operations are:

Check if a token is valid (called 50,000 times/second)
Add a new token when a user logs in
Remove a token when a user logs out
Occasionally list all active tokens (called 1 time/second)

Which data structure should you use for the cache, and why?

Solution

Best choice: set

Analysis by operation:

Operation	Frequency	set	list	dict
Token validation (`in`)	50,000/s	O(1)	O(n)	O(1)
Add token	rare	O(1)	O(1)*	O(1)
Remove token	rare	O(1)	O(n)	O(1)
List all tokens	1/s	O(n)	O(n)	O(n)

The dominant operation is validation (50,000/s). list would require 50,000 × O(10,000) = 500 million operations per second just for validation - impossible. set and dict both offer O(1) validation.

Choose set over dict because you only need the token itself (no associated value needed). set uses slightly less memory than dict (no value slots).

class SessionCache:
    def __init__(self):
        self._tokens = set()

    def is_valid(self, token: str) -> bool:
        return token in self._tokens        # O(1)

    def add(self, token: str) -> None:
        self._tokens.add(token)             # O(1)

    def remove(self, token: str) -> None:
        self._tokens.discard(token)         # O(1), no KeyError

    def list_all(self):
        return list(self._tokens)           # O(n), called rarely

If you also need to track expiry timestamps per token (TTL), upgrade to dict[str, datetime] - same O(1) lookup, but with associated values.

Advanced: Design a Complexity-Aware Data Pipeline

You receive a stream of 50 million log events. Each event is a dict with user_id, action, and timestamp. Design a pipeline that:

Deduplicates events (same user_id + action within 1 second)
Counts unique users per action
Returns the top 5 most common actions

Requirements: O(n) time, O(k) space where k is the number of unique user-action pairs seen in any 1-second window (k << n).

Solution

from collections import defaultdict
import heapq

def process_event_stream(events):
    """
    O(n) time: one pass through all events
    O(k) space: only stores dedup keys for 1-second window + action counts

    Strategy:
    - Use a set for dedup within each 1-second bucket (O(1) lookup/insert)
    - Use a dict to count unique users per action (O(1) updates)
    - Use heapq.nlargest for top-5 (O(n log 5) = O(n) since 5 is constant)
    """
    # Dedup: track (user_id, action) pairs seen per second
    # Key: timestamp (truncated to second), Value: set of (user_id, action)
    seen_per_second = defaultdict(set)      # O(k) space per second

    # Count unique users per action
    action_user_counts = defaultdict(set)   # action -> set of user_ids

    for event in events:                    # O(n) iterations
        user_id = event["user_id"]
        action  = event["action"]
        second  = int(event["timestamp"])   # O(1)

        dedup_key = (user_id, action)

        # O(1) average: check and add to set
        if dedup_key not in seen_per_second[second]:
            seen_per_second[second].add(dedup_key)          # O(1)
            action_user_counts[action].add(user_id)         # O(1)

        # Evict buckets older than 1 second to bound memory
        # (In production, you'd track current max second and evict old ones)

    # Convert sets to counts: O(number of unique actions)
    action_counts = {
        action: len(users)
        for action, users in action_user_counts.items()
    }

    # Top 5: O(a log 5) where a = number of unique actions
    # Since 5 is constant, this is O(a) ≈ O(1) relative to n
    top_5 = heapq.nlargest(5, action_counts.items(), key=lambda x: x[1])

    return top_5


# Example usage
sample_events = [
    {"user_id": "u1", "action": "click", "timestamp": 1700000000.1},
    {"user_id": "u1", "action": "click", "timestamp": 1700000000.5},  # deduped
    {"user_id": "u2", "action": "click", "timestamp": 1700000000.3},
    {"user_id": "u1", "action": "purchase", "timestamp": 1700000001.2},  # new second
    {"user_id": "u3", "action": "view",  "timestamp": 1700000000.9},
]

result = process_event_stream(sample_events)
print(result)
# [('click', 2), ('view', 1), ('purchase', 1)]

Why this is O(n):

The main loop runs exactly once per event: O(n)
All operations inside the loop (set lookup, set add, dict access) are O(1) average
The final heapq.nlargest(5, ...) is O(a) where a = unique action count (tiny vs n)
Total: O(n + a) ≈ O(n)

Space complexity:

seen_per_second: O(k) where k = unique (user, action) pairs per second
action_user_counts: O(a × u) where a = unique actions, u = unique users
Both are bounded by problem constants, not n

Quick Reference

Data Structure	Access	Search	Insert (end)	Insert (front)	Delete
list	O(1)	O(n)	O(1)*	O(n)	O(n)
deque	O(n)	O(n)	O(1)	O(1)	O(n)
dict	O(1)**	O(1)**	O(1)**	N/A	O(1)**
set	N/A	O(1)**	O(1)**	N/A	O(1)**
tuple	O(1)	O(n)	N/A	N/A	N/A
str	O(1)	O(n*m)	N/A (immutable)	N/A	N/A

* Amortized | ** Average case

Operation Pattern	Use This	Not This
Frequent `x in collection`	`set` or `dict`	`list` or `tuple`
Front + back append/pop	`deque`	`list`
Key-value mapping	`dict`	list of tuples
Build string from parts	`"".join(parts)`	`s += part` in loop
Top-k elements	`heapq.nlargest(k, data)`	`sorted(data)[-k:]`
Deduplicate + preserve order	`dict.fromkeys(data)`	`list` with `in` check
Binary search	`bisect` on sorted list	Linear scan on list

Key Takeaways

O(1) vs O(n) is the most impactful complexity boundary in Python: choosing set over list for membership testing can be 1000-100,000x faster at scale
Amortized O(1) means the average is O(1), but individual operations can spike - for latency-sensitive systems, pre-allocate or use deque
String concatenation with + in a loop is O(n²) - always use "".join(parts) for building strings from multiple pieces
list.insert(0, x) and list.pop(0) are O(n) because the entire array must shift; use deque.appendleft() / deque.popleft() for O(1) front operations
Dict and set operations are O(1) average, O(n) worst case due to hash collisions - CPython's hash randomization makes worst-case rare in practice
Set intersection is O(min(m, n)) - Python automatically iterates the smaller set, so operand order doesn't matter
Verify with timeit - empirical profiling confirms theoretical complexity and reveals constant-factor differences that Big-O hides

What You Will Learn​

Prerequisites​

Part 1 - Big-O Notation: The Engineer's Vocabulary​

The Complexity Classes You Will Encounter Daily​

Part 2 - List Complexity: The Hidden Traps​

The Memory Model Behind List Operations​

Complete List Complexity Table​

Amortized Analysis: Why append() Is O(1)​

Part 3 - Dictionary Complexity: Hash Tables Under the Hood​

How dict Achieves O(1) Average​

Hash Collisions: The O(n) Worst Case​

Part 4 - Set Complexity: Dict Without Values​

Part 5 - Tuple Complexity: Immutable Lists​

Part 6 - String Complexity: Concatenation is the Hidden Trap​

The String Concatenation Trap: O(n²) in Disguise​

Part 7 - deque: O(1) at Both Ends​

Part 8 - Profiling with timeit: Verify Your Assumptions​

Part 9 - Choosing the Right Data Structure​

Real-World Example: API Token Validation​

Interview Questions​

Q1: What is the difference between O(1) and amortized O(1)?​

Q2: Why is list.pop(0) O(n) while deque.popleft() is O(1)?​

Q3: When can dict lookup degrade from O(1) to O(n)?​

Q4: What is the time complexity of set intersection s1 & s2, and does it matter which set is larger?​

Q5: Explain why building a string with += in a loop is O(n²).​

Q6: How would you deduplicate a list of 10 million items while preserving insertion order, efficiently?​

Practice Challenges​

Beginner: Predict the Output​

Intermediate: Choose and Justify​

Advanced: Design a Complexity-Aware Data Pipeline​

Quick Reference​

Key Takeaways​

What You Will Learn

Prerequisites

Part 1 - Big-O Notation: The Engineer's Vocabulary

The Complexity Classes You Will Encounter Daily

Part 2 - List Complexity: The Hidden Traps

The Memory Model Behind List Operations

Complete List Complexity Table

Amortized Analysis: Why append() Is O(1)

Part 3 - Dictionary Complexity: Hash Tables Under the Hood

How dict Achieves O(1) Average

Hash Collisions: The O(n) Worst Case

Part 4 - Set Complexity: Dict Without Values

Part 5 - Tuple Complexity: Immutable Lists

Part 6 - String Complexity: Concatenation is the Hidden Trap

The String Concatenation Trap: O(n²) in Disguise

Part 7 - deque: O(1) at Both Ends

Part 8 - Profiling with timeit: Verify Your Assumptions

Part 9 - Choosing the Right Data Structure

Real-World Example: API Token Validation

Interview Questions

Q1: What is the difference between O(1) and amortized O(1)?

Q2: Why is `list.pop(0)` O(n) while `deque.popleft()` is O(1)?

Q3: When can dict lookup degrade from O(1) to O(n)?

Q4: What is the time complexity of set intersection `s1 & s2`, and does it matter which set is larger?

Q5: Explain why building a string with `+=` in a loop is O(n²).

Q6: How would you deduplicate a list of 10 million items while preserving insertion order, efficiently?

Practice Challenges

Beginner: Predict the Output

Intermediate: Choose and Justify

Advanced: Design a Complexity-Aware Data Pipeline

Quick Reference

Key Takeaways