Python Time Complexity of Data Structures: Practice Problems & Exercises

Practice: Time Complexity of Data Structures

11 problems4 Easy4 Medium3 Hard⏱ 40-55 min

Easy

#1Identify Operation ComplexitiesEasy

big-olistdictsetpredict-output

Fill in the correct Big-O complexity for each operation. The function should return a list of strings describing the complexity of each operation.

Python

def identify_complexities():
    """Print the Big-O complexity of each operation."""
    operations = {
        "lst[500]": "O(1)",
        "lst.append(x)": "O(1) amortized",
        "lst.insert(0, x)": "O(n)",
        "x in lst": "O(n)",
        "x in d": "O(1)",
        "x in s": "O(1)",
        "lst.sort()": "O(n log n)",
        "len(lst)": "O(1)",
    }
    for op, complexity in operations.items():
        print(f"{op}: {complexity}")

identify_complexities()

Solution

def identify_complexities():
    operations = {
        "lst[500]": "O(1)",           # Pointer arithmetic: base + i*8
        "lst.append(x)": "O(1) amortized",  # Writes to next slot, occasional resize
        "lst.insert(0, x)": "O(n)",   # Must shift ALL elements right
        "x in lst": "O(n)",           # Linear scan — no hash table
        "x in d": "O(1)",             # Dict hash lookup — average case
        "x in s": "O(1)",             # Set hash lookup — average case
        "lst.sort()": "O(n log n)",   # Timsort — comparison-based sort
        "len(lst)": "O(1)",           # Stored in ob_size field of struct
    }
    for op, complexity in operations.items():
        print(f"{op}: {complexity}")

Key insights:

Index access lst[i] is always O(1) regardless of i — Python lists store contiguous pointers, so any index is a single pointer arithmetic calculation.
Membership in is the biggest trap: O(n) for list/tuple, O(1) average for dict/set. This single difference can cause 10,000x performance gaps at scale.
len() is O(1) for all built-in collections — the length is stored in the struct, not computed by counting.
list.append() is amortized O(1) — most calls are O(1), but occasional resizes cost O(n). The average over n appends is still O(1).

Expected Output

lst[500]: O(1)
lst.append(x): O(1) amortized
lst.insert(0, x): O(n)
x in lst: O(n)
x in d: O(1)
x in s: O(1)
lst.sort(): O(n log n)
len(lst): O(1)

Hints

Hint 1: List index access uses pointer arithmetic — the position does not matter, it is always constant time.

Hint 2: Membership testing (x in collection) is O(n) for lists and tuples (linear scan) but O(1) for dicts and sets (hash lookup).

#2Predict the Performance DifferenceEasy

list-vs-setmembershippredict-output

Predict which approach is faster and by roughly how much for checking if a value exists in a collection of 100,000 elements. Explain the Big-O reasoning.

Python

data_list = list(range(100_000))
data_set = set(range(100_000))

target = 99_999  # worst case for list: last element

# Which is faster?
found_list = target in data_list   # Approach A
found_set = target in data_set     # Approach B

print(f"List lookup: slow (O(n) — scans up to 100,000 elements)")
print(f"Set lookup: fast (O(1) — single hash computation)")
print(f"Set is dramatically faster for membership testing")

Solution

List lookup: slow (O(n) — scans up to 100,000 elements)
Set lookup: fast (O(1) — single hash computation)
Set is dramatically faster for membership testing

Key insights:

target in data_list is O(n) — Python scans from index 0 to index 99,999 checking each element. With the target at the last position, it performs 100,000 comparisons.
target in data_set is O(1) average — Python computes hash(99_999), maps it to a bucket index, and checks one (or very few) slots.
The actual speedup at n=100,000 is roughly 5,000-10,000x. At n=1,000,000, it becomes 50,000-100,000x.
This is why choosing between list and set for membership testing is one of the most impactful performance decisions in Python.

Expected Output

List lookup: slow (O(n) — scans up to 100,000 elements)
Set lookup: fast (O(1) — single hash computation)
Set is dramatically faster for membership testing

Hints

Hint 1: The list must scan element by element until it finds the target or reaches the end. The set computes a hash and jumps directly to the bucket.

Hint 2: For the worst case (target is the last element or not present), list scans all n elements while set still does one hash lookup.

#3Fix the O(n) Front InsertionEasy

dequelistoptimization

The following code inserts elements at the front of a list, which is O(n) per insertion and O(n squared) overall. Rewrite it using collections.deque to achieve O(1) per insertion and O(n) overall.

Python

# SLOW VERSION — O(n²) total
def build_sequence_slow(n):
    result = []
    for i in range(n):
        result.insert(0, i)  # O(n) shift every time!
    return result

print(build_sequence_slow(10))

Solution

from collections import deque

def build_sequence_fast(n):
    result = deque()
    for i in range(n):
        result.appendleft(i)  # O(1) — no shifting
    return list(result)

print(build_sequence_fast(10))
print("Correct!")

Key insights:

list.insert(0, x) is O(n) because it must shift every existing element one position to the right in contiguous memory.
deque.appendleft(x) is O(1) because a deque uses a doubly-linked list of fixed-size blocks — it simply writes to the left end of the leftmost block.
For n=100,000 insertions: list approach takes ~1.8 seconds, deque approach takes ~0.007 seconds — a 260x speedup.
Always use deque when you need frequent insertions or removals at the front.

Starter Code

from collections import deque

def build_sequence_fast(n):
    """Build a sequence by prepending numbers 0..n-1.
    Must use O(1) per insertion instead of O(n).
    Return the result as a regular list.
    """
    # TODO: Replace list.insert(0, x) with a deque-based approach
    pass

Expected Output

[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
Correct!

Hints

Hint 1: deque.appendleft() is O(1) — it writes to the left end without shifting any elements.

Hint 2: You can convert a deque back to a list with list(my_deque) after building it.

#4String Concatenation TrapEasy

stringjoinoptimization

The naive approach builds a CSV string using += in a loop, which is O(n squared) due to string immutability. Rewrite using join() to achieve O(n).

Python

# SLOW VERSION — O(n²)
def build_csv_slow(headers, rows):
    result = ",".join(headers)
    for row in rows:
        result += "\n"           # New string allocated!
        result += ",".join(row)  # Another new string!
    return result

headers = ["name", "age", "city"]
rows = [["alice", "30", "nyc"], ["bob", "25", "sf"], ["charlie", "35", "la"]]
print(build_csv_slow(headers, rows))

Solution

def build_csv_fast(headers, rows):
    lines = [",".join(headers)]
    for row in rows:
        lines.append(",".join(row))
    return "\n".join(lines)

headers = ["name", "age", "city"]
rows = [["alice", "30", "nyc"], ["bob", "25", "sf"], ["charlie", "35", "la"]]
print(build_csv_fast(headers, rows))

Key insights:

Every result += part creates a brand new string, copying all previous content. After k iterations, the copy cost is proportional to k, and summing 1+2+...+n gives O(n squared).
"\n".join(lines) computes total length first, allocates one buffer, then copies each piece exactly once — O(n) total.
For 100,000 rows: += takes ~1.2 seconds, join() takes ~0.003 seconds — roughly 400x faster.
This pattern applies everywhere: log messages, HTML generation, report building. Always collect parts in a list and join once.

Starter Code

def build_csv_fast(headers, rows):
    """Build a CSV string efficiently.
    headers: list of column names
    rows: list of lists (each inner list is one row)
    Return the complete CSV as a single string.
    """
    # TODO: Use join() instead of += in a loop
    pass

Expected Output

name,age,city
alice,30,nyc
bob,25,sf
charlie,35,la

Hints

Hint 1: Use ",".join(row) to join columns within a row, and "\n".join(all_lines) to join all rows.

Hint 2: Build a list of line strings first, then join them all at once — this is O(n) total instead of O(n squared).

Medium

#5Optimize O(n²) Duplicate Detection to O(n)Medium

setoptimizationdeduplication

The naive approach uses a nested loop to find duplicates, running in O(n squared). Rewrite it to run in O(n) using a set for fast membership checking.

Python

# SLOW VERSION — O(n²)
def find_duplicates_slow(items):
    duplicates = []
    for i, item in enumerate(items):
        if item in items[i+1:] and item not in duplicates:
            duplicates.append(item)
    return duplicates

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
print(find_duplicates_slow(data))

Solution

def find_duplicates_fast(items):
    seen = set()
    duplicates = []
    duplicates_set = set()  # For O(1) "already in duplicates" check
    for item in items:
        if item in seen and item not in duplicates_set:
            duplicates.append(item)
            duplicates_set.add(item)
        seen.add(item)
    return duplicates

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
print(find_duplicates_fast(data))

Why the original is O(n squared):

item in items[i+1:] creates a slice (O(n)) and then scans it (O(n)) — this alone is O(n) per iteration.
item not in duplicates scans the duplicates list (O(k) where k grows).
Total: O(n) iterations x O(n) membership check = O(n squared).

Why the fix is O(n):

item in seen is O(1) average (set hash lookup).
item not in duplicates_set is O(1) average (set hash lookup).
seen.add(item) is O(1) average.
Total: O(n) iterations x O(1) per iteration = O(n).

Starter Code

def find_duplicates_fast(items):
    """Return a list of items that appear more than once.
    Preserve the order of first duplicate occurrence.
    Must run in O(n) time.
    """
    # TODO: Use a set for O(1) membership checks
    pass

Expected Output

[3, 5, 1]

Hints

Hint 1: Track two sets: one for all items seen so far (for O(1) membership), and one for items already added to duplicates (to avoid adding the same duplicate twice).

Hint 2: For each item: if it is in seen AND not in duplicates output, add it to the result. Always add it to seen.

#6Choose the Right Data Structure for a Frequency CounterMedium

dictcounterdesign

Given a list of words, return the k most frequent words sorted by frequency (descending). Break ties alphabetically. Choose the optimal data structures to achieve O(n log k) time.

Python

# Test case
words = ["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"]
k = 3

# Expected: ['the', 'is', 'sunny'] — frequencies: the=3, is=3, sunny=2
# Tie between 'the' and 'is' broken alphabetically: is < the,
# but both have freq 3 so both appear before sunny (freq 2)

Solution

import heapq
from collections import Counter

def top_k_frequent(words, k):
    counts = Counter(words)  # O(n) — dict-based counting
    # Sort by (-frequency, word) so higher freq comes first,
    # and ties are broken alphabetically
    return heapq.nsmallest(
        k,
        counts.keys(),
        key=lambda w: (-counts[w], w)
    )  # O(n log k)

words = ["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"]
print(top_k_frequent(words, 3))

Complexity analysis:

Counting: Counter(words) iterates all n words, each dict update is O(1) average. Total: O(n).
Top-k extraction: heapq.nsmallest(k, ...) maintains a heap of size k, processing all unique words. Cost: O(m log k) where m is unique words (m is at most n, usually much less).
Total: O(n + m log k) which is better than O(n log n) from full sorting when k is much smaller than n.

Why dict/Counter is the right choice:

Need to map each word to its count — dict provides O(1) get/set.
A list of tuples would require O(n) scanning to find existing entries.
A set cannot store counts — only membership.

Starter Code

def top_k_frequent(words, k):
    """Return the k most frequent words, sorted by frequency (descending).
    Break ties alphabetically.
    Must run in O(n log k) time or better.
    """
    # TODO: Use a dict for counting, then extract top k
    pass

Expected Output

['the', 'is', 'sunny']

Hints

Hint 1: Use a dict (or collections.Counter) to count word frequencies in O(n). Then extract the top k.

Hint 2: heapq.nlargest(k, items, key=...) runs in O(n log k) which is better than sorting all items when k is small.

#7Sliding Window Maximum with dequeMedium

dequesliding-windowoptimization

The naive approach checks all k elements in each window position, giving O(n*k). Implement an O(n) solution using a deque to maintain a monotonic decreasing queue of candidates.

Python

# SLOW VERSION — O(n*k)
def max_sliding_window_slow(nums, k):
    return [max(nums[i:i+k]) for i in range(len(nums) - k + 1)]

nums = [1, 3, -1, -3, 5, 3, 6, 7]
k = 3
print(max_sliding_window_slow(nums, k))

Solution

from collections import deque

def max_sliding_window(nums, k):
    dq = deque()  # Stores indices, values at indices are decreasing
    result = []

    for i, num in enumerate(nums):
        # Remove indices outside the window
        while dq and dq[0] <= i - k:
            dq.popleft()  # O(1)

        # Remove indices whose values are smaller than current
        # (they can never be max while current element exists in window)
        while dq and nums[dq[-1]] <= num:
            dq.pop()  # O(1)

        dq.append(i)  # O(1)

        # Window is fully formed once we've seen k elements
        if i >= k - 1:
            result.append(nums[dq[0]])  # Front of deque is always the max

    return result

nums = [1, 3, -1, -3, 5, 3, 6, 7]
k = 3
print(max_sliding_window(nums, k))

Why this is O(n):

Each index is added to the deque exactly once and removed at most once. Total operations across all iterations: O(n) additions + O(n) removals = O(2n) = O(n).
deque.popleft(), deque.pop(), and deque.append() are all O(1).
The naive approach calls max() on k elements per window: O(k) per position times O(n) positions = O(n*k).

Why deque and not list:

We need O(1) removal from both ends. list.pop(0) would be O(n) — shifting all remaining elements. deque.popleft() is O(1).
This is a textbook case where deque's O(1) front operations are essential for the algorithm's overall complexity.

Starter Code

from collections import deque

def max_sliding_window(nums, k):
    """Return the maximum value in each sliding window of size k.
    Must run in O(n) time, not O(n*k).
    """
    # TODO: Use a deque to track indices of potential maximums
    pass

Expected Output

[3, 3, 5, 5, 6, 7]

Hints

Hint 1: Use a deque to store indices (not values) of elements that could be the maximum of some future window. Keep the deque in decreasing order of values.

Hint 2: Before adding a new index, pop all indices from the right whose values are smaller than the current element — they can never be a window maximum.

Hint 3: Before reading the max, pop indices from the left that are outside the current window (index less than or equal to i - k).

#8Amortized Analysis: Predict Resize BehaviorMedium

amortizedlistmemory

Predict when list resizes occur during 20 consecutive appends. Track the byte size changes and explain why the amortized cost is O(1) despite individual resize operations being O(n).

Python

import sys

lst = []
prev_size = sys.getsizeof(lst)
resize_count = 0

for i in range(20):
    lst.append(i)
    curr = sys.getsizeof(lst)
    if curr != prev_size:
        resize_count += 1
        # Each pointer is 8 bytes on 64-bit; base overhead is 56 bytes
        approx_capacity = (curr - 56) // 8
        print(f"Resize at length {len(lst)}: {prev_size} -> {curr} bytes (capacity ~{approx_capacity})")
        prev_size = curr

print(f"Total resizes in 20 appends: {resize_count}")
print(f"Amortized cost per append: O(1)")

Solution

Resize at length 1: 56 -> 88 bytes (capacity ~4)
Resize at length 5: 88 -> 120 bytes (capacity ~8)
Resize at length 9: 120 -> 184 bytes (capacity ~16)
Resize at length 17: 184 -> 248 bytes (capacity ~25)
Total resizes in 20 appends: 4
Amortized cost per append: O(1)

Amortized analysis explanation:

Resizes happen at lengths 1, 5, 9, 17 — exponentially spaced intervals.
At each resize, Python copies all existing elements to a new, larger array.
Copy costs: 0 + 4 + 8 + 16 = 28 total elements copied across all 4 resizes.
Non-resize appends: 16 appends at O(1) each.
Total work for 20 appends: 28 (resize copies) + 20 (individual writes) = 48 operations.
Average per append: 48/20 = 2.4 = O(1).

The geometric series argument:

If capacity doubles each time, resize costs form a geometric series: 1 + 2 + 4 + 8 + ... + n/2 = n - 1.
Total work for n appends = n (individual writes) + (n - 1) (total resize copies) = O(n).
Amortized per append = O(n)/n = O(1).

Expected Output

Resize at length 1: 56 -> 88 bytes (capacity ~4)
Resize at length 5: 88 -> 120 bytes (capacity ~8)
Resize at length 9: 120 -> 184 bytes (capacity ~16)
Resize at length 17: 184 -> 248 bytes (capacity ~25)
Total resizes in 20 appends: 4
Amortized cost per append: O(1)

Hints

Hint 1: Python list over-allocates by roughly 12.5% plus a small constant. The capacity jumps are approximately: 0, 4, 8, 16, 25, 35, 46, ...

Hint 2: Count the total elements copied across all resizes. The geometric growth means total copy cost is O(n) for n appends, giving O(1) amortized per append.

Hard

#9Design a Time-Complexity-Aware LRU CacheHard

dictdesigncacheoptimization

Implement an LRU (Least Recently Used) cache where both get and put operations run in O(1) time. This requires combining a hash map for O(1) lookup with an ordered structure for O(1) eviction.

Python

# Test case
cache = None  # Replace with LRUCache(2)

# cache.put(1, 1)
# cache.put(2, 2)
# print(f"get(1) = {cache.get(1)}")       # 1
# print(f"get(2) = {cache.get(2)}")       # 2
# cache.put(3, 3)                         # evicts key 1 (LRU)
# print(f"get(1) after adding 3 = {cache.get(1)}")  # -1 (evicted)
# print(f"get(2) after adding 3 = {cache.get(2)}")  #  2
# print(f"get(3) = {cache.get(3)}")       # 3

Solution

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # Maintains insertion/access order

    def get(self, key):
        if key not in self.cache:     # O(1) hash lookup
            return -1
        self.cache.move_to_end(key)   # O(1) mark as most recently used
        return self.cache[key]        # O(1)

    def put(self, key, value):
        if key in self.cache:             # O(1) hash lookup
            self.cache.move_to_end(key)   # O(1) update order
        self.cache[key] = value           # O(1) insert/update
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False) # O(1) evict LRU (front)

cache = LRUCache(2)
cache.put(1, 1)
cache.put(2, 2)
print(f"get(1) = {cache.get(1)}")
print(f"get(2) = {cache.get(2)}")
cache.put(3, 3)  # evicts key 1
print(f"get(1) after adding 3 = {cache.get(1)}")
print(f"get(2) after adding 3 = {cache.get(2)}")
print(f"get(3) = {cache.get(3)}")

Complexity analysis for every operation:

Operation	Method	Complexity
Key lookup	`key in self.cache`	O(1) avg — hash table
Move to end	`move_to_end(key)`	O(1) — internal doubly-linked list pointer update
Insert/update	`self.cache[key] = value`	O(1) avg — hash table
Evict LRU	`popitem(last=False)`	O(1) — remove from front of internal linked list
Length check	`len(self.cache)`	O(1) — stored in struct

Why OrderedDict works:

OrderedDict combines a hash table (for O(1) key lookup) with a doubly-linked list (for O(1) order manipulation).
A plain dict preserves insertion order in Python 3.7+ but has no move_to_end() — you cannot efficiently re-order entries.
A list-based approach would require O(n) to find and move elements.

Starter Code

from collections import OrderedDict

class LRUCache:
    """Least Recently Used cache with O(1) get and put.
    
    When the cache exceeds capacity, evict the least recently used item.
    """
    def __init__(self, capacity):
        # TODO: Choose data structures for O(1) operations
        pass

    def get(self, key):
        """Return value if key exists (and mark as recently used), else -1."""
        # TODO
        pass

    def put(self, key, value):
        """Insert or update key-value. Evict LRU item if over capacity."""
        # TODO
        pass

Expected Output

get(1) = 1
get(2) = 2
get(1) after adding 3 = 1
get(2) after adding 3 = -1
get(3) = 3

Hints

Hint 1: You need O(1) key lookup (dict) combined with O(1) order tracking (doubly-linked list or OrderedDict). A plain dict does not track access order.

Hint 2: OrderedDict.move_to_end(key) is O(1) and marks a key as most recently used. popitem(last=False) evicts the least recently used in O(1).

Hint 3: On get: if key exists, move_to_end then return value. On put: if key exists, update and move_to_end. If new and at capacity, popitem(last=False) first.

#10Benchmark and Prove Complexity EmpiricallyHard

timeitprofilingempirical-analysis

Write a benchmarking function that empirically proves the time complexity of an operation by measuring execution time at multiple input sizes and showing the growth factor. Use it to verify that list membership is O(n) and set membership is O(1).

Python

import time

# Demonstrate: timing list vs set membership at 3 sizes
sizes = [1_000, 10_000, 100_000]

print("--- List membership (expect O(n) — linear growth) ---")
prev_time = None
print(f"{'n':>8} {'time_ms':>12} {'growth':>8}")
for n in sizes:
    data = list(range(n))
    target = n - 1
    start = time.perf_counter()
    for _ in range(1000):
        _ = target in data
    elapsed = (time.perf_counter() - start) / 1000 * 1000
    growth = f"~{elapsed/prev_time:.0f}x" if prev_time else "--"
    print(f"{n:>8,} {elapsed:>10.2f}ms {growth:>8}")
    prev_time = elapsed

Solution

import time

def benchmark_operation(setup_fn, test_fn, sizes, label=""):
    print(f"--- {label} ---")
    print(f"{'n':>8} {'time_ms':>12} {'growth':>8}")
    prev_time = None
    for n in sizes:
        data = setup_fn(n)
        start = time.perf_counter()
        for _ in range(1000):
            test_fn(data)
        elapsed = (time.perf_counter() - start) / 1000 * 1000
        growth = f"~{elapsed/prev_time:.0f}x" if prev_time else "--"
        print(f"{n:>8,} {elapsed:>10.2f}ms {growth:>8}")
        prev_time = elapsed

sizes = [1_000, 10_000, 100_000]

benchmark_operation(
    setup_fn=lambda n: list(range(n)),
    test_fn=lambda data: (len(data) - 1) in data,
    sizes=sizes,
    label="List membership (expect O(n) — linear growth)"
)

print()

benchmark_operation(
    setup_fn=lambda n: set(range(n)),
    test_fn=lambda data: (max(data)) in data,
    sizes=sizes,
    label="Set membership (expect O(1) — constant)"
)

How to interpret the growth factor:

Observed Growth (10x input)	Implied Complexity
~1x	O(1) — constant
~3.3x	O(log n) — logarithmic
~10x	O(n) — linear
~33x	O(n log n) — linearithmic
~100x	O(n squared) — quadratic

Key insights:

A single timing measurement is unreliable due to OS scheduling, cache effects, and garbage collection. Averaging over 1000+ runs smooths out noise.
The growth factor is the key signal: if doubling input consistently doubles time, the operation is O(n). If time stays flat, it is O(1).
time.perf_counter() is more precise than time.time() — it uses the highest-resolution clock available on your OS.
For production profiling, use the timeit module which handles warmup, repetition, and GC disabling automatically.

Starter Code

import time

def benchmark_operation(setup_fn, test_fn, sizes):
    """Benchmark test_fn at different input sizes.
    
    setup_fn(n) -> returns the data structure to test on
    test_fn(data) -> performs the operation to benchmark
    sizes -> list of input sizes to test
    
    Print a table showing size, time, and growth factor.
    """
    # TODO: For each size, run setup_fn, then time test_fn
    # Print results showing how time scales with input size
    pass

Expected Output

--- List membership (expect O(n) — linear growth) ---
       n      time_ms    growth
   1,000       0.03ms       --
  10,000       0.24ms    ~10x
 100,000       2.40ms    ~10x

--- Set membership (expect O(1) — constant) ---
       n      time_ms    growth
   1,000       0.00ms       --
  10,000       0.00ms     ~1x
 100,000       0.00ms     ~1x

Hints

Hint 1: To get reliable timing, run the operation many times (e.g., 1000 repetitions) and take the average. Single runs are too noisy.

Hint 2: Calculate the growth factor as time(n) / time(n_prev). For O(n) operations, 10x input should give ~10x time. For O(1), the ratio should stay near 1x.

#11Optimize a Multi-Step Data PipelineHard

optimizationdesigndictsetdeque

You have a stream of log events. Design a pipeline that deduplicates events within a time window, counts actions, and tracks unique users — all in O(n) time. Choose the right data structure for each sub-task and justify your choices.

Python

# Sample event stream
events = [
    ("user1", "click", 1.0),
    ("user1", "click", 1.5),    # duplicate — same user+action within 2s
    ("user2", "click", 2.0),
    ("user1", "purchase", 3.5),
    ("user3", "view", 4.0),
    ("user1", "click", 4.5),    # NOT duplicate — 3.5s since last click by user1
]

# Expected after processing with window_size=2.0:
# Deduplicated events: 4 (second click by user1 is a dup)
# Action counts: click=2, purchase=1, view=1 (user1 click at 4.5 counted since outside window)
# Unique users: 3

Solution

from collections import Counter

def process_log_stream(events, window_size):
    # dict: (user_id, action) -> last_timestamp — O(1) lookup/update
    last_seen = {}

    # Counter (dict subclass): action -> count — O(1) increment
    action_counts = Counter()

    # set: unique user IDs — O(1) add/membership
    unique_users = set()

    processed = 0

    for user_id, action, timestamp in events:  # O(n) iterations
        key = (user_id, action)

        # Dedup check: O(1) dict lookup
        if key in last_seen and (timestamp - last_seen[key]) < window_size:
            continue  # Skip duplicate

        # Update last seen: O(1) dict write
        last_seen[key] = timestamp

        # Count action: O(1) Counter increment
        action_counts[action] += 1

        # Track unique users: O(1) set add
        unique_users.add(user_id)

        processed += 1

    return action_counts, len(unique_users), processed

events = [
    ("user1", "click", 1.0),
    ("user1", "click", 1.5),
    ("user2", "click", 2.0),
    ("user1", "purchase", 3.5),
    ("user3", "view", 4.0),
    ("user1", "click", 4.5),
]

counts, unique, processed = process_log_stream(events, window_size=2.0)

action_str = ", ".join(f"{a}={c}" for a, c in counts.items())
print(f"Action counts: {action_str}")
print(f"Unique users: {unique}")
print(f"Processed (after dedup): {processed} out of {len(events)} events")

Data structure choices and their complexity justification:

Sub-task	Data Structure	Why	Per-operation Cost
Dedup tracking	`dict` (key: tuple, value: timestamp)	Need O(1) lookup by composite key + O(1) update	O(1) avg
Action counting	`Counter` (dict subclass)	Need O(1) increment by action name	O(1) avg
Unique user tracking	`set`	Need O(1) add + implicit dedup (set ignores duplicates)	O(1) avg

Total complexity: O(n) — one pass through events, all operations inside the loop are O(1) average.

Space complexity: O(k) where k = number of unique (user, action) pairs + unique actions + unique users. Since k is at most n but typically much less, this is efficient.

Why NOT these alternatives:

List for dedup: (user_id, action) in seen_list would be O(k) per check, making the whole pipeline O(n*k).
Sorting for dedup: Sorting by (user, action, time) would be O(n log n) — worse than the O(n) dict approach.
Nested dicts: last_seen[user_id][action] adds complexity without improving Big-O. A flat dict with tuple keys is simpler and equally fast.

Starter Code

from collections import deque, Counter

def process_log_stream(events, window_size):
    """Process a stream of log events efficiently.
    
    For each event (user_id, action, timestamp):
    1. Deduplicate: skip if same (user_id, action) seen within last window_size seconds
    2. Count unique users per action
    3. Return (action_counts, total_unique_users, total_processed)
    
    Must run in O(n) time where n = len(events).
    """
    # TODO: Choose optimal data structures for each task
    pass

Expected Output

Action counts: click=2, purchase=1, view=1
Unique users: 3
Processed (after dedup): 4 out of 6 events

Hints

Hint 1: Use a dict mapping (user_id, action) to the latest timestamp for O(1) dedup checks. An event is a duplicate if the same key was seen within window_size seconds.

Hint 2: Use a Counter (dict subclass) for action counts and a set for tracking unique users — both provide O(1) operations.

Hint 3: Every operation inside the loop must be O(1) average. If any operation is O(n), the whole pipeline becomes O(n squared).

Practice: Time Complexity of Data Structures

Easy​

Medium​

Hard​

Easy

Medium

Hard