Heap and Priority Queue - Efficient Minimum Tracking
Reading time: ~20 minutes | Level: Foundation → Engineering
Here is a problem that exposes whether you understand the fundamental tradeoff between sorting and heaps.
import heapq
import time
import random
data = [random.randint(0, 1_000_000) for _ in range(10_000_000)]
k = 10
# Method A: full sort, then take top-k
start = time.perf_counter()
result_a = sorted(data, reverse=True)[:k]
sort_time = time.perf_counter() - start
# Method B: heapq.nlargest
start = time.perf_counter()
result_b = heapq.nlargest(k, data)
heap_time = time.perf_counter() - start
print(f"sorted()[-k:]: {sort_time:.3f}s - O(n log n)")
print(f"heapq.nlargest(k): {heap_time:.3f}s - O(n log k)")
print(f"Heap is {sort_time / heap_time:.1f}x faster for k={k}")
print(f"Results equal: {sorted(result_a) == sorted(result_b)}")
Output (approximate):
sorted()[-k:]: 6.241s - O(n log n)
heapq.nlargest(k): 1.083s - O(n log k)
Heap is 5.8x faster for k=10
Results equal: True
Both return the same 10 largest values from 10 million elements. The heap is 5.8x faster because it avoids fully sorting all 10 million elements - it only maintains a 10-element heap throughout the scan. This is the difference between O(n log n) and O(n log k) where k=10.
Understanding when to use a heap instead of a sort is a mark of engineering maturity.
What You Will Learn
- The binary heap data structure: shape property, heap invariant, and array representation
- How Python's
heapqmodule implements a min-heap using a plain list - The index mapping formula: parent at
i, children at2i+1and2i+2 heapq.heappush(),heapq.heappop(),heapq.heapify()- internals and complexity- Max-heap simulation: the negation pattern and its pitfalls
heapq.nlargest()andheapq.nsmallest()- when they beat full sortingqueue.PriorityQueue- thread-safe priority queue and when to use it- Heap-based algorithmic patterns: top-K, merge K sorted lists, running median
- Real-world applications: Dijkstra's shortest path, event simulation, task scheduling
Prerequisites
- Python lists internals - dynamic arrays and indexing (Lesson 1)
- Time complexity fundamentals - O(log n) and O(n) (Lesson 9)
- Basic Python classes and tuples
Part 1 - The Binary Heap Data Structure
What Is a Heap?
A heap is a complete binary tree that satisfies the heap property. "Complete" means all levels are fully filled except possibly the last, which is filled left to right.
Rules: Parent(i) ≤ Child_left(i) and Parent(i) ≤ Child_right(i) always. The minimum is always at the root. No other ordering is guaranteed - sibling order is arbitrary (note: 3 is left of 2, but that is fine).
Note: 3 is left of 2, but that is fine. The heap only guarantees parent ≤ children, not left ≤ right.
Max-Heap
The maximum is always at the root. Python's heapq implements only min-heaps.
Why Binary? Why Complete?
A binary tree with the "complete" constraint maps perfectly to a flat array - no pointers needed. This is the key insight that makes heaps memory-efficient.
Part 2 - Heap as a Flat List: The Index Mapping
Python's heapq does not use a tree class. It uses a regular Python list with a mathematical index mapping.
Heap stored as flat array (0-indexed):
Index: 0 1 2 3 4 5 6 7 8
Array: [1, 3, 2, 7, 4, 5, 6, 8, 9]
For element at index i:
| Relationship | Formula | Example (i=1, value=3) |
|---|---|---|
| Parent | (i - 1) // 2 | (1-1)//2 = 0 → value 1 ✓ |
| Left child | 2 * i + 1 | 2*1+1 = 3 → value 7 ✓ |
| Right child | 2 * i + 2 | 2*1+2 = 4 → value 4 ✓ |
The flat array representation:
- Requires zero pointers (unlike linked tree nodes)
- Is cache-friendly - children are at predictable offsets
- Enables O(1) parent and child access via index arithmetic
import heapq
# heapq uses a regular list - you can inspect it directly
heap = []
for val in [5, 3, 8, 1, 4, 2, 7]:
heapq.heappush(heap, val)
print(heap) # [1, 3, 2, 4, 5, 8, 7]
# ↑
# Always the minimum (root at index 0)
# Verify heap invariant: every parent ≤ both children
def verify_heap(h):
for i in range(1, len(h)):
parent_idx = (i - 1) // 2
assert h[parent_idx] <= h[i], f"Violated at index {i}: parent={h[parent_idx]}, child={h[i]}"
print(f"Heap invariant holds for {len(h)} elements")
verify_heap(heap) # Heap invariant holds for 7 elements
Part 3 - heapq Operations: Internals and Complexity
heapq.heappush() - Insert and Sift Up
Inserting into a heap: place the new element at the end of the array (maintaining the "complete" shape), then sift up - repeatedly swap with parent until the heap invariant is restored.
heappush(heap, 0) - inserting 0 into [1, 3, 2, 7, 4, 5, 6]:
Step 1: Append to end (index 7)
[1, 3, 2, 7, 4, 5, 6, 0]
Step 2: Compare 0 with parent at (7-1)//2 = 3 → value 7
0 < 7 → swap
[1, 3, 2, 0, 4, 5, 6, 7]
Step 3: Compare 0 with parent at (3-1)//2 = 1 → value 3
0 < 3 → swap
[1, 0, 2, 3, 4, 5, 6, 7]
Step 4: Compare 0 with parent at (1-1)//2 = 0 → value 1
0 < 1 → swap
[0, 1, 2, 3, 4, 5, 6, 7]
Step 5: Reached root - done
Result: [0, 1, 2, 3, 4, 5, 6, 7]
Maximum swaps = height of tree = log₂(n)
Complexity: O(log n)
heapq.heappop() - Remove Minimum and Sift Down
Removing the minimum: take the root (index 0), move the last element to the root, then sift down - repeatedly swap with the smaller child until the heap invariant is restored.
heappop([0, 1, 2, 3, 4, 5, 6, 7]) → returns 0
Step 1: Save root (0), move last element to root
[7, 1, 2, 3, 4, 5, 6]
Step 2: Compare 7 with children: left=1 (idx 1), right=2 (idx 2)
Smaller child is 1. 7 > 1 → swap with left child
[1, 7, 2, 3, 4, 5, 6]
Step 3: Compare 7 with children: left=3 (idx 3), right=4 (idx 4)
Smaller child is 3. 7 > 3 → swap with left child
[1, 3, 2, 7, 4, 5, 6]
Step 4: Compare 7 with children: left=8 (idx 7) - doesn't exist
No children - done
Result: [1, 3, 2, 7, 4, 5, 6], returned value: 0
Maximum swaps = height of tree = log₂(n)
Complexity: O(log n)
heapq.heapify() - Build Heap from List: O(n)
You might expect building a heap from n elements to cost O(n log n) - push each element one by one. But heapify() uses a bottom-up approach that is provably O(n).
heapify([5, 3, 8, 1, 4, 2, 7])
Start from the last non-leaf (index n//2 - 1) and sift down:
Process index 2 (value 8): children are 2,7 - 2 < 8 → swap
Process index 1 (value 3): children are 1,4 - 1 < 3 → swap
Process index 0 (value 5): children after swaps - sift down
Result: [1, 3, 2, 5, 4, 8, 7] ← valid min-heap
Why O(n)?
Most nodes are near the leaves. Leaf nodes do O(1) work (no children to sift to).
Only the root does O(log n) work.
Sum: n/2 × 0 + n/4 × 1 + n/8 × 2 + ... = O(n) (mathematical proof via geometric series)
heapify() is O(n) - always use it instead of n × heappush() when starting from a list.
import heapq
data = [5, 3, 8, 1, 4, 2, 7]
# heapify: O(n)
heap = data[:]
heapq.heapify(heap)
print(heap) # [1, 3, 2, 5, 4, 8, 7]
# Equivalent but O(n log n) - slower:
heap2 = []
for x in data:
heapq.heappush(heap2, x)
print(heap2) # [1, 3, 2, 5, 4, 8, 7] - same result, slower construction
# Always use heapify() when you have the full list upfront
Complete Complexity Table
| Operation | Complexity | Notes |
|---|---|---|
heapq.heappush(heap, x) | O(log n) | Append + sift up |
heapq.heappop(heap) | O(log n) | Remove root + sift down |
heapq.heapify(lst) | O(n) | Bottom-up heap construction |
heap[0] (peek minimum) | O(1) | Root is always at index 0 |
heapq.heappushpop(heap, x) | O(log n) | Push then pop - more efficient |
heapq.heapreplace(heap, x) | O(log n) | Pop then push - even more efficient |
heapq.nlargest(k, data) | O(n log k) | Maintains k-element heap |
heapq.nsmallest(k, data) | O(n log k) | Maintains k-element heap |
len(heap) | O(1) | Python list len |
import heapq
# Basic heap operations
heap = []
# Push elements
for val in [5, 3, 8, 1, 4, 2, 7]:
heapq.heappush(heap, val)
print(f"Heap: {heap}") # [1, 3, 2, 5, 4, 8, 7]
print(f"Minimum: {heap[0]}") # 1 - peek without removing
# Pop elements (always in ascending order from min-heap)
popped = []
while heap:
popped.append(heapq.heappop(heap))
print(f"Popped: {popped}") # [1, 2, 3, 4, 5, 7, 8]
# heappushpop: push new element then pop minimum
# More efficient than heappush + heappop separately
heap = [1, 3, 5, 7, 9]
heapq.heapify(heap)
result = heapq.heappushpop(heap, 4)
print(f"Pushed 4, popped: {result}") # 1 (minimum)
print(f"Heap after: {heap}") # [3, 4, 5, 7, 9]
# heapreplace: pop minimum first, then push new element
# Guaranteed that new element is placed (unlike heappushpop which may pop it)
heap = [1, 3, 5, 7, 9]
heapq.heapify(heap)
result = heapq.heapreplace(heap, 0)
print(f"Replaced, popped: {result}") # 1
print(f"Heap after: {heap}") # [0, 3, 5, 7, 9]
Part 4 - Max-Heap: The Negation Pattern
Python's heapq only implements min-heaps. To simulate a max-heap, negate all values.
import heapq
# Min-heap (default): smallest element first
min_heap = [3, 1, 4, 1, 5, 9, 2, 6]
heapq.heapify(min_heap)
print(heapq.heappop(min_heap)) # 1 - smallest
# Max-heap simulation: negate values
values = [3, 1, 4, 1, 5, 9, 2, 6]
max_heap = [-x for x in values] # negate all values
heapq.heapify(max_heap)
print(-heapq.heappop(max_heap)) # 9 - largest (negate back)
# Or push/pop with negation
max_heap = []
for x in [3, 1, 4, 1, 5, 9, 2, 6]:
heapq.heappush(max_heap, -x) # store negated
largest = -heapq.heappop(max_heap)
print(largest) # 9
:::warning Max-Heap Negation Pitfall with Complex Objects Negation works cleanly for numbers. For tuples used in priority queues, negate only the priority field:
# Correct: negate priority, keep rest unchanged
heapq.heappush(max_heap, (-priority, item))
priority, item = heapq.heappop(max_heap)
priority = -priority # restore original value
For objects that cannot be negated (strings, custom objects), use queue.PriorityQueue or invert your priority scale (e.g., lower number = higher priority).
:::
Part 5 - heapq.nlargest and heapq.nsmallest
These functions efficiently find the k largest or smallest elements without fully sorting.
import heapq
data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
# 3 largest
print(heapq.nlargest(3, data)) # [9, 6, 5]
# 3 smallest
print(heapq.nsmallest(3, data)) # [1, 1, 2]
# With key function
records = [
{"name": "Alice", "score": 95},
{"name": "Bob", "score": 72},
{"name": "Carol", "score": 88},
{"name": "Dave", "score": 99},
{"name": "Eve", "score": 81},
]
top_3 = heapq.nlargest(3, records, key=lambda r: r["score"])
for r in top_3:
print(f"{r['name']}: {r['score']}")
# Dave: 99
# Alice: 95
# Carol: 88
When nlargest/nsmallest Beat Full Sort
| Condition | Best Approach | Complexity |
|---|---|---|
| k == 1 (single min/max) | min() or max() | O(n) |
| k ≪ n (much smaller than n) | heapq.nlargest / nsmallest | O(n log k) |
| k ≈ n (close to n) | sorted()[-k:] | O(n log n) |
| k == n (all elements) | sorted() | O(n log n) |
CPython's heapq.nlargest uses this logic internally: k=1 uses max(), k >= n/2 uses sorted(), otherwise heap-based O(n log k).
import heapq
import timeit
import random
data = random.sample(range(10_000_000), 1_000_000)
for k in [10, 1000, 10000, 100000]:
t_heap = timeit.timeit(
f"heapq.nlargest({k}, data)",
globals={"heapq": heapq, "data": data}, number=5
) / 5
t_sort = timeit.timeit(
f"sorted(data, reverse=True)[:{k}]",
globals={"data": data}, number=5
) / 5
ratio = t_sort / t_heap
print(f"k={k:>7,} heap={t_heap:.3f}s sort={t_sort:.3f}s ratio={ratio:.2f}x")
Output (approximate):
k= 10 heap=0.067s sort=0.631s ratio=9.42x
k= 1,000 heap=0.138s sort=0.631s ratio=4.57x
k= 10,000 heap=0.284s sort=0.631s ratio=2.22x
k=100,000 heap=0.572s sort=0.631s ratio=1.10x
The heap advantage disappears as k approaches n. At k=100,000 (10% of n=1M), both are nearly equal.
Part 6 - Priority Queue with Tuples
The standard pattern for a priority queue in Python uses a tuple (priority, item). Python compares tuples lexicographically - the first element (priority) determines order.
import heapq
# Task scheduler: lower number = higher priority
task_queue = []
heapq.heappush(task_queue, (1, "Deploy hotfix"))
heapq.heappush(task_queue, (5, "Update docs"))
heapq.heappush(task_queue, (2, "Fix security bug"))
heapq.heappush(task_queue, (1, "Restart server")) # same priority as first
heapq.heappush(task_queue, (3, "Add feature"))
print("Processing order:")
while task_queue:
priority, task = heapq.heappop(task_queue)
print(f" Priority {priority}: {task}")
Output:
Processing order:
Priority 1: Deploy hotfix
Priority 1: Restart server
Priority 2: Fix security bug
Priority 3: Add feature
Priority 5: Update docs
Handling Equal Priorities: The Counter Trick
When two items have equal priority and the item itself is not comparable (e.g., a custom object), Python will try to compare the items and may raise TypeError.
import heapq
import itertools
# Problem: if priorities are equal, Python compares the second tuple element
# If items don't support comparison, this raises TypeError
class Task:
def __init__(self, name):
self.name = name
# No __lt__ defined
counter = itertools.count() # Unique, ever-increasing integer
task_queue = []
def push_task(priority, task):
# (priority, unique_counter, task)
# Counter ensures no two entries are identical - comparison stops at counter
heapq.heappush(task_queue, (priority, next(counter), task))
push_task(1, Task("Deploy hotfix"))
push_task(1, Task("Restart server")) # Same priority - counter prevents TypeError
push_task(2, Task("Fix bug"))
while task_queue:
priority, _, task = heapq.heappop(task_queue)
print(f" Priority {priority}: {task.name}")
Output:
Priority 1: Deploy hotfix
Priority 1: Restart server
Priority 2: Fix bug
The counter acts as a FIFO tiebreaker - when priorities are equal, earlier-pushed items come out first (counter value is smaller).
Part 7 - queue.PriorityQueue: Thread-Safe Alternative
For multi-threaded applications, use queue.PriorityQueue instead of heapq. It wraps a heap with locking.
from queue import PriorityQueue
import threading
import time
pq = PriorityQueue()
def producer(name, items):
for priority, task in items:
pq.put((priority, task))
time.sleep(0.01)
def consumer(name):
while True:
priority, task = pq.get() # blocks until item available
print(f"{name} processing priority={priority}: {task}")
pq.task_done()
if task == "STOP":
break
# Thread-safe: multiple producers can push simultaneously
t1 = threading.Thread(target=producer, args=("Producer1", [(1, "critical"), (3, "low")]))
t2 = threading.Thread(target=producer, args=("Producer2", [(2, "medium"), (1, "urgent")]))
tc = threading.Thread(target=consumer, args=("Consumer",))
t1.start(); t2.start(); tc.start()
t1.join(); t2.join()
pq.put((99, "STOP")) # Signal consumer to stop
tc.join()
| Feature | heapq | queue.PriorityQueue |
|---|---|---|
| Thread safety | NOT thread-safe | Thread-safe (locked) |
| Blocking get | No (raises IndexError) | Yes (blocks) |
| Max size limit | No | Yes (maxsize param) |
| Performance | Faster (no lock) | Slower (lock overhead) |
| Use case | Single-threaded | Multi-threaded |
| Direct list access | Yes (heap[0]) | No |
:::note Use heapq for Single-Threaded, PriorityQueue for Multi-Threaded
heapq is faster because it avoids lock overhead - use it in single-threaded algorithms (Dijkstra's, event simulation, batch processing). queue.PriorityQueue is the right choice when multiple threads produce and consume tasks concurrently - its blocking get() and put() methods prevent race conditions.
:::
Part 8 - Heap Patterns in Practice
Pattern 1: Streaming Top-K (Fixed Memory)
Process a stream of unknown length, maintaining only the k largest elements seen so far.
import heapq
def top_k_stream(stream, k):
"""
Maintain top-k largest elements from a stream.
Memory: O(k) regardless of stream length.
Time: O(n log k) total.
"""
heap = [] # min-heap of size k
for item in stream:
if len(heap) < k:
heapq.heappush(heap, item)
elif item > heap[0]: # item is larger than current minimum in top-k
heapq.heapreplace(heap, item) # pop min, push item (single O(log k) op)
return sorted(heap, reverse=True) # O(k log k) - small
# Simulate processing 10M streaming numbers
import random
stream = (random.randint(0, 1_000_000) for _ in range(10_000_000))
result = top_k_stream(stream, k=10)
print(f"Top 10: {result}")
# e.g., [999998, 999991, 999988, ...]
# Memory used: only 10 integers - regardless of stream size
Pattern 2: Merge K Sorted Lists
A classic interview problem and common data pipeline operation.
import heapq
def merge_k_sorted(lists):
"""
Merge k sorted lists into one sorted list.
Time: O(n log k) where n = total elements, k = number of lists
Memory: O(k) - only one element from each list in heap at a time
Algorithm:
1. Initialize heap with first element from each list (with list index)
2. Pop minimum → add to result
3. Push next element from same list
4. Repeat until heap empty
"""
heap = []
# Heap entry: (value, list_index, element_index)
for i, lst in enumerate(lists):
if lst:
heapq.heappush(heap, (lst[0], i, 0))
result = []
while heap:
val, list_idx, elem_idx = heapq.heappop(heap)
result.append(val)
next_idx = elem_idx + 1
if next_idx < len(lists[list_idx]):
next_val = lists[list_idx][next_idx]
heapq.heappush(heap, (next_val, list_idx, next_idx))
return result
# Example: merging 4 sorted lists
lists = [
[1, 4, 7, 10],
[2, 5, 8, 11],
[3, 6, 9, 12],
[0, 13, 14, 15],
]
merged = merge_k_sorted(lists)
print(merged)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
# Real-world: merging sorted partitions from distributed sort,
# merging sorted log files from multiple servers
Pattern 3: Running Median with Two Heaps
Maintain the median of a stream efficiently without re-sorting on each insert.
import heapq
class RunningMedian:
"""
Maintains running median of a stream.
Time: O(log n) per insert, O(1) per query.
Memory: O(n)
Algorithm:
- max_heap: lower half of data (negated for max-heap behavior)
- min_heap: upper half of data
- Invariant: max_heap has same size or one more element than min_heap
- Median = max_heap[0] (if sizes differ) or average of both tops (if equal)
"""
def __init__(self):
self.max_heap = [] # lower half - stored negated (max-heap simulation)
self.min_heap = [] # upper half - actual values (min-heap)
def add(self, num):
# Push to max_heap (lower half)
heapq.heappush(self.max_heap, -num)
# Balance: ensure max_heap's max ≤ min_heap's min
if self.max_heap and self.min_heap and (-self.max_heap[0] > self.min_heap[0]):
val = -heapq.heappop(self.max_heap)
heapq.heappush(self.min_heap, val)
# Rebalance sizes: max_heap can have at most one more element than min_heap
if len(self.max_heap) > len(self.min_heap) + 1:
val = -heapq.heappop(self.max_heap)
heapq.heappush(self.min_heap, val)
elif len(self.min_heap) > len(self.max_heap):
val = heapq.heappop(self.min_heap)
heapq.heappush(self.max_heap, -val)
def get_median(self):
if not self.max_heap:
return None
if len(self.max_heap) > len(self.min_heap):
return float(-self.max_heap[0])
return (-self.max_heap[0] + self.min_heap[0]) / 2.0
rm = RunningMedian()
stream = [5, 2, 8, 1, 9, 3, 7, 4, 6]
for num in stream:
rm.add(num)
print(f"Added {num:2d} → median = {rm.get_median()}")
Output:
Added 5 → median = 5.0
Added 2 → median = 3.5
Added 8 → median = 5.0
Added 1 → median = 3.5
Added 9 → median = 5.0
Added 3 → median = 4.0
Added 7 → median = 5.0
Added 4 → median = 4.5
Added 6 → median = 5.0
Pattern 4: Dijkstra's Shortest Path
The canonical graph algorithm that requires a priority queue.
import heapq
from collections import defaultdict
def dijkstra(graph, start):
"""
Find shortest paths from start to all reachable nodes.
Time: O((V + E) log V) where V = vertices, E = edges
Args:
graph: dict of {node: [(neighbor, weight), ...]}
start: starting node
Returns:
dist: dict of {node: shortest_distance_from_start}
"""
# Priority queue: (distance, node)
heap = [(0, start)]
dist = {start: 0}
while heap:
current_dist, node = heapq.heappop(heap)
# If we already found a shorter path to this node, skip
if current_dist > dist.get(node, float("inf")):
continue
for neighbor, weight in graph.get(node, []):
new_dist = current_dist + weight
if new_dist < dist.get(neighbor, float("inf")):
dist[neighbor] = new_dist
heapq.heappush(heap, (new_dist, neighbor))
return dist
# Example: city road network
graph = {
"A": [("B", 4), ("C", 2)],
"B": [("D", 3), ("C", 1)],
"C": [("B", 1), ("D", 5)],
"D": [],
}
distances = dijkstra(graph, "A")
print("Shortest distances from A:")
for node, dist in sorted(distances.items()):
print(f" A → {node}: {dist}")
Output:
Shortest distances from A:
A → A: 0
A → B: 3 (A→C→B: 2+1=3, shorter than A→B: 4)
A → C: 2
A → D: 6 (A→C→B→D: 2+1+3=6, or A→C→D: 2+5=7, minimum is 6)
Part 9 - Event-Driven Simulation
Heaps power discrete event simulation - systems where events are processed in chronological order.
import heapq
from dataclasses import dataclass, field
from typing import Any
@dataclass(order=True)
class Event:
time: float
priority: int # tiebreaker for same-time events
action: Any = field(compare=False) # not compared - use time + priority
class EventSimulator:
"""
Discrete event simulation engine.
Events are processed in time order using a min-heap.
"""
def __init__(self):
self.heap = []
self.counter = 0 # unique tiebreaker
self.current_time = 0.0
def schedule(self, time, action, priority=0):
"""Schedule an event at the given time."""
event = Event(time=time, priority=priority, action=action)
heapq.heappush(self.heap, event)
def run(self, until=float("inf")):
"""Process all events up to `until` time."""
while self.heap and self.heap[0].time <= until:
event = heapq.heappop(self.heap)
self.current_time = event.time
print(f"t={event.time:.1f}: {event.action}")
# In a real simulator, the action function would schedule new events
# Simulation: server processing requests
sim = EventSimulator()
sim.schedule(0.0, "Server starts")
sim.schedule(0.5, "Request #1 arrives")
sim.schedule(0.8, "Request #2 arrives")
sim.schedule(1.2, "Request #1 processed")
sim.schedule(1.5, "Request #3 arrives")
sim.schedule(2.0, "Request #2 processed")
sim.schedule(2.3, "Request #3 processed")
sim.run(until=5.0)
Output:
t=0.0: Server starts
t=0.5: Request #1 arrives
t=0.8: Request #2 arrives
t=1.2: Request #1 processed
t=1.5: Request #3 arrives
t=2.0: Request #2 processed
t=2.3: Request #3 processed
Interview Questions
Q1: What is the heap invariant and how does Python's heapq maintain it?
Answer: The heap invariant for a min-heap: every parent element must be less than or equal to both its children. Python's heapq maintains this invariant through two operations: sift-up (after insertion) and sift-down (after removal). After heappush: the new element is appended to the end of the list, then repeatedly swapped with its parent (index (i-1)//2) until it is larger than its parent or reaches the root - O(log n) swaps maximum. After heappop: the minimum (index 0) is returned, the last element is placed at the root, then it is repeatedly swapped with its smaller child until both children are larger or it reaches a leaf - O(log n) swaps maximum. The invariant is never violated externally because the list is modified only through heapq functions.
Q2: Why is heapify() O(n) rather than O(n log n)?
Answer: A naive approach - pushing n elements one by one - costs O(n log n). heapify() uses a bottom-up algorithm: it starts from the last non-leaf node (index n//2 - 1) and calls sift-down on each node moving toward the root. The key insight: nodes near the leaves (the majority) have almost no sifting to do. Leaf nodes do O(1) work. Nodes at height 1 do O(1) sift-down. Nodes at height h do O(h) work. The total cost: Σ (n / 2^h) × h for h from 0 to log(n) = O(n). This geometric series sums to O(n). In contrast, inserting n elements one-by-one works top-down: each insertion may sift from leaf to root - O(log n) per element, O(n log n) total.
Q3: How do you implement a max-heap using Python's heapq?
Answer: Python's heapq only provides a min-heap. The standard workaround is value negation: store -x instead of x. The smallest negated value corresponds to the largest original value. To push: heapq.heappush(heap, -x). To pop the maximum: max_val = -heapq.heappop(heap). To peek: max_val = -heap[0]. This works correctly for integers and floats. For tuples (priority queues), negate only the priority field: heapq.heappush(heap, (-priority, item)). For custom objects that cannot be negated, you must implement a wrapper class with an inverted __lt__, or use a different data structure like sortedcontainers.SortedList.
Q4: When should you use heapq.nlargest(k, data) instead of sorted(data, reverse=True)[:k]?
Answer: Use heapq.nlargest(k, data) when k is significantly smaller than n. nlargest maintains a min-heap of size k, scanning all n elements once - O(n log k). sorted()[:k] sorts all n elements first - O(n log n). The crossover point: when k approaches n/2, the constant factors make sorted() competitive. CPython's heapq.nlargest internally uses sorted() when k >= n/2. Rule of thumb: for k less than ~10-20% of n, use nlargest; for larger k or when you need the full sorted list anyway, use sorted(). For k=1, always use max() - O(n) with minimal overhead.
Q5: Explain why a heap is more appropriate than a sorted list for a real-time task scheduler that receives continuous new tasks.
Answer: A sorted list supports O(1) access to the minimum (first element) but O(n) insertion - you must find the correct position and shift elements. A heap supports O(log n) for both insertion and minimum retrieval. For a scheduler receiving n tasks per second continuously: sorted list insertion = O(n) per task → O(n²) total for n tasks. Heap insertion = O(log n) per task → O(n log n) total. The heap's O(log n) insertion is the critical advantage in dynamic scenarios where the data is constantly changing. If you received all tasks at once (static scenario), heapify + heappop = O(n) + O(n log n) vs sorted list: the same. But schedulers are never static - they receive tasks continuously while processing others.
Q6: Describe the "merge K sorted lists" problem and explain why a heap solves it efficiently.
Answer: Given k sorted lists with n total elements, merge them into one sorted list. Naive approach: concatenate all lists and sort - O(n log n). Better: use a min-heap of size k. Initialize with the first element from each list. Repeat: pop the minimum (O(log k)), add to result, push the next element from the same list (O(log k)). Total: n pop+push pairs, each O(log k) → O(n log k). Since k ≤ n, this is always at least as good as O(n log n) and better when k << n. Real-world applications: merging sorted log files from multiple servers, merging sorted database partitions after a distributed sort (external sort), combining sorted result streams from multiple API sources. The heap maintains exactly the "frontier" - one candidate from each source - allowing globally correct ordering with minimal memory (O(k) heap size vs O(n) for full concatenation).
Practice Challenges
Beginner: Basic Heap Operations
Predict the output of each heappop call and explain why.
import heapq
data = [10, 5, 8, 3, 1, 7, 2]
heapq.heapify(data)
print(heapq.heappop(data))
print(heapq.heappop(data))
heapq.heappush(data, 4)
print(heapq.heappop(data))
print(data[0]) # peek without popping
Solution
Output:
1
2
3
4
Explanation:
After heapify([10, 5, 8, 3, 1, 7, 2]), the heap looks like [1, 3, 2, 5, 10, 7, 8] (or similar valid min-heap arrangement - exact layout varies, but 1 is always at index 0).
- First
heappop(): removes and returns 1 (the global minimum). Heap rearranges. - Second
heappop(): removes and returns 2 (next minimum). heappush(data, 4): inserts 4 - heap adjusts to maintain invariant. Heap now has elements {3, 4, 5, 7, 8, 10}.- Third
heappop(): removes and returns 3 (current minimum). data[0]: peeks at root without removing - returns 4 (next minimum after 3 was removed).
Key insight: heappop() always returns the current minimum, but the internal array layout is not necessarily sorted - only the root is guaranteed to be the minimum.
Intermediate: K-th Largest Element
Implement a function that finds the k-th largest element in a list using a heap. It should be more efficient than fully sorting the list.
data = [3, 2, 1, 5, 6, 4]
k = 2
# Expected: 5 (the 2nd largest element)
Solution
import heapq
def kth_largest(nums, k):
"""
Find the k-th largest element in nums.
Strategy: maintain a min-heap of size k.
- The heap always contains the k largest elements seen so far.
- The minimum of those k elements is the k-th largest.
- heap[0] is the k-th largest.
Time: O(n log k)
Space: O(k)
"""
heap = [] # min-heap of size k
for num in nums:
heapq.heappush(heap, num)
if len(heap) > k:
heapq.heappop(heap) # remove the smallest to maintain size k
# heap[0] is the k-th largest (smallest of the top-k)
return heap[0]
# Tests
print(kth_largest([3, 2, 1, 5, 6, 4], k=2)) # 5
print(kth_largest([3, 2, 3, 1, 2, 4, 5, 5, 6], k=4)) # 4
# Alternative: heapq.nlargest (same complexity, less code)
def kth_largest_v2(nums, k):
return heapq.nlargest(k, nums)[-1]
print(kth_largest_v2([3, 2, 1, 5, 6, 4], k=2)) # 5
Why a min-heap of size k works:
- We keep the k largest elements seen so far in the heap
- The heap's minimum (
heap[0]) is the smallest of those k elements - That is by definition the k-th largest element overall
- When we see a new element larger than
heap[0], we pop the current k-th largest and push the new one
Advanced: Design a Task Scheduler with Priorities and Dependencies
Design a task scheduler that processes tasks by priority, but only schedules a task when all its dependencies are complete.
tasks = {
"A": {"priority": 1, "deps": []},
"B": {"priority": 2, "deps": ["A"]},
"C": {"priority": 1, "deps": ["A"]},
"D": {"priority": 3, "deps": ["B", "C"]},
"E": {"priority": 2, "deps": []},
}
# Expected execution order based on priority (lower = higher priority)
# and dependency constraints
Solution
import heapq
from collections import defaultdict
def schedule_tasks(tasks):
"""
Topological sort with priority queue.
Algorithm:
1. Build in-degree count and reverse dependency graph
2. Initialize heap with all tasks that have no dependencies (in-degree 0)
3. Pop the highest-priority task (lowest priority number)
4. Mark it complete, decrement in-degree of dependents
5. Push any dependent tasks that now have in-degree 0
6. Repeat until heap empty or tasks remain (cycle detection)
Time: O((V + E) log V) - each task pushed/popped once, each edge processed once
Space: O(V + E)
"""
# Build dependency tracking
in_degree = {task: len(info["deps"]) for task, info in tasks.items()}
dependents = defaultdict(list) # task → list of tasks that depend on it
for task, info in tasks.items():
for dep in info["deps"]:
dependents[dep].append(task)
# Priority queue: (priority, task_name)
heap = []
for task, degree in in_degree.items():
if degree == 0: # No dependencies - ready to schedule immediately
heapq.heappush(heap, (tasks[task]["priority"], task))
execution_order = []
while heap:
priority, task = heapq.heappop(heap)
execution_order.append(task)
print(f"Executing: {task} (priority={priority})")
# Unlock dependent tasks
for dependent in dependents[task]:
in_degree[dependent] -= 1
if in_degree[dependent] == 0:
dep_priority = tasks[dependent]["priority"]
heapq.heappush(heap, (dep_priority, dependent))
# Check for cycles
if len(execution_order) < len(tasks):
remaining = set(tasks) - set(execution_order)
raise ValueError(f"Circular dependency detected among: {remaining}")
return execution_order
tasks = {
"A": {"priority": 1, "deps": []},
"B": {"priority": 2, "deps": ["A"]},
"C": {"priority": 1, "deps": ["A"]},
"D": {"priority": 3, "deps": ["B", "C"]},
"E": {"priority": 2, "deps": []},
}
order = schedule_tasks(tasks)
print(f"\nFinal order: {order}")
Output:
Executing: A (priority=1) ← A and E both ready; A has priority 1 (lower = higher priority)
Executing: E (priority=2) ← E was ready from start, priority 2
Executing: C (priority=1) ← A completed; C and B unlocked; C has priority 1
Executing: B (priority=2) ← B unlocked; priority 2
Executing: D (priority=3) ← D unlocked after both B and C; priority 3
Final order: ['A', 'E', 'C', 'B', 'D']
This pattern powers CI/CD pipeline scheduling, build system parallelization (Makefile, Bazel), and workflow orchestration systems (Airflow, Prefect).
Quick Reference
| Operation | Code | Complexity |
|---|---|---|
| Create heap from list | heapq.heapify(lst) | O(n) |
| Push element | heapq.heappush(heap, x) | O(log n) |
| Pop minimum | heapq.heappop(heap) | O(log n) |
| Peek minimum | heap[0] | O(1) |
| Push then pop | heapq.heappushpop(heap, x) | O(log n) |
| Pop then push | heapq.heapreplace(heap, x) | O(log n) |
| K largest | heapq.nlargest(k, data) | O(n log k) |
| K smallest | heapq.nsmallest(k, data) | O(n log k) |
| Max-heap push | heapq.heappush(heap, -x) | O(log n) |
| Max-heap pop | -heapq.heappop(heap) | O(log n) |
| Thread-safe priority queue | queue.PriorityQueue() | O(log n) + lock |
| Equal-priority tiebreaker | (priority, next(counter), item) | - |
| Pattern | When to Use |
|---|---|
heapq min-heap | Single-threaded priority retrieval |
| Negation max-heap | When you need maximum-first order |
(priority, counter, item) | Safe tuples with non-comparable items |
queue.PriorityQueue | Multi-threaded producer-consumer |
nlargest(k) | Top-k elements, k << n |
| Two-heap median | Running median of stream |
| Heap + in-degree | Topological sort with priorities |
| Heap in Dijkstra | Shortest path in weighted graphs |
Key Takeaways
- A heap is not sorted - only the minimum (root) is guaranteed to be the global minimum; siblings can be in any order; the flat list is not sorted
heapify()is O(n), not O(n log n) - always use it when constructing a heap from an existing list; never use n individualheappush()calls when you have the full data upfront- Python's heapq is always a min-heap - simulate max-heap by negating values; for tuples, negate only the priority field:
heapq.heappush(heap, (-priority, item)) nlargest(k)beatssorted()whenk << n-O(n log k)vsO(n log n); for k=1 usemax(); for k ≥ n/2,sorted()is better (CPython'snlargesthandles this automatically)- Use a counter tuple
(priority, counter, item)for safe equal-priority handling - preventsTypeErrorwhen the item type does not support comparison, and provides FIFO behavior as a tiebreaker - Use
queue.PriorityQueuefor multi-threaded systems -heapqis not thread-safe;PriorityQueuewraps it with locks and blockingget()/put()for safe producer-consumer patterns - Heap invariant: for any node at index
i, its parent at(i-1)//2is always ≤ both children at2i+1and2i+2; this enables O(log n) insert and extract-min without maintaining full sorted order
