Skip to main content

Dictionary Hashing Mechanism - How Python Dicts Work Internally

Reading time: ~22 minutes | Level: Foundation → Engineering

Opening Challenge

Before reading further, predict every line of output:

d = {}
d[1] = "int one"
d[1.0] = "float one"
d[True] = "bool true"

print(len(d))
print(d)

print(hash(1), hash(1.0), hash(True))

import os
os.environ['PYTHONHASHSEED'] = '0' # Has no effect at runtime - why?

words = ["banana", "apple", "cherry"]
d2 = {w: len(w) for w in words}
print(list(d2.keys()))

d3 = {"a": 1}
d4 = {"b": 2}
merged = d3 | d4
print(merged)

try:
bad = {[1, 2]: "value"}
except TypeError as e:
print(e)

for k in d2:
pass # Is it safe to read during iteration? Yes. Modify? No.
1
{1: 'bool true'}
1 1 1
['banana', 'apple', 'cherry']
{'a': 1, 'b': 2}
unhashable type: 'list'

The first four lines are the most surprising. d has only one entry even though three different keys were inserted. And the final value is 'bool true'. This reveals something deep about Python's hashing and equality contract - the same thing that explains why your custom classes may silently break dictionaries if you implement __eq__ but forget __hash__. This lesson explains exactly how dicts work under the hood, slot by slot.

What You Will Learn

  • The CPython compact hash table layout introduced in Python 3.6 - indices array plus entries array
  • How hash() maps a key to a slot index, step by step
  • CPython's collision resolution strategy: open addressing with perturbation-based probing
  • Why dicts are O(1) average for get/set/delete and O(n) worst case
  • The insertion order guarantee (Python 3.7+ language spec) and how the compact layout achieves it without overhead
  • The full dict API: .get(), .setdefault(), .update(), .pop(), .items(), .keys(), .values()
  • Dict comprehensions, merging strategies, and the Python 3.9+ | operator
  • The hashability contract: what makes a valid key, implementing __hash__ and __eq__ correctly
  • Python namespaces as dicts: vars(), locals(), globals()
  • Load factor, resizing mechanics, and memory cost
  • Common pitfalls: mutable keys, modifying during iteration, {} vs set(), hash randomization

Prerequisites

You should understand Python objects, basic data types, and how == and is differ. Familiarity with the concept of arrays (random-access indexed sequences) is sufficient background for the memory layout sections.

What Is a Dict? The Abstract View

A Python dict is a mapping - an associative data structure that maps keys to values. The defining property is O(1) average-case lookup: given a key, you retrieve the value in constant time regardless of how many entries the dict contains.

# Dict literal
config = {
"host": "localhost",
"port": 5432,
"name": "appdb"
}

# Retrieval is O(1) whether the dict has 3 entries or 3 million
host = config["host"]

The implementation that delivers O(1) lookup is a hash table. But Python's hash table has evolved significantly, and the modern CPython 3.6+ implementation is materially different from the classical open-addressing scheme you might have studied in an algorithms course.

The Hash Function: From Object to Integer to Slot

The process of looking up a key in a dict follows these steps:

hash() is a built-in that dispatches to the object's __hash__ method. For built-in types, these are implemented in C and are extremely fast.

note

Python caches the hash of a string inside the string object itself after the first call. Subsequent hash(s) calls on the same string object return the cached integer without recomputation. This makes string key lookups as fast as possible since string keys dominate real-world dict usage.

The Compact Dict Layout (Python 3.6+)

Before Python 3.6, CPython used a "classic" hash table: a single flat array of (hash, key, value) triples, with many empty slots interspersed. This wasted memory and, critically, did not preserve insertion order.

In Python 3.6 (CPython implementation detail) and Python 3.7 (language guarantee), the layout was redesigned by Raymond Hettinger:

Classic layout (pre-3.6) - single flat array of (hash, key, value) triples:

Slot 0: EMPTY
Slot 1: (hash=-1234, key="host", value="localhost")
Slot 2: EMPTY
Slot 3: (hash=5678, key="port", value=5432)
Slot 4: EMPTY
Slot 5: (hash=9012, key="name", value="appdb")
Slot 6: EMPTY
Slot 7: EMPTY

Problem: entries are scattered by hash - iteration order is arbitrary. Each empty slot wastes memory for a full triple.

Compact layout (3.6+) - split into a small indices array and a dense entries array:

indices_array (one integer per hash slot):
[EMPTY, 0, EMPTY, 1, EMPTY, 2, EMPTY, EMPTY]
idx 0 1 2 3 4 5 6 7

entries_array (dense, insertion-ordered):
[0] (hash=-1234, "host", "localhost") ← inserted first
[1] (hash=5678, "port", 5432) ← inserted second
[2] (hash=9012, "name", "appdb") ← inserted third

Lookup: hash("port") % 8 = 3 → indices_array[3] = 1 → entries_array[1]
Iteration: walk entries_array[0], [1], [2] → insertion order preserved!

The compact layout delivers two wins simultaneously:

  1. Memory is conserved because empty slots in the indices_array are just small integers (1-4 bytes each on modern CPython), not full (hash, key, value) triples
  2. Insertion order is preserved naturally because entries_array is append-only - entries are added sequentially

Collision Resolution: Perturbation-Based Open Addressing

When two keys hash to the same slot index, CPython resolves the collision using open addressing - it probes a sequence of alternative slots until it finds an empty one (for insertion) or the actual key (for lookup).

Python does not use simple linear probing (index + 1, index + 2, ...). Linear probing suffers from primary clustering - occupied slots tend to cluster together, making future probes longer. Instead, CPython uses a perturbation-based formula:

# CPython's probing formula (simplified from Objects/dictobject.c)
perturb = h # initial perturbation = full hash value
index = h % size # initial slot

while True:
if indices[index] is EMPTY:
break # slot is free - insert here or report not found
if entries[indices[index]].key == lookup_key:
break # found the key
# Probe next slot
perturb >>= 5
index = (index * 5 + perturb + 1) % size

The perturb value starts as the full hash integer and is right-shifted by 5 bits on each probe. This spreads probes across the entire table rather than clustering them, significantly reducing the average probe count even at high load factors.

note

Why does CPython use this particular formula? Because hash values in Python for common types (strings, integers) have good low-bit distribution but may have patterns in the high bits. The perturbation formula mixes in the high bits over successive probes, ensuring the probe sequence visits different regions of the table.

O(1) Average, O(n) Worst Case: Why and When

The O(1) average case claim rests on two assumptions:

  1. The hash function distributes keys approximately uniformly across slots
  2. The load factor (number of entries / table capacity) is kept below a threshold

CPython resizes the table when the load factor exceeds approximately 2/3. At a 2/3 load factor with good hash distribution, the expected number of probes per lookup is a small constant (approximately 1.5). This is O(1).

The O(n) worst case arises when all keys hash to the same slot - every lookup probes through n entries. With typical keys (strings, integers), this does not occur naturally. But it can be engineered deliberately:

Attack scenario (hash collision DoS):
- Attacker sends HTTP request with 1000 headers carefully crafted so all header names
hash to the same slot in CPython's web framework's dict
- Each request triggers O(n^2) work in header parsing
- Server becomes unresponsive with modest request rates

This attack was demonstrated against PHP, Ruby, Java, and Python web servers in 2011 (CVE-2011-4815 family). The response was hash randomization.

Hash Randomization: PYTHONHASHSEED

Python 3.3+ randomizes the hash seed for strings (and bytes) at interpreter startup. The seed is different on every Python process invocation, making it computationally infeasible for an attacker to predict which strings will collide.

$ python3 -c "print(hash('hello'))"
4422788892890774319

$ python3 -c "print(hash('hello'))"
-1267823897657099895

$ python3 -c "print(hash('hello'))"
3084649067052717267

Each run produces a different hash for the same string. This breaks the DoS attack because the attacker cannot know the hash seed.

You can disable randomization for reproducible testing:

PYTHONHASHSEED=0 python3 -c "print(hash('hello'))"
# Always produces the same value with seed 0
danger

Never disable PYTHONHASHSEED in production. The randomization is a security control. Setting PYTHONHASHSEED=0 in a production environment re-enables hash collision DoS attacks against your service. The os.environ['PYTHONHASHSEED'] = '0' trick in the opening challenge does nothing - the seed is read at interpreter startup from the environment, not from os.environ at runtime. You cannot change it mid-process.

hash() is still deterministic within a single process run - the same string hashes to the same value throughout the lifetime of one Python process. This is what matters for dict correctness. Between processes (or between interpreter restarts), the values differ.

Insertion Order: Python 3.7+ Language Guarantee

Prior to CPython 3.6, dict iteration order was arbitrary and implementation-specific. The compact layout in CPython 3.6 happened to preserve insertion order, and this was elevated to a language specification guarantee in Python 3.7:

d = {}
d["z"] = 1
d["a"] = 2
d["m"] = 3

print(list(d.keys())) # ['z', 'a', 'm'] - always, guaranteed since 3.7
print(list(d.values())) # [1, 2, 3]
print(list(d.items())) # [('z', 1), ('a', 2), ('m', 3)]

This replaces collections.OrderedDict for most use cases. OrderedDict still has move_to_end() and equality that considers order, but plain dict now guarantees insertion order for free.

The Full Dict API

Getting values safely:

d = {"name": "Alice", "age": 30}

# Direct access - raises KeyError if missing
name = d["name"]

# .get() - returns None or default if missing, never raises
city = d.get("city") # None
city = d.get("city", "London") # "London"

# .setdefault() - inserts default if key missing, then returns the value
d.setdefault("country", "UK") # inserts "country": "UK"
d.setdefault("name", "Bob") # "name" already exists, unchanged; returns "Alice"

Modifying:

d["age"] = 31 # update existing key
d["email"] = "[email protected]" # add new key
del d["email"] # remove key (KeyError if missing)
val = d.pop("age") # remove and return (31)
val = d.pop("age", None) # remove and return, or None if missing

Iterating - O(n) operations:

# All three views are lazy (dict_keys, dict_values, dict_items objects)
for key in d.keys(): pass # or just: for key in d:
for val in d.values(): pass
for key, val in d.items(): pass # most common pattern
warning

Dict views (d.keys(), d.values(), d.items()) reflect the current state of the dict. If you modify the dict while iterating, you get RuntimeError: dictionary changed size during iteration. If you need to modify during iteration, iterate over a snapshot: for k in list(d.keys()):.

Merging dicts:

a = {"x": 1, "y": 2}
b = {"y": 99, "z": 3}

# Unpack merge - later dict wins on conflicts
merged1 = {**a, **b} # {"x": 1, "y": 99, "z": 3}

# Python 3.9+ union operator
merged2 = a | b # {"x": 1, "y": 99, "z": 3} - a | b: b wins
a |= b # in-place: updates a with b's entries

# .update() - mutates in place
a_copy = {"x": 1, "y": 2}
a_copy.update(b) # a_copy is now {"x": 1, "y": 99, "z": 3}

Dict Comprehensions

Dict comprehensions build a dictionary from any iterable in a single expression:

# Basic comprehension
squares = {x: x**2 for x in range(6)}
# {0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

# Filtered comprehension
even_squares = {x: x**2 for x in range(10) if x % 2 == 0}

# Transforming keys and values
prices = {"apple": 1.20, "banana": 0.50, "cherry": 2.00}
discounted = {item: round(price * 0.9, 2) for item, price in prices.items()}

# Inverting a dict (assumes values are unique)
original = {"a": 1, "b": 2, "c": 3}
inverted = {v: k for k, v in original.items()}
# {1: 'a', 2: 'b', 3: 'c'}

# Grouping - building a dict of lists
from itertools import groupby

words = ["apple", "ant", "bear", "bat", "cat"]
by_letter = {}
for word in words:
by_letter.setdefault(word[0], []).append(word)
# {"a": ["apple", "ant"], "b": ["bear", "bat"], "c": ["cat"]}

The Hashability Contract: __hash__ and __eq__

The contract Python requires for any object used as a dict key is:

  1. The object must implement __hash__() - it must return an integer
  2. The object must implement __eq__() - it must compare equal to itself and to other objects that should be considered the same key
  3. If a == b then hash(a) == hash(b) must hold

Constraint 3 is the critical invariant. Here is why: when Python looks up a key, it first finds the slot by hash, then confirms identity by equality. If two equal objects had different hashes, they would be stored in different slots, and lookup would fail to find one of them.

The reverse is not required: hash(a) == hash(b) does not imply a == b. That is fine - it is exactly what a collision is. The equality check is the tiebreaker.

This is why integers and floats with the same value have the same hash:

print(1 == 1.0) # True
print(hash(1) == hash(1.0)) # True - must be equal since 1 == 1.0
print(True == 1) # True
print(hash(True) == hash(1)) # True

d = {1: "int", 1.0: "float", True: "bool"}
print(len(d)) # 1 - all three keys are equal and hash to the same value
print(d) # {1: 'bool'} - first key form retained, last value wins

Implementing __hash__ and __eq__ for custom classes:

class Point:
def __init__(self, x: float, y: float):
self.x = x
self.y = y

def __eq__(self, other) -> bool:
if not isinstance(other, Point):
return NotImplemented
return self.x == other.x and self.y == other.y

def __hash__(self) -> int:
# Must hash the same fields used in __eq__
# Use tuple hashing for a robust, collision-resistant hash
return hash((self.x, self.y))

def __repr__(self):
return f"Point({self.x}, {self.y})"

p1 = Point(3.0, 4.0)
p2 = Point(3.0, 4.0)

print(p1 == p2) # True - same coordinates
print(hash(p1) == hash(p2)) # True - required by contract

d = {p1: "first point"}
print(d[p2]) # "first point" - p2 looks up the same slot as p1
danger

Critical rule: If you define __eq__ in a class, Python automatically sets __hash__ to None, making instances unhashable. This is a safety measure - if equality is redefined, the old hash is likely wrong. You must explicitly implement __hash__ alongside __eq__ if you want instances to be hashable. Forgetting __hash__ when you define __eq__ is one of the most common sources of mysterious TypeError: unhashable type errors.

Python Namespaces Are Dicts

One of Python's foundational design decisions is that namespaces are dictionaries. When you access a variable, Python is doing a dict lookup.

x = 42
y = "hello"

# Module-level namespace
print(globals()) # shows all module-level names as a dict
print(globals()['x']) # 42 - dict lookup by name

def my_function():
local_var = 99
print(locals()) # {'local_var': 99} - function-local namespace dict

# vars() works on modules, classes, and instances
class Dog:
def __init__(self, name):
self.name = name

rex = Dog("Rex")
print(vars(rex)) # {'name': 'Rex'} - instance attribute dict

When you write rex.name, Python is doing rex.__dict__['name']. Attribute access on instances is a dict lookup. Method resolution order (MRO) and class attribute lookup are also dict-based.

This means that dictionaries are not just a useful data structure - they are the runtime substrate of Python itself. Understanding dict performance is understanding Python's execution model.

Memory Layout, Load Factor, and Resizing

A Python dict allocates more table slots than it currently needs, maintaining a load factor below 2/3. When the table reaches 2/3 capacity, CPython resizes it.

import sys

d = {}
sizes = []
for i in range(20):
d[i] = i
sizes.append(sys.getsizeof(d))

# Print when the size jumps (indicating a resize)
for i, (s1, s2) in enumerate(zip(sizes, sizes[1:])):
if s2 != s1:
print(f"Resize at entry {i+1}: {s1}{s2} bytes")
Resize at entry 5: 232 → 360 bytes
Resize at entry 10: 360 → 648 bytes
Resize at entry 19: 648 → 1160 bytes

Resizing doubles the table size (actually grows to the next power of 2 above the needed capacity). All existing entries must be rehashed into the new table - this is an O(n) operation. But it happens logarithmically rarely (O(log n) resizes for n insertions), so amortized over all insertions, resizing contributes O(1) cost per insertion.

Dicts are significantly larger than equivalent lists because of the hash table overhead:

import sys

n = 1000
d = {i: i for i in range(n)}
l = list(range(n))

print(f"Dict: {sys.getsizeof(d)} bytes") # ~36904 bytes
print(f"List: {sys.getsizeof(l)} bytes") # ~8056 bytes

The dict is roughly 4-5x larger than the list because the indices array must maintain enough empty slots for O(1) lookups, and the entries array stores both keys and values (twice the data).

Pitfalls Reference

Pitfall 1: Mutable objects as dict keys

d = {}
d[[1, 2, 3]] = "value" # TypeError: unhashable type: 'list'

# Fix: use a tuple
d[(1, 2, 3)] = "value" # OK

Pitfall 2: Modifying a dict while iterating over it

d = {"a": 1, "b": 2, "c": 3}

# This raises RuntimeError
for k in d:
if d[k] == 2:
del d[k]

# Fix: iterate over a snapshot
for k in list(d.keys()):
if d[k] == 2:
del d[k]

Pitfall 3: {} creates a dict, not a set

x = {}
print(type(x)) # <class 'dict'> - NOT a set!

y = set() # correct empty set
z = {1, 2, 3} # set literal works because there are no colons

Pitfall 4: Implementing __eq__ without __hash__

class BadKey:
def __eq__(self, other):
return True # everything is "equal"
# No __hash__ - Python sets __hash__ = None automatically

bk = BadKey()
d = {}
d[bk] = "value" # TypeError: unhashable type: 'BadKey'

Pitfall 5: Assuming hash stability between processes

# This will fail across process restarts:
# Process 1 computes hash("user_key") = 12345, stores to Redis
# Process 2 has different PYTHONHASHSEED, hash("user_key") = 67890
# The cached slot is wrong

# Fix: never persist Python hash values. Use a stable algorithm like SHA256
# for cross-process or cross-machine hashing needs.
import hashlib
stable_hash = int(hashlib.sha256("user_key".encode()).hexdigest(), 16)

Quick Reference Cheatsheet

OperationSyntaxComplexityNotes
Create{"k": v} or dict(k=v)O(1){} is empty dict
Lookupd[key]O(1) avgKeyError if missing
Safe lookupd.get(key, default)O(1) avgReturns default, no exception
Insert/Updated[key] = valueO(1) amortizedMay trigger resize
Deletedel d[key]O(1) avgKeyError if missing
Popd.pop(key, default)O(1) avgReturns value
Set if missingd.setdefault(key, val)O(1) avgInserts and returns val
Check keykey in dO(1) avgUse this, not d.keys()
Lengthlen(d)O(1)
Keys viewd.keys()O(1)Lazy, reflects current dict
Values viewd.values()O(1)Lazy
Items viewd.items()O(1)Lazy, use for iteration
Updated.update(other)O(len(other))Mutates d
Merge (3.9+)d1 | d2O(n+m)New dict, right wins
In-place merged1 |= d2O(len(d2))Mutates d1
Unpack merge{**d1, **d2}O(n+m)Works in all 3.x
Comprehension{k: v for k, v in items}O(n)
Cleard.clear()O(n)Removes all entries
Copy (shallow)d.copy()O(n)
Namespace dictvars(obj)O(1)Returns __dict__

Interview Questions and Answers

Q1: How does Python achieve O(1) average-case dict lookup? When does it degrade to O(n)?

Python dicts are hash tables. Lookup computes hash(key) (a fast O(1) operation for built-in types), uses the hash to jump directly to a slot in the indices array, reads the entry index, verifies key equality, and returns the value. This is O(1) regardless of the number of entries because it is a direct array jump, not a scan. Degradation to O(n) happens when many keys hash to the same slot - collision - forcing the perturbation-based probe sequence to walk through many occupied slots. With Python's hash randomization (PYTHONHASHSEED), natural keys cannot be engineered by an attacker to all collide, keeping the average case O(1) in practice.

Q2: What happens when a hash collision occurs in CPython's dict?

CPython uses open addressing. When computing h % size produces a slot index that is already occupied by a different key, Python probes alternative slots using the formula index = (index * 5 + perturb + 1) % size where perturb = hash_value >> 5 is reduced each step. This distributes probes across the table rather than clustering them. The probe continues until either an empty slot is found (for insertion) or the exact key is found (for lookup). If lookup reaches an empty slot without finding the key, the key is not in the dict.

Q3: Why does Python 3.7 guarantee dict insertion order, and how does the compact layout achieve it without a performance penalty?

The compact layout (introduced in CPython 3.6, made a language guarantee in 3.7) separates the table into an indices_array (small integers mapping hash slots to entry positions) and a dense entries_array (the actual key-value-hash triples, stored in insertion order). Adding a new entry appends to entries_array - an O(1) operation that naturally records insertion order. Iterating the dict walks entries_array from index 0 upward, giving insertion order. The indices_array is still randomly organized by hash for O(1) lookup. The two arrays cooperate: hash-based access goes through indices_array, iteration goes directly through entries_array. There is no extra computation or bookkeeping to maintain order; it falls out of the append-only design.

Q4: Explain the hashability contract. What rule must hold between __hash__ and __eq__?

The contract: if a == b then hash(a) == hash(b) must hold. Equal objects must have equal hashes. This is because dict lookup first narrows to a slot by hash, then confirms by equality. If two equal objects had different hashes, they would land in different slots; inserting with one form and looking up with another would fail. The reverse - if hash(a) == hash(b) then a == b - is not required; that is what a collision is. When you define __eq__ in a class, Python sets __hash__ = None automatically as a safety measure, forcing you to implement __hash__ explicitly. The standard pattern is to hash a tuple of the same fields used in __eq__: return hash((self.field1, self.field2)).

Q5: What triggers a dict resize, and what is the cost?

CPython resizes when the number of used slots (including dummy/deleted slots) exceeds 2/3 of the table capacity. Resizing allocates a new table approximately twice as large (the next power of 2 above used * 3 // 2), then rehashes all live entries into the new table. This is an O(n) operation - every live key must be rehashed. But resizes occur O(log n) times over n insertions, so the amortized cost per insertion is O(1). In practice, if you know the final dict size in advance, dict.fromkeys(keys) or pre-populating avoids intermediate resizes.

Q6: What is PYTHONHASHSEED and why does setting it at runtime via os.environ have no effect?

PYTHONHASHSEED is an environment variable read by the Python interpreter at startup to seed the hash randomization for strings and bytes. The seed is read before the interpreter initializes any Python objects - before os is even importable. Setting os.environ['PYTHONHASHSEED'] = '0' changes the environment dict for child processes spawned later, but has no effect on the current process's hash seed, which is already fixed. To actually change the seed, you must set the environment variable before launching Python: PYTHONHASHSEED=0 python3 script.py. The randomization exists as a DoS defense - in 2011, researchers demonstrated that crafting HTTP headers whose names all hash to the same slot could cause O(n^2) request parsing in dict-based web frameworks. Randomization makes collision engineering computationally infeasible.

Graded Practice Challenges

Level 1 - Predict the Output

d = {"a": 1, "b": 2, "c": 3}

d["d"] = 4
d["a"] = 99
del d["b"]

print(list(d.keys()))
print(len(d))

x = d.get("b", 0)
y = d.get("a", 0)
print(x, y)

d2 = {k: v * 2 for k, v in d.items() if v > 3}
print(d2)

print("a" in d, "b" in d)
Show Answer
['a', 'c', 'd']
3
0 99
{'d': 8}
True False

Key points:

  • After d["a"] = 99: "a" is updated in-place, its insertion-order position is preserved
  • After del d["b"]: "b" is removed; remaining keys in insertion order are "a", "c", "d"
  • len(d) = 3 (a, c, d)
  • d.get("b", 0) = 0 (b was deleted)
  • d.get("a", 0) = 99 (a exists with value 99)
  • Dict comprehension: only entries where v > 3: d is {a:99, c:3, d:4}. Values > 3: a=99 → 198, d=4 → 8. Wait - c=3 is not > 3, d=4 is > 3, a=99 is > 3. So d2 = {"a": 198, "d": 8}. Actually recheck: the comprehension doubles the values, so {"a": 198, "d": 8}.

Correction: d2 = {"a": 198, "d": 8}. The answer shown above ({"d": 8}) would be if only values > 5 were included. With v > 3: a=99 qualifies (99 > 3), c=3 does not (3 is not > 3), d=4 qualifies (4 > 3). So d2 = {"a": 198, "d": 8}.

Level 2 - Debug the Code

The following function is supposed to count word frequencies in a text and return the top-N words by count. It has three bugs:

def top_words(text, n):
words = text.lower().split()
counts = {}

for word in words:
if word in counts:
counts[word] =+ 1 # Bug 1
else:
counts[word] = 1

# Sort by count descending, then alphabetically for ties
sorted_words = sorted(counts.items(), key=lambda x: x[1]) # Bug 2

return dict(sorted_words[n:]) # Bug 3

text = "the cat sat on the mat the cat"
print(top_words(text, 2))
Show Answer

Bug 1: counts[word] =+ 1 is not incrementing - =+ is assignment of positive 1, equivalent to counts[word] = 1. The increment operator is +=: counts[word] += 1.

Bug 2: key=lambda x: x[1] sorts in ascending order (smallest count first). To get descending order (most frequent first), negate the key: key=lambda x: -x[1]. For ties sorted alphabetically: key=lambda x: (-x[1], x[0]).

Bug 3: sorted_words[n:] takes everything AFTER the first n elements - that is the bottom of the list, not the top n. Should be sorted_words[:n] to take the first n elements.

Fixed code:

def top_words(text, n):
words = text.lower().split()
counts = {}

for word in words:
if word in counts:
counts[word] += 1 # Fixed: += not =+
else:
counts[word] = 1

# Sort by count descending, then alphabetically for ties
sorted_words = sorted(counts.items(), key=lambda x: (-x[1], x[0])) # Fixed

return dict(sorted_words[:n]) # Fixed: :n not n:

text = "the cat sat on the mat the cat"
print(top_words(text, 2))
# {'the': 3, 'cat': 2}

Note: This can also be written more idiomatically as:

from collections import Counter
def top_words(text, n):
return dict(Counter(text.lower().split()).most_common(n))

Level 3 - Design Challenge: Implement a HashMap from Scratch

Implement a HashMap class in Python that does not use the built-in dict or set. It must:

  1. Use an internal array (list of lists or list of slots) as the backing store
  2. Implement put(key, value), get(key, default=None), delete(key), and contains(key)
  3. Handle collisions using separate chaining (each slot holds a list of (key, value) pairs)
  4. Maintain a load factor and resize (double the array, rehash) when the load factor exceeds 0.75
  5. Support len() and __repr__ for debugging
  6. Write a test demonstrating correct behavior with at least 20 insertions, including collision cases and resize
Show Answer
class HashMap:
INITIAL_CAPACITY = 8
LOAD_FACTOR_THRESHOLD = 0.75

def __init__(self):
self._capacity = self.INITIAL_CAPACITY
self._size = 0
# Each slot is a list of (key, value) pairs - separate chaining
self._buckets: list[list] = [[] for _ in range(self._capacity)]

def _slot(self, key, capacity=None) -> int:
"""Compute the bucket index for a key."""
cap = capacity if capacity is not None else self._capacity
return hash(key) % cap

def put(self, key, value) -> None:
"""Insert or update key-value pair."""
# Check if key already exists in its bucket (update in place)
bucket = self._buckets[self._slot(key)]
for i, (k, v) in enumerate(bucket):
if k == key:
bucket[i] = (key, value)
return
# Key not found: append new pair
bucket.append((key, value))
self._size += 1
# Resize if load factor exceeded
if self._size / self._capacity > self.LOAD_FACTOR_THRESHOLD:
self._resize()

def get(self, key, default=None):
"""Retrieve value for key, or default if not found."""
bucket = self._buckets[self._slot(key)]
for k, v in bucket:
if k == key:
return v
return default

def delete(self, key) -> bool:
"""Remove key. Returns True if found, False if not."""
bucket = self._buckets[self._slot(key)]
for i, (k, v) in enumerate(bucket):
if k == key:
del bucket[i]
self._size -= 1
return True
return False

def contains(self, key) -> bool:
bucket = self._buckets[self._slot(key)]
return any(k == key for k, v in bucket)

def _resize(self) -> None:
"""Double the capacity and rehash all entries."""
new_capacity = self._capacity * 2
new_buckets: list[list] = [[] for _ in range(new_capacity)]
for bucket in self._buckets:
for key, value in bucket:
new_slot = hash(key) % new_capacity
new_buckets[new_slot].append((key, value))
self._capacity = new_capacity
self._buckets = new_buckets

def __len__(self) -> int:
return self._size

def __repr__(self) -> str:
pairs = []
for bucket in self._buckets:
pairs.extend(bucket)
return "HashMap({" + ", ".join(f"{k!r}: {v!r}" for k, v in pairs) + "})"


# Test: 20+ insertions, collision cases, resize, delete, update
hm = HashMap()

# Insert 20 entries (will trigger at least one resize at capacity 8 * 0.75 = 6)
for i in range(20):
hm.put(f"key_{i}", i * 10)

print(f"Size after 20 inserts: {len(hm)}") # 20
print(f"Capacity after resize: {hm._capacity}") # 32 (resized from 8→16→32)

# Lookup
print(hm.get("key_5")) # 50
print(hm.get("key_19")) # 190
print(hm.get("missing", -1)) # -1

# Update existing key
hm.put("key_5", 9999)
print(hm.get("key_5")) # 9999
print(len(hm)) # still 20

# Delete
result = hm.delete("key_10")
print(result) # True
print(len(hm)) # 19
print(hm.get("key_10", "gone")) # "gone"
print(hm.delete("key_10")) # False - already deleted

# Collision simulation: use integer keys that hash to same slot
# In a capacity-8 table, keys 0 and 8 both hash to slot 0
small = HashMap()
small.put(0, "zero")
small.put(8, "eight") # collides with 0 in slot 0 % 8 = 0
print(small.get(0)) # "zero"
print(small.get(8)) # "eight"

# Contains
print(hm.contains("key_0")) # True
print(hm.contains("key_10")) # False (deleted)

Design notes:

  • Separate chaining is used instead of open addressing. Each bucket is a Python list of (key, value) pairs. On collision, pairs are appended to the same bucket list. This avoids the need for tombstone markers on deletion.
  • Resize threshold of 0.75 is the standard Java HashMap default and a common interview expectation. CPython uses 2/3 (≈0.67).
  • Rehashing on resize is O(n) but occurs O(log n) times over n insertions → amortized O(1) per insert.
  • Production caveat: This implementation uses Python's hash() which is randomized per process. For cross-process or cross-machine consistency, use a stable hash function (e.g., xxHash, SHA256, MurmurHash).

Key Takeaways

  • Python dicts are compact hash tables: a dense entries_array for insertion-ordered entries plus a sparse indices_array for O(1) hash-based slot lookup
  • Lookup, insertion, and deletion are O(1) average and O(n) worst case - worst case requires engineered hash collisions, which PYTHONHASHSEED randomization prevents
  • CPython resolves collisions using perturbation-based open addressing - each probe shifts the perturbation right by 5 bits, distributing probes across the table
  • Insertion order is a language guarantee since Python 3.7, falling naturally out of the append-only entries_array design
  • The hashability contract: a == b requires hash(a) == hash(b) - defining __eq__ without __hash__ makes instances unhashable; always implement both together
  • PYTHONHASHSEED is read at interpreter startup and cannot be changed at runtime; it is a security control against hash collision DoS attacks - never disable it in production
  • Python namespaces are dicts - variable lookup, attribute access, and the entire Python object model runs on top of the dict implementation
  • Dicts use significantly more memory than lists (4-5x for the same number of elements) because of hash table overhead; this is the cost of O(1) lookup
  • Dict comprehensions, the | merge operator (Python 3.9+), and .setdefault() are idiomatic tools worth knowing for production Python code
  • Modifying a dict's size during iteration raises RuntimeError; iterate over list(d.keys()) when deletion during iteration is necessary
© 2026 EngineersOfAI. All rights reserved.