
Hash Tables and Bloom Filters

A Billion Tokens and a Question of Duplicates

In early 2023, a team preparing a trillion-token pretraining corpus for a large language model faced a deceptively simple problem: they needed to remove duplicate documents before training. Their corpus contained web crawl data from multiple years, and the same Wikipedia article, news story, or Stack Overflow answer appeared hundreds or thousands of times across different crawl snapshots.

Exact deduplication by content hash is straightforward - hash each document, keep a set of seen hashes, discard documents whose hash is already in the set. The problem is the set itself. With one trillion tokens across roughly 10 billion documents, an exact hash set would require storing 10 billion 64-bit hashes in memory: $10^{10} \times 8$ bytes $= 80$ GB. Their training cluster had 64 GB per CPU node. Exact in-memory deduplication was not feasible.

The solution was a Bloom filter. A Bloom filter can answer "have I seen this document hash before?" with zero false negatives and a tunable false positive rate. Size the filter at 12 bits per element for a sub-1% false positive rate: $10^{10} \times 12 / 8$ bytes $= 15$ GB. Fits comfortably in a single node's RAM. Accept that a fraction of a percent of unique documents get incorrectly flagged as duplicates - at the scale of pretraining data, losing that sliver of unique documents is irrelevant to model quality.

That is the core tension this lesson explores: exact vs approximate membership testing, exact vs approximate partitioning, exact vs approximate similarity. At the scales of modern ML, exact is often impossible. Probabilistic data structures are not a compromise - they are the only viable approach.

Hash tables and their variants underpin nearly every fast lookup in ML infrastructure: feature stores, model registries, experiment trackers, KV caches for inference, hyperparameter result caches, dataset deduplication pipelines, and consistent routing in distributed serving systems. Understanding them from the inside out - not just "use a dict" - lets you reason about load factors, collision rates, rehashing costs, and when hash-based approaches break down.


Why This Exists

The fundamental problem hash tables solve is the dictionary problem: given a dynamic set of key-value pairs, support insert, delete, and lookup in $O(1)$ expected time. The naive alternative is a sorted array with binary search: $O(\log n)$ lookup, $O(n)$ insertion. For small collections, binary search is fine. For feature stores with millions of keys being hit by thousands of queries per second, $O(1)$ vs $O(\log n)$ is the difference between 1ms and 20ms latency.

Hash tables achieve $O(1)$ average time by using a hash function to map keys to array indices, converting the search problem into an array access problem.
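
A quick, illustrative micro-benchmark (absolute timings vary by machine) makes the gap concrete by comparing dict lookups against binary search over a sorted list:

import bisect
import random
import time

n = 1_000_000
keys = list(range(n))
values = [k * 2 for k in keys]
d = dict(zip(keys, values))   # hash table
sorted_keys = keys            # already sorted for binary search

queries = [random.randrange(n) for _ in range(100_000)]

start = time.perf_counter()
_ = [d[q] for q in queries]                                         # O(1) expected per lookup
t_dict = (time.perf_counter() - start) * 1000

start = time.perf_counter()
_ = [values[bisect.bisect_left(sorted_keys, q)] for q in queries]   # O(log n) per lookup
t_bisect = (time.perf_counter() - start) * 1000

print(f"dict lookups:  {t_dict:.1f} ms for 100,000 queries")
print(f"binary search: {t_bisect:.1f} ms for 100,000 queries")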

Bloom filters extend this idea: instead of storing the actual keys (which requires $O(n)$ space for $n$ keys), store a compact probabilistic summary that can answer "is this key in the set?" with no false negatives. The space saving over an exact hash set can be 10-100x.


Historical Context

Hash tables were invented independently by several researchers in the early 1950s. Hans Peter Luhn at IBM is often credited with the first description in 1953. The open addressing scheme was developed by Gene Amdahl, Elaine McGraw, and Arthur Samuel (of checkers-playing program fame) at IBM in 1954.

The Python dictionary evolved significantly over the years. Python 3.6 introduced a more compact representation (due to Raymond Hettinger) that reduced memory usage by 20-25% and, as a side effect, made dict insertion-ordered (though this was not a guaranteed contract until Python 3.7). The current CPython dict implementation uses a combination of open addressing with pseudorandom probing and a compact key-value storage array - worth understanding because Python dicts are everywhere in ML code.

Burton Bloom invented Bloom filters in 1970 in a paper about space-time tradeoffs in hash coding. The original application was not ML - it was a dictionary for spell-checking. The mathematical elegance (a single formula relating bits per element, hash functions, and false positive rate) made it a textbook staple. Its ML applications in deduplication were recognized much later.

MinHash was invented by Andrei Broder at AltaVista in 1997 for detecting near-duplicate web pages. Combined with Locality-Sensitive Hashing (LSH), it became the standard approach for large-scale near-duplicate detection in NLP dataset preparation.


Core Concepts

Hash Table Internals

A hash table stores key-value pairs in an array of buckets. A hash function maps each key to a bucket index. Two keys may map to the same index - this is a hash collision. Collision resolution is what distinguishes different hash table implementations.

Open Addressing: All key-value pairs are stored directly in the array. When a collision occurs, probe alternative locations until an empty slot is found. Common probing strategies:

  • Linear probing: try index $h$, $h+1$, $h+2$, ...
  • Quadratic probing: try $h$, $h+1$, $h+4$, $h+9$, ...
  • Double hashing: try $h$, $h + g(k)$, $h + 2g(k)$, ... where $g$ is a second hash function

Python's dict uses a form of open addressing with pseudorandom probing based on the hash value itself.

Separate Chaining: Each bucket is a linked list (or small array). All keys hashing to the same bucket are stored in its list. Lookup requires scanning the list, but lists stay short if the hash function is good and the load factor is reasonable.
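
To make open addressing concrete, here is a minimal, illustrative linear-probing map (insert and lookup only - no deletion, no resizing; a production table would also track tombstones and rehash as the load factor grows):

class LinearProbingMap:
    """Toy open-addressing hash map with linear probing (no deletes, no resize)."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.slots = [None] * capacity   # each slot: None or (key, value)

    def _probe(self, key):
        """Yield slot indices starting at hash(key), wrapping around the array."""
        start = hash(key) % self.capacity
        for i in range(self.capacity):
            yield (start + i) % self.capacity

    def put(self, key, value):
        for idx in self._probe(key):
            if self.slots[idx] is None or self.slots[idx][0] == key:
                self.slots[idx] = (key, value)
                return
        raise RuntimeError("table full - a real implementation would resize here")

    def get(self, key):
        for idx in self._probe(key):
            if self.slots[idx] is None:
                raise KeyError(key)          # empty slot: key was never inserted
            if self.slots[idx][0] == key:
                return self.slots[idx][1]
        raise KeyError(key)

m = LinearProbingMap()
m.put("lr", 0.001)
m.put("batch_size", 32)
print(m.get("lr"), m.get("batch_size"))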

The load factor $\alpha = n / m$ where $n$ is the number of stored keys and $m$ is the table size. As $\alpha$ approaches 1:

  • Open addressing degrades rapidly: the expected number of probes for an unsuccessful search is $\approx 1/(1 - \alpha)$ under uniform hashing, and worse for linear probing because of clustering
  • Separate chaining degrades gradually: average chain length $= \alpha$

Python's dict rehashes when $\alpha > 2/3$. Java's HashMap resizes when $\alpha > 0.75$. Redis uses separate chaining with incremental rehashing.

Rehashing: When the load factor exceeds the threshold, allocate a new array (typically $2\times$ the current size) and reinsert all keys. This is $O(n)$ but happens infrequently enough that the amortized cost per insertion is $O(1)$.

# Explore Python dict internals
import sys

# Memory usage at different sizes
for n in [0, 10, 100, 1000, 10_000]:
    d = {i: i for i in range(n)}
    print(f"n={n:6d}: sys.getsizeof = {sys.getsizeof(d):8d} bytes, "
          f"per-item = {sys.getsizeof(d)/max(n,1):.1f} bytes")

# The internal table capacity is not directly exposed from Python, but each
# jump in sys.getsizeof corresponds to a rehash into a larger table, so we
# can observe the resize points empirically.
prev_size = 0
d = {}
for i in range(200):
    d[i] = i
    current_size = sys.getsizeof(d)
    if current_size != prev_size:
        print(f"  Resize at n={i}: {prev_size} -> {current_size} bytes")
        prev_size = current_size

Hash Functions: Choosing and Evaluating

A good hash function for hash tables should be:

  1. Fast: $O(1)$ to compute
  2. Uniform: outputs are evenly distributed over the hash space
  3. Deterministic: same input always gives same output
  4. Avalanche: small input change causes large output change (important for avoiding clustering)

MurmurHash3: Austin Appleby's non-cryptographic hash function. Excellent avalanche behavior, very fast, and widely deployed (Cassandra's default partitioner, scikit-learn's feature hashing). The basis for many ML hashing applications.

xxHash: Yann Collet's extremely fast hash (faster than MurmurHash for small inputs). Used in FAISS, DVC, and many ML tools.

FNV (Fowler-Noll-Vo): Simple, fast, used in network protocols. Not as uniform as MurmurHash but easier to implement.

SHA-256: Cryptographic hash. Excellent distribution and collision resistance, but 10-100x slower than MurmurHash. Use for security-sensitive applications (digital signatures, content addressing in Git). For pure hash table use, overkill.

import hashlib
import mmh3  # pip install mmh3
import time

def compare_hash_functions(strings: list):
    """Compare hash function speed and distribution."""
    n = len(strings)

    # Python's built-in hash
    start = time.perf_counter()
    hashes_builtin = [hash(s) for s in strings]
    t_builtin = (time.perf_counter() - start) * 1000

    # MurmurHash3 via mmh3
    start = time.perf_counter()
    hashes_mmh3 = [mmh3.hash(s) for s in strings]
    t_mmh3 = (time.perf_counter() - start) * 1000

    # SHA-256
    start = time.perf_counter()
    hashes_sha256 = [
        int(hashlib.sha256(s.encode()).hexdigest(), 16) for s in strings
    ]
    t_sha256 = (time.perf_counter() - start) * 1000

    print(f"n={n} strings:")
    print(f"  Python hash: {t_builtin:.1f}ms")
    print(f"  MurmurHash3: {t_mmh3:.1f}ms")
    print(f"  SHA-256:     {t_sha256:.1f}ms ({t_sha256/t_mmh3:.0f}x slower than mmh3)")

    # Check distribution quality: map n keys into n/2 buckets to force collisions
    table_size = n // 2
    for name, hashes in [("Python hash", hashes_builtin),
                         ("MurmurHash3", hashes_mmh3)]:
        buckets = {}
        for h in hashes:
            bucket = h % table_size
            buckets[bucket] = buckets.get(bucket, 0) + 1
        max_chain = max(buckets.values())
        collisions = sum(v - 1 for v in buckets.values() if v > 1)
        print(f"  {name}: max chain={max_chain}, collisions={collisions}")

import random
import string
test_strings = [''.join(random.choices(string.ascii_lowercase, k=20))
                for _ in range(10_000)]
compare_hash_functions(test_strings)
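
The avalanche property from the checklist above can also be tested empirically: flip one input bit and count how many of the 32 output bits change - a good hash flips about half of them on average. A small sketch using mmh3:

import mmh3
import random

def avalanche_score(n_trials: int = 10_000) -> float:
    """Average fraction of the 32 output bits that flip when one input bit flips."""
    total_flipped = 0
    for _ in range(n_trials):
        data = bytearray(random.randbytes(16))
        h1 = mmh3.hash(bytes(data))
        bit = random.randrange(len(data) * 8)
        data[bit // 8] ^= 1 << (bit % 8)      # flip a single input bit
        h2 = mmh3.hash(bytes(data))
        total_flipped += bin((h1 ^ h2) & 0xFFFFFFFF).count("1")
    return total_flipped / (n_trials * 32)

print(f"mmh3 avalanche score: {avalanche_score():.3f} (ideal: ~0.5)")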

Consistent Hashing for Distributed ML

In a distributed ML system, you often need to partition data (feature vectors, model shards, cached results) across multiple nodes. Naive modulo hashing - assigning key $k$ to node $k \bmod N$ - is simple but breaks catastrophically when nodes are added or removed: nearly every key is reassigned to a different node, invalidating all caches.

Consistent hashing solves this by mapping both nodes and keys to positions on a conceptual ring (hash space $[0, 2^{32})$). A key is assigned to the first node clockwise from it on the ring. When a node is added or removed, only keys in the affected arc are reassigned - about $N_\text{keys} / N_\text{nodes}$ keys on average, not all keys.

ML applications:

  • Feature store partitioning: shard features across Redis cluster nodes. Adding a node only migrates $\sim 1/N$ of features.
  • Model weight sharding: in pipeline-parallel training, shard layers across GPUs. Consistent hashing can route intermediate activations to the correct GPU.
  • KV cache in distributed inference: route generation requests by prompt hash so the same prompt always hits the same worker (cache locality).
  • Experiment results caching: route hyperparameter configs by their hash to distributed result store nodes.
import hashlib
import bisect
from typing import List, Optional, Dict

class ConsistentHashRing:
    """
    Consistent hash ring for distributing keys across nodes.

    Uses virtual nodes (vnodes) to ensure even distribution.
    Each physical node gets n_vnodes positions on the ring,
    so the distribution is more uniform than one point per node.
    """

    def __init__(self, n_vnodes: int = 150):
        """
        n_vnodes: virtual nodes per physical node.
                  Higher = more uniform distribution.
                  Values in the 100-300 range are common in production.
        """
        self.n_vnodes = n_vnodes
        self.ring: Dict[int, str] = {}      # position -> node name
        self.sorted_keys: List[int] = []    # sorted ring positions

    def _hash(self, key: str) -> int:
        """Hash a string key to a position on the ring [0, 2^32)."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

    def add_node(self, node: str) -> None:
        """
        Add a physical node to the ring.
        Creates n_vnodes virtual positions for it.
        """
        for vnode_id in range(self.n_vnodes):
            vnode_key = f"{node}#vnode{vnode_id}"
            position = self._hash(vnode_key)
            self.ring[position] = node
            bisect.insort(self.sorted_keys, position)

    def remove_node(self, node: str) -> None:
        """Remove a physical node and all its virtual nodes from the ring."""
        for vnode_id in range(self.n_vnodes):
            vnode_key = f"{node}#vnode{vnode_id}"
            position = self._hash(vnode_key)
            if position in self.ring:
                del self.ring[position]
                idx = bisect.bisect_left(self.sorted_keys, position)
                if idx < len(self.sorted_keys) and self.sorted_keys[idx] == position:
                    self.sorted_keys.pop(idx)

    def get_node(self, key: str) -> Optional[str]:
        """Find the node responsible for a given key."""
        if not self.ring:
            return None
        position = self._hash(key)
        # Find first node clockwise from key's position
        idx = bisect.bisect(self.sorted_keys, position)
        if idx == len(self.sorted_keys):
            idx = 0  # wrap around the ring
        return self.ring[self.sorted_keys[idx]]

    def get_distribution(self, n_keys: int = 100_000) -> Dict[str, int]:
        """Measure how evenly keys are distributed across nodes."""
        from collections import Counter
        import random
        keys = [str(random.random()) for _ in range(n_keys)]
        return dict(Counter(self.get_node(k) for k in keys))


# Demo: feature store with consistent hashing
ring = ConsistentHashRing(n_vnodes=150)
nodes = ["redis-1", "redis-2", "redis-3"]
for node in nodes:
    ring.add_node(node)

# Check distribution
dist = ring.get_distribution(100_000)
print("Key distribution across 3 nodes:")
for node, count in sorted(dist.items()):
    bar = '#' * (count // 1000)
    print(f"  {node}: {count:6d} keys {bar}")

# Simulate adding a 4th node
print("\nAdding redis-4...")
ring.add_node("redis-4")
dist2 = ring.get_distribution(100_000)
print("Key distribution across 4 nodes:")
for node, count in sorted(dist2.items()):
    bar = '#' * (count // 1000)
    print(f"  {node}: {count:6d} keys {bar}")

# Feature routing example
feature_keys = [f"user_{i}_embedding" for i in range(1000)]
routing = {key: ring.get_node(key) for key in feature_keys}
from collections import Counter
print(f"\nFeature routing: {dict(Counter(routing.values()))}")

Bloom Filters: The Math Behind the Magic

A Bloom filter is a bit array of size $m$ initialized to all zeros, with $k$ independent hash functions, each mapping any key to a position in $[0, m)$.

Insert(key): compute all $k$ hash positions and set those bits to 1. Query(key): compute all $k$ hash positions. If ALL $k$ bits are 1, return "possibly in set." If ANY bit is 0, return "definitely not in set."

There are no false negatives - if an element was inserted, all its bits are set. There can be false positives - all $k$ bits for a query element may be set even though it was never inserted (set by other elements).

The false positive probability for a filter with $m$ bits, $k$ hash functions, and $n$ inserted elements:

$$P(\text{false positive}) \approx \left(1 - e^{-kn/m}\right)^k$$

The optimal number of hash functions for a given $m$ and $n$:

$$k^* = \frac{m}{n} \ln 2 \approx 0.693 \cdot \frac{m}{n}$$

At the optimal $k$, the false positive rate simplifies to:

$$P^* \approx \left(\frac{1}{2}\right)^{k^*} = \left(\frac{1}{2}\right)^{0.693\, m/n}$$

Bits per element for a target false positive rate $p$:

$$\frac{m}{n} = -\frac{\ln p}{(\ln 2)^2} \approx -1.44 \log_2 p$$

Some practical values:

  • 7 bits per element ($m/n$): ~3.1% false positive rate
  • 10 bits per element: ~0.82%
  • 12 bits per element: ~0.35%
  • 15 bits per element: ~0.12%
  • 20 bits per element: ~0.012%

For deduplicating a 10-billion document corpus at 0.35% false positive rate: $10^{10} \times 12 = 1.2 \times 10^{11}$ bits $= 15$ GB. Compare to storing 10 billion 64-bit hashes exactly: $10^{10} \times 8$ bytes $= 80$ GB. The Bloom filter achieves the same deduplication quality (with 0.35% imprecision) using only 19% of the memory.

import math
import numpy as np
import mmh3  # pip install mmh3
from typing import List

class BloomFilter:
    """
    Space-efficient probabilistic set membership test.

    Guarantees: no false negatives.
    Properties: tunable false positive rate via bits per element.
    """

    def __init__(self, expected_n: int, target_fpr: float = 0.01):
        """
        Args:
            expected_n: expected number of elements to insert
            target_fpr: target false positive rate (e.g., 0.01 = 1%)
        """
        # Optimal filter size in bits
        self.m = self._optimal_bits(expected_n, target_fpr)
        # Optimal number of hash functions
        self.k = self._optimal_hash_count(self.m, expected_n)
        # Bit array (using numpy for efficiency)
        self.bit_array = np.zeros(self.m, dtype=np.bool_)
        self.n_inserted = 0

        print(f"BloomFilter: m={self.m:,} bits ({self.m/8/1024/1024:.1f} MB), "
              f"k={self.k} hash functions, "
              f"target FPR={target_fpr:.3%}")

    @staticmethod
    def _optimal_bits(n: int, fpr: float) -> int:
        """Number of bits for target false positive rate."""
        m = -n * math.log(fpr) / (math.log(2) ** 2)
        return int(math.ceil(m))

    @staticmethod
    def _optimal_hash_count(m: int, n: int) -> int:
        """Optimal number of hash functions."""
        k = (m / n) * math.log(2)
        return max(1, int(round(k)))

    def _get_positions(self, item: str) -> List[int]:
        """Compute k hash positions for an item."""
        # Use a different seed for each hash function
        return [mmh3.hash(item, seed=i) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        """Insert an item into the filter."""
        for pos in self._get_positions(item):
            self.bit_array[pos] = True
        self.n_inserted += 1

    def __contains__(self, item: str) -> bool:
        """
        Test membership. Returns:
            True  = possibly in set (may be false positive)
            False = definitely not in set (no false negatives)
        """
        return all(self.bit_array[pos] for pos in self._get_positions(item))

    def estimated_fpr(self) -> float:
        """Current estimated false positive rate based on actual insertions."""
        k, n, m = self.k, self.n_inserted, self.m
        return (1 - math.exp(-k * n / m)) ** k

    def bits_set(self) -> int:
        """Number of bits currently set to 1."""
        return int(self.bit_array.sum())


# Demo: training data deduplication
import hashlib
import random
import string

def make_document(n_words: int = 50) -> str:
    """Generate a synthetic document."""
    words = [''.join(random.choices(string.ascii_lowercase, k=random.randint(3, 10)))
             for _ in range(n_words)]
    return ' '.join(words)

# Simulate a corpus with ~20% duplicates
print("Generating synthetic corpus...")
unique_docs = [make_document() for _ in range(100_000)]
corpus = unique_docs.copy()
# Add 20% duplicates
corpus += random.choices(unique_docs, k=20_000)
random.shuffle(corpus)
print(f"Corpus size: {len(corpus):,} documents "
      f"({len(unique_docs):,} unique + {len(corpus) - len(unique_docs):,} duplicates)")

# Deduplicate using exact hash set (memory intensive at scale)
print("\nExact deduplication:")
exact_seen = set()
exact_kept = 0
for doc in corpus:
    doc_hash = hashlib.sha256(doc.encode()).hexdigest()
    if doc_hash not in exact_seen:
        exact_seen.add(doc_hash)
        exact_kept += 1
print(f"  Kept: {exact_kept:,} | Memory: ~{len(exact_seen) * 32 / 1024:.1f} KB (raw digests)")

# Deduplicate using Bloom filter (memory efficient)
print("\nBloom filter deduplication:")
bf = BloomFilter(expected_n=len(corpus), target_fpr=0.001)
bloom_kept = 0
for doc in corpus:
    doc_hash = hashlib.sha256(doc.encode()).hexdigest()
    if doc_hash not in bf:
        bf.add(doc_hash)
        bloom_kept += 1
    # else: flagged as duplicate. In real use we cannot tell which of these
    # are false positives; this simulation could check against exact_seen.

print(f"  Kept: {bloom_kept:,} | Estimated FPR: {bf.estimated_fpr():.4%}")
print(f"  Bits set: {bf.bits_set():,} / {bf.m:,} ({bf.bits_set()/bf.m:.1%})")

Counting Bloom Filters and Cuckoo Filters

Standard Bloom filters do not support deletion: once you set a bit to 1 for an element, you cannot know which other elements also set that bit. If you clear it, you introduce false negatives.

Counting Bloom filters: Replace each bit with a small counter (typically 4 bits). Insertion increments counters; deletion decrements them. Supports deletion at the cost of 4x more space. Used when the set changes over time - e.g., tracking which training examples have been seen in the current epoch and resetting after each epoch.
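
A minimal sketch of the counting variant described above (sized by explicit m and k rather than a target false positive rate, with uint8 counters standing in for packed 4-bit counters):

import numpy as np
import mmh3

class CountingBloomFilter:
    """Bloom filter variant that supports deletion by replacing bits with small counters."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = np.zeros(m, dtype=np.uint8)   # stand-in for packed 4-bit counters

    def _positions(self, item: str):
        return [mmh3.hash(item, seed=i) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1    # production code would clamp to avoid counter overflow

    def remove(self, item: str) -> None:
        # Only safe for items that were actually added; removing others corrupts the counts
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))

cbf = CountingBloomFilter(m=1_000_003, k=7)
cbf.add("doc_12345")
print("doc_12345" in cbf)    # True (possibly a false positive, as with a standard filter)
cbf.remove("doc_12345")
print("doc_12345" in cbf)    # False, assuming no colliding insertions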

Cuckoo filters: Use cuckoo hashing to store fingerprints compactly. Support deletion and achieve better practical false positive rates than Bloom filters at the same space. Each bucket stores 2-4 fingerprints. On collision, the existing fingerprint is "kicked out" to its alternate bucket (like the cuckoo bird kicking eggs from a nest). Used in network packet classification and more recently in ML deduplication pipelines.

MinHash for Near-Duplicate Detection

A Bloom filter detects exact duplicates (same byte content). Near-duplicate detection identifies documents with high textual similarity even if they are not identical - e.g., the same article with minor edits, minor date changes, or different formatting.

MinHash is the key tool. The idea: represent each document by the minimum hash value seen over all its shingles (n-grams). Two documents with high Jaccard similarity (large overlap in their shingle sets) will have the same minimum hash with probability equal to their Jaccard similarity.

Jaccard similarity for sets $A$ and $B$: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

The MinHash estimate: with $t$ independent hash functions, the fraction of hash functions where the two documents agree on the minimum hash is an unbiased estimate of $J(A, B)$.

LSH (Locality-Sensitive Hashing): To find all near-duplicate pairs in a corpus of $N$ documents without comparing all $\binom{N}{2}$ pairs, LSH groups documents into buckets such that similar documents land in the same bucket with high probability. Divide the $t$ MinHash values into $b$ bands of $r$ values each. Two documents hash to the same bucket in a band if all $r$ values match. The probability of sharing at least one band (and being flagged as a near-duplicate candidate) as a function of Jaccard similarity is:

$$P(\text{candidate}) = 1 - (1 - s^r)^b$$

where $s$ is the true Jaccard similarity. By tuning $b$ and $r$, you control the sensitivity threshold.
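
The candidate probability traces an S-curve in $s$. A quick sketch evaluating it for the band settings used in the implementation below ($t = 128$, $b = 16$, $r = 8$):

def p_candidate(s: float, b: int, r: int) -> float:
    """Probability that two docs with Jaccard similarity s share at least one LSH band."""
    return 1 - (1 - s ** r) ** b

b, r = 16, 8    # 16 bands of 8 rows = 128 MinHash values, as in the class below
for s in [0.3, 0.5, 0.6, 0.7, 0.8, 0.9]:
    print(f"s = {s:.1f}: P(candidate) = {p_candidate(s, b, r):.3f}")
# The curve rises steeply around s* = (1/b)^(1/r) ≈ 0.71 for these settings.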

import hashlib
import re
from collections import defaultdict
from typing import List, Set, Tuple, Dict
import numpy as np

class MinHashLSH:
    """
    MinHash + LSH for near-duplicate document detection.

    Used in:
    - LLM pretraining data deduplication (GPT-3, Llama, etc.)
    - Web crawl deduplication
    - Duplicate question detection in Q&A datasets
    """

    def __init__(
        self,
        n_hashes: int = 128,                 # t: number of MinHash functions
        n_bands: int = 16,                   # b: bands for LSH
        shingle_size: int = 5,               # k-shingle (k-gram) size in words
        similarity_threshold: float = 0.8    # target Jaccard threshold
    ):
        self.n_hashes = n_hashes
        self.n_bands = n_bands
        self.rows_per_band = n_hashes // n_bands   # r
        self.shingle_size = shingle_size
        self.similarity_threshold = similarity_threshold

        # Random hash function parameters h_i(x) = (a_i * x + b_i) mod prime,
        # drawn from a universal hashing family
        self.prime = (1 << 31) - 1   # Mersenne prime
        rng = np.random.default_rng(seed=42)
        self.a = rng.integers(1, self.prime, size=n_hashes, dtype=np.int64)
        self.b = rng.integers(0, self.prime, size=n_hashes, dtype=np.int64)

        # LSH index: band bucket key -> list of doc_ids
        self.lsh_index: Dict[Tuple, List[int]] = defaultdict(list)
        # Store MinHash signatures for similarity computation
        self.signatures: Dict[int, np.ndarray] = {}

    def _shingle(self, text: str) -> Set[int]:
        """
        Convert text to a set of k-shingle hashes.
        k-shingle = k consecutive words.
        """
        words = re.findall(r'\w+', text.lower())
        shingles = set()
        for i in range(len(words) - self.shingle_size + 1):
            shingle = ' '.join(words[i:i + self.shingle_size])
            # Reduce the 128-bit MD5 value mod prime so it fits in int64 arithmetic later
            shingle_hash = int(hashlib.md5(shingle.encode()).hexdigest(), 16) % self.prime
            shingles.add(shingle_hash)
        return shingles

    def _minhash(self, shingles: Set[int]) -> np.ndarray:
        """
        Compute MinHash signature: t minimum hash values.
        Each hash function h_i(x) = (a_i * x + b_i) mod prime

        signature[i] = min_{x in shingles} h_i(x)
        """
        shingle_arr = np.array(list(shingles), dtype=np.int64)
        # Vectorized computation: (n_hashes x n_shingles) matrix of hash values,
        # then take the minimum over the shingles dimension
        hashes = (np.outer(self.a, shingle_arr) + self.b[:, np.newaxis]) % self.prime
        return np.min(hashes, axis=1)

    def add_document(self, doc_id: int, text: str) -> None:
        """Add a document to the index."""
        shingles = self._shingle(text)
        if len(shingles) == 0:
            return
        sig = self._minhash(shingles)
        self.signatures[doc_id] = sig

        # Insert into LSH bands
        for band_idx in range(self.n_bands):
            start = band_idx * self.rows_per_band
            end = start + self.rows_per_band
            band_signature = tuple(sig[start:end])
            bucket_key = (band_idx,) + band_signature
            self.lsh_index[bucket_key].append(doc_id)

    def find_candidates(self, doc_id: int) -> Set[int]:
        """
        Find candidate near-duplicate documents for doc_id.
        Returns doc IDs that share at least one LSH band bucket.
        """
        sig = self.signatures.get(doc_id)
        if sig is None:
            return set()

        candidates = set()
        for band_idx in range(self.n_bands):
            start = band_idx * self.rows_per_band
            end = start + self.rows_per_band
            band_signature = tuple(sig[start:end])
            bucket_key = (band_idx,) + band_signature
            for other_id in self.lsh_index[bucket_key]:
                if other_id != doc_id:
                    candidates.add(other_id)
        return candidates

    def jaccard_similarity(self, doc_id_a: int, doc_id_b: int) -> float:
        """Estimate Jaccard similarity from MinHash signatures."""
        sig_a = self.signatures.get(doc_id_a)
        sig_b = self.signatures.get(doc_id_b)
        if sig_a is None or sig_b is None:
            return 0.0
        return float(np.mean(sig_a == sig_b))

    def find_near_duplicates(self) -> List[Tuple[int, int, float]]:
        """
        Find all near-duplicate pairs with similarity above threshold.
        Returns list of (doc_id_a, doc_id_b, estimated_similarity).
        """
        pairs = set()
        near_dupes = []

        for doc_id in self.signatures:
            candidates = self.find_candidates(doc_id)
            for other_id in candidates:
                pair = (min(doc_id, other_id), max(doc_id, other_id))
                if pair not in pairs:
                    pairs.add(pair)
                    sim = self.jaccard_similarity(*pair)
                    if sim >= self.similarity_threshold:
                        near_dupes.append((*pair, sim))

        near_dupes.sort(key=lambda x: -x[2])
        return near_dupes


# Demo: dataset deduplication
lsh = MinHashLSH(n_hashes=128, n_bands=16, similarity_threshold=0.7)

# Sample documents with near-duplicates
docs = {
    0: "The transformer architecture uses self-attention to process sequences in parallel",
    1: "Transformer models use self-attention mechanisms to process input sequences in parallel",  # near-dup of 0
    2: "BERT uses bidirectional training and masked language modeling objectives",
    3: "GPT models are trained with autoregressive language modeling on large text corpora",
    4: "BERT is trained with bidirectional attention and masked language modeling",  # near-dup of 2
    5: "Convolutional neural networks process images with spatial filters and pooling layers",
}

for doc_id, text in docs.items():
    lsh.add_document(doc_id, text)

near_dupes = lsh.find_near_duplicates()
print("Near-duplicate pairs found:")
for id_a, id_b, sim in near_dupes:
    print(f"  doc {id_a} ~ doc {id_b}: similarity={sim:.3f}")
    print(f"    A: {docs[id_a][:60]}...")
    print(f"    B: {docs[id_b][:60]}...")

Hyperparameter Caching with Hash Keys

In ML experimentation, you often run the same hyperparameter configuration multiple times (across different team members, retries after failures, ablation studies). Caching results by a deterministic hash of the config avoids redundant training runs.

The key requirement: the hash must be deterministic and collision-resistant. Use SHA-256 or BLAKE2b on the JSON-serialized config (with keys sorted to ensure canonical form).

import hashlib
import json
import os
from typing import Any, Dict, Optional
from datetime import datetime

class HyperparameterCache:
    """
    Cache training results keyed by hyperparameter configuration hash.
    Avoids redundant training runs for identical configurations.
    """

    def __init__(self, cache_dir: str = "/tmp/hp_cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _config_hash(self, config: Dict[str, Any]) -> str:
        """
        Compute a deterministic hash of a hyperparameter config.

        Critical: sort keys for canonical JSON representation.
        Without sorting, {"lr": 0.01, "bs": 32} and {"bs": 32, "lr": 0.01}
        would produce different hashes despite being identical configs.
        """
        canonical = json.dumps(config, sort_keys=True, ensure_ascii=True)
        return hashlib.sha256(canonical.encode()).hexdigest()[:16]  # 16 hex chars = 64 bits

    def _cache_path(self, config_hash: str) -> str:
        return os.path.join(self.cache_dir, f"{config_hash}.json")

    def get(self, config: Dict[str, Any]) -> Optional[Dict]:
        """Retrieve cached result for a config, or None if not cached."""
        h = self._config_hash(config)
        path = self._cache_path(h)
        if os.path.exists(path):
            with open(path, 'r') as f:
                cached = json.load(f)
            print(f"Cache HIT  [{h}]: val_loss={cached['results']['val_loss']:.4f}")
            return cached['results']
        print(f"Cache MISS [{h}]")
        return None

    def put(self, config: Dict[str, Any], results: Dict[str, Any]) -> str:
        """Cache results for a config. Returns the config hash."""
        h = self._config_hash(config)
        path = self._cache_path(h)
        record = {
            "config": config,
            "config_hash": h,
            "results": results,
            "timestamp": datetime.utcnow().isoformat(),
        }
        with open(path, 'w') as f:
            json.dump(record, f, indent=2)
        return h


# Demo
cache = HyperparameterCache()

configs = [
    {"lr": 0.001, "batch_size": 32, "warmup_steps": 500, "model": "bert-base"},
    {"lr": 0.001, "batch_size": 64, "warmup_steps": 500, "model": "bert-base"},
    {"lr": 0.001, "batch_size": 32, "warmup_steps": 500, "model": "bert-base"},  # duplicate!
]

for config in configs:
    result = cache.get(config)
    if result is None:
        # Simulate training
        import random
        result = {
            "val_loss": round(random.uniform(0.3, 0.8), 4),
            "val_acc": round(random.uniform(0.7, 0.95), 4),
            "train_steps": 10000,
        }
        cache.put(config, result)
        print(f"  Trained. val_loss={result['val_loss']:.4f}")
    else:
        print(f"  Skipped training (found in cache). val_loss={result['val_loss']:.4f}")

[Diagram: hash table collision resolution strategies]


[Diagram: Bloom filter insert and query operation]


Production Engineering Notes

1. Python dict key hashing is randomized by default. Since Python 3.3, hash randomization (PYTHONHASHSEED) is enabled by default for security reasons (prevents hash flooding attacks). This means hash("some_string") gives different results across Python processes. For reproducible hashing across processes or machines (e.g., consistent hashing in a distributed system), use hashlib.md5 or hashlib.sha256 instead of Python's built-in hash().

2. Bloom filter false positive rate accumulates with corpus growth. If you build a Bloom filter sized for 10 billion documents but your corpus eventually grows to 20 billion, the actual false positive rate will be much higher than designed. Either over-provision at construction time (size for 2x expected), or rebuild periodically with a larger filter. Some production systems use multiple generations of Bloom filters: keep the current generation and the previous, merge them at a threshold. (The sketch after these notes quantifies how fast the rate degrades.)

3. Consistent hashing node count vs vnode count. With too few virtual nodes per physical node ($< 100$), key distribution becomes uneven. With too many ($> 500$), metadata (vnode positions stored in ZooKeeper or etcd) grows large. The Cassandra default has historically been 256 virtual nodes per physical node; a good starting point for ML feature stores is 150-200.

4. MinHash band/row selection for near-deduplication. LSH has threshold behavior: the candidate probability rises steeply around Jaccard similarity $s^* \approx (1/b)^{1/r}$. The settings used in the code above ($t = 128$ hash functions split into $b = 16$ bands of $r = 8$ rows) give $s^* \approx 0.71$, suitable for deduplicating web text at a 0.7-0.8 similarity threshold. Fewer rows per band (smaller $r$, more bands for the same $t$) lowers the threshold and flags more candidates - more aggressive deduplication at the cost of a larger LSH index and more pair verification; more rows per band raises the threshold and makes detection stricter.

5. Hyperparameter cache collisions are catastrophic. If your config hash function has collisions - two different configs mapping to the same hash - you will return incorrect cached results silently. Use at minimum 64-bit hashes (SHA-256 truncated to 16 hex chars). For large hyperparameter search spaces (millions of configs), use 128-bit hashes. Always store the original config alongside the hash in the cache record and verify the config matches on retrieval.
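
To make note 2 concrete, here is a small calculator (using the standard approximation from earlier; the specific numbers are illustrative) showing how the false positive rate degrades when a filter sized for a 1% target is pushed past its design capacity:

import math

def bloom_fpr(n: int, m: int, k: int) -> float:
    """Approximate false positive rate after n insertions into m bits with k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k

design_n = 10_000_000_000                                        # sized for 10 billion documents
m = math.ceil(-design_n * math.log(0.01) / math.log(2) ** 2)     # bits for a 1% target FPR
k = max(1, round(m / design_n * math.log(2)))                    # optimal hash count for that size

for factor in [1, 2, 4]:
    n = factor * design_n
    print(f"corpus at {factor}x design capacity: FPR ≈ {bloom_fpr(n, m, k):.1%}")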


:::danger Bloom Filters Only Have False Positives, Not False Negatives A Bloom filter can incorrectly claim an unseen item is in the set (false positive). It will NEVER claim a seen item is not in the set (false negative). For deduplication, this means: you may accidentally discard up to FPR% of unique documents (false positives treated as duplicates). You will never keep a true duplicate (no false negatives). Set your FPR target based on acceptable unique document loss: 1% FPR on a 1-billion-document corpus means losing up to 10 million unique documents. At pretraining scale this is usually acceptable. For smaller, higher-quality datasets where every unique example matters, use exact deduplication or set FPR to 0.01% and size the filter accordingly. :::

:::warning Python's hash() Is Not Safe for Cross-Process Applications Python 3.3+ randomizes hash seeds at process start (PYTHONHASHSEED). The same string will hash to a different value in different Python processes, across machines, and after a restart. Never use hash(key) to build a cache key that needs to persist across process boundaries, for consistent routing across multiple servers, or in any distributed system. Always use a deterministic hash like hashlib.sha256(key.encode()).hexdigest() for anything that crosses process or machine boundaries. :::
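
A minimal illustration of the deterministic-hash fix recommended in the warning above - deriving a stable 64-bit key from a cryptographic digest:

import hashlib

def stable_hash64(key: str) -> int:
    """Deterministic 64-bit hash: identical across processes, machines, and restarts."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")

# The built-in hash() would differ between runs; this value never changes.
print(stable_hash64("user_42_embedding"))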

:::tip Sizing a Bloom Filter in Practice The formula for bits per element: $m/n = -1.44 \log_2(p)$ where $p$ is the target false positive rate. Practical rules of thumb: 10 bits/element gives ~1% FPR, 15 bits/element gives ~0.1% FPR, 20 bits/element gives ~0.01% FPR. For a 1-billion-element filter at 0.1% FPR: $10^9 \times 15 = 1.5 \times 10^{10}$ bits $= 1.875$ GB. Almost always fits in a single machine's RAM for modern servers. :::


Interview Q&A

Q1: What is consistent hashing and why is it used for distributed ML feature stores?

Consistent hashing maps both nodes and keys to positions on a virtual ring (hash space $[0, 2^{32})$). A key is served by the first node clockwise from its position. When a node is added or removed, only the keys in that node's arc are remapped - approximately $N_\text{keys}/N_\text{nodes}$ keys, not all keys.

In a distributed feature store (e.g., Redis Cluster serving embedding lookups during inference), naive modulo routing ($\text{key} \bmod N$) remaps nearly all keys when $N$ changes - invalidating the entire cache. Consistent hashing limits remapping to $\sim 1/N$ of keys, so adding capacity to a feature store does not cause a thundering-herd cache miss. Virtual nodes (multiple ring positions per physical node) ensure even load distribution even with heterogeneous server sizes.

Q2: How does a Bloom filter work and what is the tradeoff between filter size and false positive rate?

A Bloom filter is a bit array of $m$ bits with $k$ hash functions. Inserting an element sets $k$ bits. Testing membership checks whether all $k$ bits are set - if any is 0, the element is definitely absent. False positive rate: $(1 - e^{-kn/m})^k$ where $n$ is the number of inserted elements. Optimal hash count: $k^* = (m/n) \ln 2$.

Tradeoffs: more bits per element (larger $m/n$) = lower false positive rate at the cost of more memory. Fewer hash functions = faster insertion/lookup but higher FPR. You cannot remove elements (bits set by multiple elements cannot be individually cleared) - use counting Bloom filters or cuckoo filters if deletion is needed.

ML application: pretraining corpus deduplication. At 12 bits/element (roughly 0.35% FPR), a 10-billion-document Bloom filter uses 15 GB vs 80 GB for an exact set of 64-bit document hashes - a 5.3x memory saving with a fraction of a percent imprecision.

Q3: How would you use MinHash LSH to deduplicate a 1-billion-document training corpus?

Step 1 - Shingling: represent each document as a set of 5-word shingles (k-grams). Step 2 - MinHash: compute 128 MinHash values per document using universal hash functions. Step 3 - LSH banding: split the 128 MinHash values into 16 bands of 8 each. Documents sharing the exact same 8-value band signature are candidate near-duplicates. Step 4 - Deduplication: for each candidate pair, estimate Jaccard similarity from the MinHash signatures (or compute it exactly on the shingle sets) and discard one document of each pair with Jaccard $\geq 0.8$.

The LSH step prevents comparing all $\binom{N}{2}$ pairs. Instead, only documents sharing a band signature are compared - far fewer pairs for any reasonable similarity threshold. Time complexity: $O(N \cdot t)$ to compute signatures, $O(N \cdot b)$ for LSH insertion, $O(N \cdot \text{avg candidates})$ for pair verification. Memory: $O(N \cdot t)$ for signatures. This is essentially the pipeline used to deduplicate several large LLM pretraining corpora.

Q4: What is the Python dict's load factor and what happens when it is exceeded?

Python's dict rehashes when the number of entries exceeds 2/3 of the table's capacity. When this threshold is crossed, a new, larger array is allocated (the exact growth factor depends on the CPython version; capacities are powers of 2, so the table roughly doubles) and all existing entries are reinserted into the new table using their hash values. This rehash is $O(n)$, but amortized over all insertions the cost is $O(1)$ per insert.

Practical implication: if you are inserting a known number of keys $n$ into a dict and performance matters, you cannot pre-size it from pure Python (unlike Java's HashMap(initialCapacity)). Building the dict in one pass - a comprehension or dict(zip(keys, vals)) - still resizes internally as it grows, though the amortized cost stays $O(1)$ per key; copying an existing dict with dict(other) or other.copy() allocates a right-sized table in one step.

Q5: How would you design a hyperparameter result cache for a distributed ML team?

Components: (1) Deterministic config hash using SHA-256 on JSON-serialized config with sorted keys - this is the cache key. (2) Store layer - S3 or a shared network filesystem, keyed by {hash[:8]}/{full_hash}.json (two-level directory to avoid too many files in one dir). (3) The stored record should contain: config, config hash, results (metrics), git commit hash, timestamp, hardware spec. (4) On retrieval, always verify the stored config matches the requested config to catch hash collisions (rare but critical to detect). (5) Invalidation: if you change evaluation code, bump a "schema version" in the cache key to automatically invalidate old results.

Gotcha: floating-point hyperparameters need careful serialization. The same logical value can serialize to different strings (0.1 vs 0.10000000000000001 vs 1e-1) depending on where the config came from, which changes the hash. Use json.dumps(config, sort_keys=True) and normalize floats consistently - for example with a custom encoder that rounds them to a fixed number of significant figures - so that logically identical configs always produce the same canonical string.

Q6: What is locality-sensitive hashing and how does it differ from regular hashing?

Regular hashing is designed to minimize collisions - similar inputs should hash to different buckets. LSH is designed to maximize collisions for similar inputs - items near each other in the input space should be mapped to the same bucket with high probability.

For MinHash LSH, the probability that two documents hash to the same bucket in a band is approximately $s^r$, where $s$ is their Jaccard similarity and $r$ is the band size. For $r = 5$ and $s = 0.8$: $0.8^5 = 0.33$ - a 33% chance per band. With $b = 20$ bands, the probability of sharing at least one band is $1 - (1 - 0.33)^{20} \approx 0.999$ - they will almost certainly be found as candidates. For $s = 0.3$: the per-band probability is $0.3^5 = 0.002$, and the chance of sharing any band is $1 - (1 - 0.002)^{20} \approx 0.039$ - only a 4% chance of being flagged as candidates.

This is the "amplification" property of LSH: high similarity pairs are reliably detected while low-similarity pairs are rarely flagged, dramatically reducing the number of candidate pairs to verify.
