Deep dive into HNSW, IVF, Product Quantization, IVFPQ, LSH, and DiskANN - how each algorithm trades recall for speed and how to choose the right one for your dataset.

How does approximate nearest neighbor work in practice?

Approximate Nearest Neighbor Algorithms covers HNSW, approximate nearest neighbor, IVF inverted file index from first principles with code examples. Free lesson at https://engineersofai.com/docs/ai-systems/vector-database-engineering/ann-algorithms

What is the difference between HNSW and IVF inverted file index?

See the full breakdown at https://engineersofai.com/docs/ai-systems/vector-database-engineering/ann-algorithms

:::tip 🎮 Interactive Playground Visualize this concept: Try the ANN Algorithms demo on the EngineersOfAI Playground - no code required. :::

Approximate Nearest Neighbor Algorithms

The Scaling Problem

The team had been running FAISS IndexFlatL2 in production for six months, and it worked fine - until it did not. With 1 million vectors at 768 dimensions, search latency was a comfortable 45ms. At 10 million, it climbed to 450ms. At 100 million, the math was clear: flat search would require 4.5 seconds per query, and the index would not even fit in RAM on a single machine.

The engineering team escalated: they needed sub-100ms search at 100M+ vectors. Someone recommended "just switch to HNSW." The engineer pulled up the FAISS documentation, saw parameters like M=16, efConstruction=200, efSearch=50, and had no mental model for what any of these controlled or what values to set. They set M=32 (it sounded bigger, therefore better) and efSearch=10 (it sounded faster), and shipped it. Recall dropped from 1.0 to 0.67. Users started complaining that the search was getting worse, not better.

This is a solvable problem, but only if you understand what HNSW is actually doing, what M and efSearch control, and how to reason about the recall-vs-latency tradeoff before you set any parameters. This lesson gives you that understanding for every major ANN algorithm: HNSW, IVF, product quantization, IVFPQ, LSH, and DiskANN.

After this lesson, you will be able to look at a dataset with given size, dimensionality, and update frequency, reason through which algorithm class fits, and tune the key parameters with a principled approach rather than guessing.

Why Exact Search Fails at Scale

The fundamental problem is that exact nearest neighbor search is $O(n \cdot d)$ per query, where $n$ is the number of vectors and $d$ is dimensionality. This scales linearly with dataset size.

For $n = 10^8$ and $d = 768$ : that is $7.68 \times 10^{10}$ floating-point multiply-accumulate operations per query. Even at 10 TFLOPS, that is 7.68 seconds per query. No caching, batching, or SIMD vectorization changes the asymptotic scaling - you are paying full price for every additional document in the corpus.

ANN algorithms escape this trap by sacrificing exactness. Rather than guaranteeing the true nearest neighbor, they guarantee finding it with high probability (typically 95–99%) while reducing the effective search to a small fraction of the full dataset. The savings are not marginal: well-tuned HNSW at 100M vectors achieves 1000+ queries per second with recall@10 above 0.95 on a single machine.

Algorithm 1: HNSW - Hierarchical Navigable Small World

HNSW (Malkov and Yashunin, 2018) is the dominant algorithm for in-memory ANN search. It is the default index in Qdrant, the primary index in Weaviate, and available in FAISS. Understanding HNSW is the most important thing you can do in this space.

The Intuition: A Navigational Highway System

Think of HNSW as a multi-layer road network. At the highest layer (layer 2), you have a few highways connecting distant nodes. At the middle layer (layer 1), you have more roads with shorter connections. At the base layer (layer 0), every vector is a node with many short-range connections to its actual neighbors.

To find the nearest neighbor of a query:

Enter the graph at the top layer at a fixed entry node
Greedily traverse to the closest node at that layer
Descend to the next layer and repeat
At the base layer, explore the local neighborhood thoroughly

The hierarchy allows the algorithm to navigate from "somewhere in the right general region" to "exactly the right local neighborhood" in $O(\log n)$ hops instead of $O(n)$ comparisons.

The Math Behind the Structure

At index construction, each new vector is inserted at a randomly chosen maximum layer $l$ , drawn from an exponential distribution:

$l = \lfloor -\ln(\text{random uniform}[0,1)) \cdot m_L \rfloor$

where $m_L = 1/\ln(M)$ and $M$ is the target number of connections per node. This exponential distribution ensures most vectors are only in the base layer, with exponentially fewer vectors at each higher layer - exactly like a skip list.

At each layer where the node is inserted, it connects to its $M$ nearest neighbors (or $M_{\max0}$ for the base layer, often set to $2M$ ).

The Two Key Parameters

M (default: 16) - the number of bidirectional links created per node at insertion. Higher M means:

Better recall (more connections, richer graph)
Larger memory footprint ( $O(M \cdot n \cdot d)$ per layer for the graph structure)
Slower index construction
Rule of thumb: M=8–16 for most use cases; M=32–64 for maximum recall on high-dimensional data

efSearch (default: 50) - the size of the dynamic candidate list during search. Higher efSearch means:

Better recall (explores more candidate nodes)
Slower query time (more distance computations)
This is the primary runtime tuning knob - adjust this without rebuilding the index

efConstruction (default: 200) - the ef parameter used during index building. Higher efConstruction:

Better index quality (more careful insertion)
Slower index building (not query time)
Typically set 2–4× higher than your target efSearch

import faiss
import numpy as np
import time

def build_hnsw_index(
    vectors: np.ndarray,
    M: int = 16,
    ef_construction: int = 200,
) -> faiss.IndexHNSWFlat:
    """
    Build FAISS HNSW index. Note: FAISS HNSW uses L2 distance internally.
    For cosine similarity, normalize vectors before indexing.
    """
    d = vectors.shape[1]
    index = faiss.IndexHNSWFlat(d, M)
    index.hnsw.efConstruction = ef_construction

    # FAISS HNSW does not support incremental add well - add all at once
    index.add(vectors.astype(np.float32))
    return index


def search_hnsw(
    index: faiss.IndexHNSWFlat,
    query: np.ndarray,
    k: int = 10,
    ef_search: int = 50,
) -> tuple:
    """Search HNSW with configurable efSearch."""
    index.hnsw.efSearch = ef_search
    distances, indices = index.search(
        query.reshape(1, -1).astype(np.float32), k
    )
    return indices[0], distances[0]


def tune_ef_search(
    index: faiss.IndexHNSWFlat,
    query_vectors: np.ndarray,
    true_neighbors: list,
    k: int = 10,
) -> None:
    """
    Find the efSearch value that hits your recall target at minimum latency.
    Run this during load testing before going live.
    """
    ef_values = [10, 20, 40, 80, 120, 200, 400]
    print(f"{'efSearch':>10} | {'Recall@10':>10} | {'P95 Latency (ms)':>18}")
    print("-" * 45)

    for ef in ef_values:
        results = []
        latencies = []

        for q in query_vectors[:200]:
            t0 = time.perf_counter()
            idx, _ = search_hnsw(index, q, k=k, ef_search=ef)
            latencies.append((time.perf_counter() - t0) * 1000)
            results.append(list(idx))

        # Compute recall
        recalls = []
        for true, pred in zip(true_neighbors[:200], results):
            true_set = set(true[:k])
            pred_set = set(pred[:k])
            recalls.append(len(true_set & pred_set) / k)

        p95 = float(np.percentile(latencies, 95))
        recall = float(np.mean(recalls))
        print(f"{ef:>10} | {recall:>10.4f} | {p95:>18.2f}")

Algorithm 2: IVF - Inverted File Index

IVF (Johnson et al., FAISS paper, 2017) takes a different approach: partition the vector space into clusters, and only search the clusters nearest to the query.

How IVF Works

Training: Run k-means clustering on a representative sample of the dataset to create $n_{list}$ centroids (cluster centers)
Indexing: Assign each vector to its nearest centroid; store an inverted list of vectors per centroid
Search: Find the $n_{probe}$ nearest centroids to the query; search only the vectors in those clusters

If you have 10,000 clusters (nlist=10000) and search 50 of them (nprobe=50), you only compute distances for roughly 0.5% of the dataset. That is a 200× speedup with recall depending on how well-clustered the data is.

IVF Parameters

nlist - number of clusters. Rule of thumb: $\sqrt{n}$ for small datasets, $4\sqrt{n}$ for better recall. For 10M vectors: nlist = 4000–16000. Training requires at least 30–50 vectors per centroid, so you need a representative training set.

nprobe - number of clusters to search at query time. This is the runtime tradeoff knob (like efSearch in HNSW):

nprobe=1: search only the nearest cluster (fast, low recall)
nprobe=nlist: search all clusters (exact search, slow)
Typical target: nprobe where recall@10 >= 0.95

def build_ivf_index(
    vectors: np.ndarray,
    nlist: int = 4096,
    metric: int = faiss.METRIC_L2,
) -> faiss.IndexIVFFlat:
    """
    Build IVF flat index. Requires a training step on a representative sample.
    """
    d = vectors.shape[1]
    quantizer = faiss.IndexFlatL2(d)  # coarse quantizer
    index = faiss.IndexIVFFlat(quantizer, d, nlist, metric)

    # Train on a representative sample
    # Need at least 30 * nlist vectors for good centroid quality
    n_train = min(len(vectors), max(50 * nlist, 100_000))
    train_sample = vectors[:n_train].astype(np.float32)
    print(f"Training IVF with {n_train} vectors, {nlist} clusters...")
    index.train(train_sample)

    # Add all vectors
    index.add(vectors.astype(np.float32))
    return index


def search_ivf(
    index: faiss.IndexIVFFlat,
    query: np.ndarray,
    k: int = 10,
    nprobe: int = 64,
) -> tuple:
    index.nprobe = nprobe
    distances, indices = index.search(
        query.reshape(1, -1).astype(np.float32), k
    )
    return indices[0], distances[0]

Algorithm 3: Product Quantization

Product Quantization (PQ) is a compression technique that reduces memory footprint by 8–64× at the cost of some recall. It is often combined with IVF to create IVFPQ, the workhorse for billion-scale search.

The Compression Idea

A 768-dimensional float32 vector takes 768 × 4 = 3072 bytes. For 100M vectors, that is 307 GB. PQ compresses vectors by:

Splitting the vector into $m$ sub-vectors of dimension $d/m$
Independently quantizing each sub-vector to one of $k^*$ centroids (typically 256, fitting in 1 byte)
Storing the centroid indices instead of the raw sub-vector

With $m=96$ sub-vectors and $k^*=256$ centroids per sub-vector: each vector is compressed to 96 bytes (from 3072 bytes - a 32× compression). The $m \times k^*$ codebooks for distance lookup are pre-computed, so approximate distances can be computed with table lookups rather than floating-point multiply-accumulate operations.

$d(q, x) \approx \sum_{j=1}^{m} d(q_j, c_{j,k_j})$

where $c_{j,k_j}$ is the centroid assigned to the $j$ -th sub-vector of $x$ , and $q_j$ is the $j$ -th sub-vector of the query.

def build_pq_index(
    vectors: np.ndarray,
    m: int = 96,      # number of sub-vectors (must divide d)
    n_bits: int = 8,  # bits per sub-vector (256 centroids)
) -> faiss.IndexPQ:
    """
    Build PQ index. Provides ~32x compression with some recall loss.
    Good for memory-constrained environments.
    """
    d = vectors.shape[1]
    assert d % m == 0, f"d={d} must be divisible by m={m}"

    index = faiss.IndexPQ(d, m, n_bits)
    index.train(vectors[:50_000].astype(np.float32))
    index.add(vectors.astype(np.float32))
    return index

Algorithm 4: IVFPQ - The Production Workhorse

IVFPQ combines IVF's coarse partitioning with PQ's compression. It is the standard approach for billion-scale vector search.

The search process:

Find the $n_{probe}$ nearest IVF centroids to the query
For each vector in those clusters, use the pre-computed PQ distance tables to quickly estimate the true distance
Return the top-k by approximate distance

IVFPQ at 1 billion vectors with nlist=65536, m=64, nprobe=64 requires roughly 64 GB RAM (for compressed vectors) rather than 3 TB for the raw vectors. Search QPS exceeds 1000 at recall@10 above 0.92 on modern hardware.

def build_ivfpq_index(
    vectors: np.ndarray,
    nlist: int = 4096,
    m: int = 96,
    n_bits: int = 8,
) -> faiss.IndexIVFPQ:
    """
    IVFPQ: the workhorse for large-scale production search.
    Combines IVF partitioning with PQ compression.
    """
    d = vectors.shape[1]
    assert d % m == 0, f"d={d} must be divisible by m={m}"

    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, n_bits)

    n_train = min(len(vectors), 50 * nlist)
    index.train(vectors[:n_train].astype(np.float32))
    index.add(vectors.astype(np.float32))
    return index

Algorithm 5: LSH - Locality Sensitive Hashing

LSH (Indyk and Motwani, 1998) takes a fundamentally different approach. Instead of building a graph or clustering vectors, it uses a family of hash functions with the property that similar vectors collide (get the same hash) with higher probability than dissimilar vectors.

For cosine similarity, random hyperplane hashing works: for each hash function $h_i$ , draw a random vector $r_i$ , and set $h_i(x) = \text{sign}(r_i \cdot x)$ . With $k$ such hash functions, similar vectors produce identical hash signatures with probability $\propto (1 - \theta/\pi)^k$ where $\theta$ is the angle between them.

Why LSH has fallen out of favor for most applications: LSH requires a large number of hash tables to achieve good recall, which increases memory. It also struggles with non-uniform data distributions. HNSW at the same recall level is typically faster, uses less memory, and is simpler to configure. LSH remains relevant for streaming applications and cases where you cannot afford $O(n)$ index construction time.

Algorithm 6: DiskANN

DiskANN (Subramanya et al., NeurIPS 2019, later commercialized by Microsoft/Azure) solves a specific problem: HNSW requires the entire graph to fit in RAM. At 1B vectors × 768 dimensions × 4 bytes = 3 TB, that does not fit in RAM on any reasonably-priced machine.

DiskANN stores most of the index on SSD and uses a smart prefetching strategy: given a node in the graph, prefetch the disk pages for its neighbors before navigating there. The graph is compressed with PQ in memory as a guide, with full precision vectors on disk for final scoring. This enables billion-scale search on machines with 64 GB RAM instead of 3 TB.

DiskANN-style storage is now available in Qdrant (on-disk HNSW) and Azure AI Search.

Algorithm Selection Framework

When update frequency matters

Update Pattern	Best Algorithm	Why
Mostly static	IVFPQ	Compression fine; rebuild on batch update
Frequent individual inserts	HNSW	Graph supports incremental insertion
High churn (delete + insert)	HNSW with periodic reindex	PQ index does not support efficient deletes
Streaming with strict SLA	LSH or ScaNN	Hash-based updates are $O(1)$ per vector

The Recall vs QPS Curve

Every ANN algorithm exposes a tradeoff between recall@K and queries per second (QPS). The primary knob is always "how much of the search space to explore":

For HNSW: efSearch
For IVF/IVFPQ: nprobe

The correct workflow for tuning is:

Establish a recall target (e.g., recall@10 >= 0.95)
Generate a test set of queries with ground truth from exact search
Sweep the exploration parameter from low to high
Find the minimum parameter value that hits your recall target
Benchmark QPS at that parameter value under realistic concurrent load

Never tune parameters without measuring recall. A fast index with 0.67 recall is not a fast index - it is a broken one.

def recall_qps_sweep(index, query_vectors, true_neighbors, k=10):
    """
    Generate recall vs QPS tradeoff curve.
    Works for any FAISS index - set the exploration parameter externally.
    """
    # For HNSW: ef_values are efSearch values
    # For IVF: ef_values are nprobe values
    ef_values = [10, 20, 40, 80, 160, 320]

    results = []
    for ef in ef_values:
        # Caller must set index parameter before this function
        # (e.g., index.hnsw.efSearch = ef or index.nprobe = ef)
        t0 = time.perf_counter()
        n_queries = min(500, len(query_vectors))

        all_predicted = []
        for q in query_vectors[:n_queries]:
            distances, indices = index.search(
                q.reshape(1, -1).astype(np.float32), k
            )
            all_predicted.append(list(indices[0]))

        elapsed = time.perf_counter() - t0
        qps = n_queries / elapsed

        recalls = []
        for true, pred in zip(true_neighbors[:n_queries], all_predicted):
            true_set = set(true[:k])
            pred_set = set(pred[:k])
            recalls.append(len(true_set & pred_set) / k)

        results.append({
            "ef": ef,
            "recall_at_k": float(np.mean(recalls)),
            "qps": float(qps),
        })

    return results

Production Engineering Notes

Index building time is a production constraint. Building an HNSW index over 100M vectors with M=16, efConstruction=200 takes 2–4 hours on a 32-core machine. Plan index rebuilds as offline jobs that build to a staging location, validate recall, and swap atomically. Never rebuild in-place on the live serving index.

Memory capacity planning for HNSW. Each vector stored in HNSW requires: raw vector storage ( $n \cdot d \cdot 4$ bytes for float32) plus graph overhead ( $n \cdot M \cdot 2 \cdot 4$ bytes for the bidirectional links). For 10M vectors at d=768 and M=16: 10M × 768 × 4 = 30 GB for vectors + 10M × 16 × 2 × 4 = 1.28 GB for graph. Total: ~31 GB plus OS overhead. Plan for 1.5–2× to have room for index construction.

HNSW and deletes. HNSW does not efficiently support deletes. Deleting a node from the middle of the graph requires relinking its neighbors. Most implementations mark vectors as deleted (soft delete) and filter them out during search, which wastes memory and degrades recall as more vectors accumulate as soft-deleted. Rebuild the index periodically to clean up.

Common Mistakes

:::danger Setting efSearch equal to efConstruction during index building efConstruction is a build-time quality parameter. Setting efSearch very high does not "use more of the efConstruction quality" - it just makes queries slower. The index quality is baked in at construction time. At search time, use the minimum efSearch that hits your recall target. :::

:::danger Using IVFPQ without adequate training data IVF k-means requires at least 30–50 vectors per centroid for reliable cluster quality. With nlist=10000, you need at least 300,000 training vectors. Training on fewer vectors produces poor centroids, low recall, and non-deterministic behavior when vectors land in centroids far from their true neighbors. :::

:::warning Assuming algorithm A is always faster than algorithm B Performance depends heavily on hardware, dataset characteristics (dimensionality, distribution), and your specific recall target. HNSW beats IVFPQ at moderate scale; IVFPQ wins when RAM is the bottleneck. Always benchmark on your own dataset. ANN-Benchmarks results on SIFT or GIST may not reflect your embedding model's distribution. :::

:::tip Use FAISS's factory string for rapid prototyping FAISS supports an index factory string syntax for rapid index configuration:

# "IVF4096,Flat" = IVF with 4096 clusters, flat (no PQ)
# "IVF4096,PQ64" = IVFPQ with 4096 clusters and 64 sub-vectors
# "HNSW32,Flat" = HNSW with M=32
# "PCA64,IVF4096,PQ32" = PCA reduction to 64 dims, then IVFPQ
index = faiss.index_factory(d, "IVF4096,PQ96x8", faiss.METRIC_L2)

Use this for experiments. Specify parameters explicitly for production. :::

Interview Questions

Q1: Explain how HNSW achieves sub-linear search complexity.

HNSW builds a multi-layer graph where higher layers have fewer nodes with long-range connections and lower layers have more nodes with short-range connections. Search starts at the top layer and greedily navigates toward the query, descending layer by layer. Each layer reduces the search space by a factor proportional to the average node degree $M$ . The result is $O(\log n)$ hops to reach the right local region, then a thorough local search at the base layer. The key insight is that the exponential layer assignment ensures the graph has small-world properties - any two nodes can be connected in $O(\log n)$ hops.

Q2: What are the M and efSearch parameters in HNSW, and how do you tune them?

M is the number of bidirectional links per node in the graph. Higher M means richer connectivity, better recall, more memory, and slower construction. Typical values: M=8 for speed, M=16 for balance, M=32–64 for maximum recall. efSearch is the search beam width - the size of the candidate list maintained during graph traversal. Higher efSearch improves recall at the cost of more distance computations. Tune efSearch at query time: sweep from 10 to 500, measure recall@10 against exact search ground truth, and use the minimum value that hits your recall target (e.g., 0.95). Never set efSearch without measuring recall.

Q3: When would you choose IVFPQ over HNSW?

When RAM is the binding constraint. HNSW requires all vectors in memory: 100M vectors × 768 dims × 4 bytes = 307 GB, plus graph overhead. IVFPQ with 8-bit PQ compresses each vector to 96 bytes (from 3072), reducing the 100M vector footprint to ~9 GB. The tradeoff: IVFPQ has slightly lower recall at equivalent latency compared to HNSW due to quantization error. Choose IVFPQ when you cannot afford the RAM for full HNSW, or when going beyond 100M vectors on a single machine.

Q4: A dataset has high update frequency with frequent deletes. Which algorithm would you choose?

HNSW with periodic full rebuilds. HNSW supports incremental inserts (you can call index.add() at any time), which IVF/IVFPQ do not efficiently support. For deletes, use soft deletion (mark vectors as deleted in metadata, filter at query time) and schedule periodic index rebuilds to reclaim memory and restore recall degraded by tombstoned entries. The rebuild can be done in the background on a shadow copy, with atomic swap once recall validation passes.

Q5: Explain product quantization. What is the memory vs recall tradeoff?

PQ splits each d-dimensional vector into m sub-vectors of d/m dimensions each. Each sub-vector is quantized to one of 256 centroids (1 byte), learned by running separate k-means on each sub-vector dimension independently. The full vector is represented by m bytes instead of d×4 bytes - a 4d/m× compression ratio. For d=768, m=96: 32× compression (3072 bytes → 96 bytes). The recall loss comes from quantization error: the approximate distance computed from centroid lookups deviates from the true distance. In practice, recall@10 drops 3–8% compared to exact IVF at the same nprobe value. The memory savings make it worthwhile for billion-scale search where exact storage is impossible.

Q6: You need to serve 1 billion 1536-dimensional vectors at recall@10 = 0.92 and p99 latency under 200ms. Design the solution.

1536-dimensional vectors at 1B items = 6 TB raw, which requires a disk-based approach. Use DiskANN or Qdrant's on-disk HNSW: build the HNSW graph structure, store full-precision vectors on NVMe SSD (random read latency 50–100 µs), keep a PQ-compressed version in RAM as a coarse guide for navigation. For 200ms p99: NVMe I/O at 50 µs per random read × ~20–30 hops = ~1.5ms I/O latency, leaving ample room. Shard across 4–8 machines for load distribution. Use 3 replicas per shard for fault tolerance. Index construction: run as an offline batch job on a machine with large RAM, then copy the index shards to serving nodes.

The Scaling Problem​

Why Exact Search Fails at Scale​

Algorithm 1: HNSW - Hierarchical Navigable Small World​

The Intuition: A Navigational Highway System​

The Math Behind the Structure​

The Two Key Parameters​

Algorithm 2: IVF - Inverted File Index​

How IVF Works​

IVF Parameters​

Algorithm 3: Product Quantization​

The Compression Idea​

Algorithm 4: IVFPQ - The Production Workhorse​

Algorithm 5: LSH - Locality Sensitive Hashing​

Algorithm 6: DiskANN​

Algorithm Selection Framework​

When update frequency matters​

The Recall vs QPS Curve​

Production Engineering Notes​

Common Mistakes​

Interview Questions​