Skip to main content

:::tip ๐ŸŽฎ Interactive Playground Visualize this concept: Try the ANN Algorithms demo on the EngineersOfAI Playground - no code required. :::

Approximate Nearest Neighbor Algorithms

The Scaling Problemโ€‹

The team had been running FAISS IndexFlatL2 in production for six months, and it worked fine - until it did not. With 1 million vectors at 768 dimensions, search latency was a comfortable 45ms. At 10 million, it climbed to 450ms. At 100 million, the math was clear: flat search would require 4.5 seconds per query, and the index would not even fit in RAM on a single machine.

The engineering team escalated: they needed sub-100ms search at 100M+ vectors. Someone recommended "just switch to HNSW." The engineer pulled up the FAISS documentation, saw parameters like M=16, efConstruction=200, efSearch=50, and had no mental model for what any of these controlled or what values to set. They set M=32 (it sounded bigger, therefore better) and efSearch=10 (it sounded faster), and shipped it. Recall dropped from 1.0 to 0.67. Users started complaining that the search was getting worse, not better.

This is a solvable problem, but only if you understand what HNSW is actually doing, what M and efSearch control, and how to reason about the recall-vs-latency tradeoff before you set any parameters. This lesson gives you that understanding for every major ANN algorithm: HNSW, IVF, product quantization, IVFPQ, LSH, and DiskANN.

After this lesson, you will be able to look at a dataset with given size, dimensionality, and update frequency, reason through which algorithm class fits, and tune the key parameters with a principled approach rather than guessing.


Why Exact Search Fails at Scaleโ€‹

The fundamental problem is that exact nearest neighbor search is O(nโ‹…d)O(n \cdot d) per query, where nn is the number of vectors and dd is dimensionality. This scales linearly with dataset size.

For n=108n = 10^8 and d=768d = 768: that is 7.68ร—10107.68 \times 10^{10} floating-point multiply-accumulate operations per query. Even at 10 TFLOPS, that is 7.68 seconds per query. No caching, batching, or SIMD vectorization changes the asymptotic scaling - you are paying full price for every additional document in the corpus.

ANN algorithms escape this trap by sacrificing exactness. Rather than guaranteeing the true nearest neighbor, they guarantee finding it with high probability (typically 95โ€“99%) while reducing the effective search to a small fraction of the full dataset. The savings are not marginal: well-tuned HNSW at 100M vectors achieves 1000+ queries per second with recall@10 above 0.95 on a single machine.


Algorithm 1: HNSW - Hierarchical Navigable Small Worldโ€‹

HNSW (Malkov and Yashunin, 2018) is the dominant algorithm for in-memory ANN search. It is the default index in Qdrant, the primary index in Weaviate, and available in FAISS. Understanding HNSW is the most important thing you can do in this space.

The Intuition: A Navigational Highway Systemโ€‹

Think of HNSW as a multi-layer road network. At the highest layer (layer 2), you have a few highways connecting distant nodes. At the middle layer (layer 1), you have more roads with shorter connections. At the base layer (layer 0), every vector is a node with many short-range connections to its actual neighbors.

To find the nearest neighbor of a query:

  1. Enter the graph at the top layer at a fixed entry node
  2. Greedily traverse to the closest node at that layer
  3. Descend to the next layer and repeat
  4. At the base layer, explore the local neighborhood thoroughly

The hierarchy allows the algorithm to navigate from "somewhere in the right general region" to "exactly the right local neighborhood" in O(logโกn)O(\log n) hops instead of O(n)O(n) comparisons.

The Math Behind the Structureโ€‹

At index construction, each new vector is inserted at a randomly chosen maximum layer ll, drawn from an exponential distribution:

l=โŒŠโˆ’lnโก(randomย uniform[0,1))โ‹…mLโŒ‹l = \lfloor -\ln(\text{random uniform}[0,1)) \cdot m_L \rfloor

where mL=1/lnโก(M)m_L = 1/\ln(M) and MM is the target number of connections per node. This exponential distribution ensures most vectors are only in the base layer, with exponentially fewer vectors at each higher layer - exactly like a skip list.

At each layer where the node is inserted, it connects to its MM nearest neighbors (or Mmaxโก0M_{\max0} for the base layer, often set to 2M2M).

The Two Key Parametersโ€‹

M (default: 16) - the number of bidirectional links created per node at insertion. Higher M means:

  • Better recall (more connections, richer graph)
  • Larger memory footprint (O(Mโ‹…nโ‹…d)O(M \cdot n \cdot d) per layer for the graph structure)
  • Slower index construction
  • Rule of thumb: M=8โ€“16 for most use cases; M=32โ€“64 for maximum recall on high-dimensional data

efSearch (default: 50) - the size of the dynamic candidate list during search. Higher efSearch means:

  • Better recall (explores more candidate nodes)
  • Slower query time (more distance computations)
  • This is the primary runtime tuning knob - adjust this without rebuilding the index

efConstruction (default: 200) - the ef parameter used during index building. Higher efConstruction:

  • Better index quality (more careful insertion)
  • Slower index building (not query time)
  • Typically set 2โ€“4ร— higher than your target efSearch
import faiss
import numpy as np
import time

def build_hnsw_index(
vectors: np.ndarray,
M: int = 16,
ef_construction: int = 200,
) -> faiss.IndexHNSWFlat:
"""
Build FAISS HNSW index. Note: FAISS HNSW uses L2 distance internally.
For cosine similarity, normalize vectors before indexing.
"""
d = vectors.shape[1]
index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efConstruction = ef_construction

# FAISS HNSW does not support incremental add well - add all at once
index.add(vectors.astype(np.float32))
return index


def search_hnsw(
index: faiss.IndexHNSWFlat,
query: np.ndarray,
k: int = 10,
ef_search: int = 50,
) -> tuple:
"""Search HNSW with configurable efSearch."""
index.hnsw.efSearch = ef_search
distances, indices = index.search(
query.reshape(1, -1).astype(np.float32), k
)
return indices[0], distances[0]


def tune_ef_search(
index: faiss.IndexHNSWFlat,
query_vectors: np.ndarray,
true_neighbors: list,
k: int = 10,
) -> None:
"""
Find the efSearch value that hits your recall target at minimum latency.
Run this during load testing before going live.
"""
ef_values = [10, 20, 40, 80, 120, 200, 400]
print(f"{'efSearch':>10} | {'Recall@10':>10} | {'P95 Latency (ms)':>18}")
print("-" * 45)

for ef in ef_values:
results = []
latencies = []

for q in query_vectors[:200]:
t0 = time.perf_counter()
idx, _ = search_hnsw(index, q, k=k, ef_search=ef)
latencies.append((time.perf_counter() - t0) * 1000)
results.append(list(idx))

# Compute recall
recalls = []
for true, pred in zip(true_neighbors[:200], results):
true_set = set(true[:k])
pred_set = set(pred[:k])
recalls.append(len(true_set & pred_set) / k)

p95 = float(np.percentile(latencies, 95))
recall = float(np.mean(recalls))
print(f"{ef:>10} | {recall:>10.4f} | {p95:>18.2f}")

Algorithm 2: IVF - Inverted File Indexโ€‹

IVF (Johnson et al., FAISS paper, 2017) takes a different approach: partition the vector space into clusters, and only search the clusters nearest to the query.

How IVF Worksโ€‹

  1. Training: Run k-means clustering on a representative sample of the dataset to create nlistn_{list} centroids (cluster centers)
  2. Indexing: Assign each vector to its nearest centroid; store an inverted list of vectors per centroid
  3. Search: Find the nproben_{probe} nearest centroids to the query; search only the vectors in those clusters

If you have 10,000 clusters (nlist=10000) and search 50 of them (nprobe=50), you only compute distances for roughly 0.5% of the dataset. That is a 200ร— speedup with recall depending on how well-clustered the data is.

IVF Parametersโ€‹

nlist - number of clusters. Rule of thumb: n\sqrt{n} for small datasets, 4n4\sqrt{n} for better recall. For 10M vectors: nlist = 4000โ€“16000. Training requires at least 30โ€“50 vectors per centroid, so you need a representative training set.

nprobe - number of clusters to search at query time. This is the runtime tradeoff knob (like efSearch in HNSW):

  • nprobe=1: search only the nearest cluster (fast, low recall)
  • nprobe=nlist: search all clusters (exact search, slow)
  • Typical target: nprobe where recall@10 >= 0.95
def build_ivf_index(
vectors: np.ndarray,
nlist: int = 4096,
metric: int = faiss.METRIC_L2,
) -> faiss.IndexIVFFlat:
"""
Build IVF flat index. Requires a training step on a representative sample.
"""
d = vectors.shape[1]
quantizer = faiss.IndexFlatL2(d) # coarse quantizer
index = faiss.IndexIVFFlat(quantizer, d, nlist, metric)

# Train on a representative sample
# Need at least 30 * nlist vectors for good centroid quality
n_train = min(len(vectors), max(50 * nlist, 100_000))
train_sample = vectors[:n_train].astype(np.float32)
print(f"Training IVF with {n_train} vectors, {nlist} clusters...")
index.train(train_sample)

# Add all vectors
index.add(vectors.astype(np.float32))
return index


def search_ivf(
index: faiss.IndexIVFFlat,
query: np.ndarray,
k: int = 10,
nprobe: int = 64,
) -> tuple:
index.nprobe = nprobe
distances, indices = index.search(
query.reshape(1, -1).astype(np.float32), k
)
return indices[0], distances[0]

Algorithm 3: Product Quantizationโ€‹

Product Quantization (PQ) is a compression technique that reduces memory footprint by 8โ€“64ร— at the cost of some recall. It is often combined with IVF to create IVFPQ, the workhorse for billion-scale search.

The Compression Ideaโ€‹

A 768-dimensional float32 vector takes 768 ร— 4 = 3072 bytes. For 100M vectors, that is 307 GB. PQ compresses vectors by:

  1. Splitting the vector into mm sub-vectors of dimension d/md/m
  2. Independently quantizing each sub-vector to one of kโˆ—k^* centroids (typically 256, fitting in 1 byte)
  3. Storing the centroid indices instead of the raw sub-vector

With m=96m=96 sub-vectors and kโˆ—=256k^*=256 centroids per sub-vector: each vector is compressed to 96 bytes (from 3072 bytes - a 32ร— compression). The mร—kโˆ—m \times k^* codebooks for distance lookup are pre-computed, so approximate distances can be computed with table lookups rather than floating-point multiply-accumulate operations.

d(q,x)โ‰ˆโˆ‘j=1md(qj,cj,kj)d(q, x) \approx \sum_{j=1}^{m} d(q_j, c_{j,k_j})

where cj,kjc_{j,k_j} is the centroid assigned to the jj-th sub-vector of xx, and qjq_j is the jj-th sub-vector of the query.

def build_pq_index(
vectors: np.ndarray,
m: int = 96, # number of sub-vectors (must divide d)
n_bits: int = 8, # bits per sub-vector (256 centroids)
) -> faiss.IndexPQ:
"""
Build PQ index. Provides ~32x compression with some recall loss.
Good for memory-constrained environments.
"""
d = vectors.shape[1]
assert d % m == 0, f"d={d} must be divisible by m={m}"

index = faiss.IndexPQ(d, m, n_bits)
index.train(vectors[:50_000].astype(np.float32))
index.add(vectors.astype(np.float32))
return index

Algorithm 4: IVFPQ - The Production Workhorseโ€‹

IVFPQ combines IVF's coarse partitioning with PQ's compression. It is the standard approach for billion-scale vector search.

The search process:

  1. Find the nproben_{probe} nearest IVF centroids to the query
  2. For each vector in those clusters, use the pre-computed PQ distance tables to quickly estimate the true distance
  3. Return the top-k by approximate distance

IVFPQ at 1 billion vectors with nlist=65536, m=64, nprobe=64 requires roughly 64 GB RAM (for compressed vectors) rather than 3 TB for the raw vectors. Search QPS exceeds 1000 at recall@10 above 0.92 on modern hardware.

def build_ivfpq_index(
vectors: np.ndarray,
nlist: int = 4096,
m: int = 96,
n_bits: int = 8,
) -> faiss.IndexIVFPQ:
"""
IVFPQ: the workhorse for large-scale production search.
Combines IVF partitioning with PQ compression.
"""
d = vectors.shape[1]
assert d % m == 0, f"d={d} must be divisible by m={m}"

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, n_bits)

n_train = min(len(vectors), 50 * nlist)
index.train(vectors[:n_train].astype(np.float32))
index.add(vectors.astype(np.float32))
return index

Algorithm 5: LSH - Locality Sensitive Hashingโ€‹

LSH (Indyk and Motwani, 1998) takes a fundamentally different approach. Instead of building a graph or clustering vectors, it uses a family of hash functions with the property that similar vectors collide (get the same hash) with higher probability than dissimilar vectors.

For cosine similarity, random hyperplane hashing works: for each hash function hih_i, draw a random vector rir_i, and set hi(x)=sign(riโ‹…x)h_i(x) = \text{sign}(r_i \cdot x). With kk such hash functions, similar vectors produce identical hash signatures with probability โˆ(1โˆ’ฮธ/ฯ€)k\propto (1 - \theta/\pi)^k where ฮธ\theta is the angle between them.

Why LSH has fallen out of favor for most applications: LSH requires a large number of hash tables to achieve good recall, which increases memory. It also struggles with non-uniform data distributions. HNSW at the same recall level is typically faster, uses less memory, and is simpler to configure. LSH remains relevant for streaming applications and cases where you cannot afford O(n)O(n) index construction time.


Algorithm 6: DiskANNโ€‹

DiskANN (Subramanya et al., NeurIPS 2019, later commercialized by Microsoft/Azure) solves a specific problem: HNSW requires the entire graph to fit in RAM. At 1B vectors ร— 768 dimensions ร— 4 bytes = 3 TB, that does not fit in RAM on any reasonably-priced machine.

DiskANN stores most of the index on SSD and uses a smart prefetching strategy: given a node in the graph, prefetch the disk pages for its neighbors before navigating there. The graph is compressed with PQ in memory as a guide, with full precision vectors on disk for final scoring. This enables billion-scale search on machines with 64 GB RAM instead of 3 TB.

DiskANN-style storage is now available in Qdrant (on-disk HNSW) and Azure AI Search.


Algorithm Selection Frameworkโ€‹

When update frequency mattersโ€‹

Update PatternBest AlgorithmWhy
Mostly staticIVFPQCompression fine; rebuild on batch update
Frequent individual insertsHNSWGraph supports incremental insertion
High churn (delete + insert)HNSW with periodic reindexPQ index does not support efficient deletes
Streaming with strict SLALSH or ScaNNHash-based updates are O(1)O(1) per vector

The Recall vs QPS Curveโ€‹

Every ANN algorithm exposes a tradeoff between recall@K and queries per second (QPS). The primary knob is always "how much of the search space to explore":

  • For HNSW: efSearch
  • For IVF/IVFPQ: nprobe

The correct workflow for tuning is:

  1. Establish a recall target (e.g., recall@10 >= 0.95)
  2. Generate a test set of queries with ground truth from exact search
  3. Sweep the exploration parameter from low to high
  4. Find the minimum parameter value that hits your recall target
  5. Benchmark QPS at that parameter value under realistic concurrent load

Never tune parameters without measuring recall. A fast index with 0.67 recall is not a fast index - it is a broken one.

def recall_qps_sweep(index, query_vectors, true_neighbors, k=10):
"""
Generate recall vs QPS tradeoff curve.
Works for any FAISS index - set the exploration parameter externally.
"""
# For HNSW: ef_values are efSearch values
# For IVF: ef_values are nprobe values
ef_values = [10, 20, 40, 80, 160, 320]

results = []
for ef in ef_values:
# Caller must set index parameter before this function
# (e.g., index.hnsw.efSearch = ef or index.nprobe = ef)
t0 = time.perf_counter()
n_queries = min(500, len(query_vectors))

all_predicted = []
for q in query_vectors[:n_queries]:
distances, indices = index.search(
q.reshape(1, -1).astype(np.float32), k
)
all_predicted.append(list(indices[0]))

elapsed = time.perf_counter() - t0
qps = n_queries / elapsed

recalls = []
for true, pred in zip(true_neighbors[:n_queries], all_predicted):
true_set = set(true[:k])
pred_set = set(pred[:k])
recalls.append(len(true_set & pred_set) / k)

results.append({
"ef": ef,
"recall_at_k": float(np.mean(recalls)),
"qps": float(qps),
})

return results

Production Engineering Notesโ€‹

Index building time is a production constraint. Building an HNSW index over 100M vectors with M=16, efConstruction=200 takes 2โ€“4 hours on a 32-core machine. Plan index rebuilds as offline jobs that build to a staging location, validate recall, and swap atomically. Never rebuild in-place on the live serving index.

Memory capacity planning for HNSW. Each vector stored in HNSW requires: raw vector storage (nโ‹…dโ‹…4n \cdot d \cdot 4 bytes for float32) plus graph overhead (nโ‹…Mโ‹…2โ‹…4n \cdot M \cdot 2 \cdot 4 bytes for the bidirectional links). For 10M vectors at d=768 and M=16: 10M ร— 768 ร— 4 = 30 GB for vectors + 10M ร— 16 ร— 2 ร— 4 = 1.28 GB for graph. Total: ~31 GB plus OS overhead. Plan for 1.5โ€“2ร— to have room for index construction.

HNSW and deletes. HNSW does not efficiently support deletes. Deleting a node from the middle of the graph requires relinking its neighbors. Most implementations mark vectors as deleted (soft delete) and filter them out during search, which wastes memory and degrades recall as more vectors accumulate as soft-deleted. Rebuild the index periodically to clean up.


Common Mistakesโ€‹

:::danger Setting efSearch equal to efConstruction during index building efConstruction is a build-time quality parameter. Setting efSearch very high does not "use more of the efConstruction quality" - it just makes queries slower. The index quality is baked in at construction time. At search time, use the minimum efSearch that hits your recall target. :::

:::danger Using IVFPQ without adequate training data IVF k-means requires at least 30โ€“50 vectors per centroid for reliable cluster quality. With nlist=10000, you need at least 300,000 training vectors. Training on fewer vectors produces poor centroids, low recall, and non-deterministic behavior when vectors land in centroids far from their true neighbors. :::

:::warning Assuming algorithm A is always faster than algorithm B Performance depends heavily on hardware, dataset characteristics (dimensionality, distribution), and your specific recall target. HNSW beats IVFPQ at moderate scale; IVFPQ wins when RAM is the bottleneck. Always benchmark on your own dataset. ANN-Benchmarks results on SIFT or GIST may not reflect your embedding model's distribution. :::

:::tip Use FAISS's factory string for rapid prototyping FAISS supports an index factory string syntax for rapid index configuration:

# "IVF4096,Flat" = IVF with 4096 clusters, flat (no PQ)
# "IVF4096,PQ64" = IVFPQ with 4096 clusters and 64 sub-vectors
# "HNSW32,Flat" = HNSW with M=32
# "PCA64,IVF4096,PQ32" = PCA reduction to 64 dims, then IVFPQ
index = faiss.index_factory(d, "IVF4096,PQ96x8", faiss.METRIC_L2)

Use this for experiments. Specify parameters explicitly for production. :::


Interview Questionsโ€‹

Q1: Explain how HNSW achieves sub-linear search complexity.

HNSW builds a multi-layer graph where higher layers have fewer nodes with long-range connections and lower layers have more nodes with short-range connections. Search starts at the top layer and greedily navigates toward the query, descending layer by layer. Each layer reduces the search space by a factor proportional to the average node degree MM. The result is O(logโกn)O(\log n) hops to reach the right local region, then a thorough local search at the base layer. The key insight is that the exponential layer assignment ensures the graph has small-world properties - any two nodes can be connected in O(logโกn)O(\log n) hops.

Q2: What are the M and efSearch parameters in HNSW, and how do you tune them?

M is the number of bidirectional links per node in the graph. Higher M means richer connectivity, better recall, more memory, and slower construction. Typical values: M=8 for speed, M=16 for balance, M=32โ€“64 for maximum recall. efSearch is the search beam width - the size of the candidate list maintained during graph traversal. Higher efSearch improves recall at the cost of more distance computations. Tune efSearch at query time: sweep from 10 to 500, measure recall@10 against exact search ground truth, and use the minimum value that hits your recall target (e.g., 0.95). Never set efSearch without measuring recall.

Q3: When would you choose IVFPQ over HNSW?

When RAM is the binding constraint. HNSW requires all vectors in memory: 100M vectors ร— 768 dims ร— 4 bytes = 307 GB, plus graph overhead. IVFPQ with 8-bit PQ compresses each vector to 96 bytes (from 3072), reducing the 100M vector footprint to ~9 GB. The tradeoff: IVFPQ has slightly lower recall at equivalent latency compared to HNSW due to quantization error. Choose IVFPQ when you cannot afford the RAM for full HNSW, or when going beyond 100M vectors on a single machine.

Q4: A dataset has high update frequency with frequent deletes. Which algorithm would you choose?

HNSW with periodic full rebuilds. HNSW supports incremental inserts (you can call index.add() at any time), which IVF/IVFPQ do not efficiently support. For deletes, use soft deletion (mark vectors as deleted in metadata, filter at query time) and schedule periodic index rebuilds to reclaim memory and restore recall degraded by tombstoned entries. The rebuild can be done in the background on a shadow copy, with atomic swap once recall validation passes.

Q5: Explain product quantization. What is the memory vs recall tradeoff?

PQ splits each d-dimensional vector into m sub-vectors of d/m dimensions each. Each sub-vector is quantized to one of 256 centroids (1 byte), learned by running separate k-means on each sub-vector dimension independently. The full vector is represented by m bytes instead of dร—4 bytes - a 4d/mร— compression ratio. For d=768, m=96: 32ร— compression (3072 bytes โ†’ 96 bytes). The recall loss comes from quantization error: the approximate distance computed from centroid lookups deviates from the true distance. In practice, recall@10 drops 3โ€“8% compared to exact IVF at the same nprobe value. The memory savings make it worthwhile for billion-scale search where exact storage is impossible.

Q6: You need to serve 1 billion 1536-dimensional vectors at recall@10 = 0.92 and p99 latency under 200ms. Design the solution.

1536-dimensional vectors at 1B items = 6 TB raw, which requires a disk-based approach. Use DiskANN or Qdrant's on-disk HNSW: build the HNSW graph structure, store full-precision vectors on NVMe SSD (random read latency 50โ€“100 ยตs), keep a PQ-compressed version in RAM as a coarse guide for navigation. For 200ms p99: NVMe I/O at 50 ยตs per random read ร— ~20โ€“30 hops = ~1.5ms I/O latency, leaving ample room. Shard across 4โ€“8 machines for load distribution. Use 3 replicas per shard for fault tolerance. Index construction: run as an offline batch job on a machine with large RAM, then copy the index shards to serving nodes.

ยฉ 2026 EngineersOfAI. All rights reserved.