Skip to main content

Retrieval Algorithms and ANN

The 3-Second Query That Couldn't Ship

The ML infrastructure team at a mid-size fintech was building semantic search over 50 million financial documents - regulatory filings, analyst reports, internal research. The embedding pipeline worked beautifully. The vector store was populated. Then they ran the first production query.

3.2 seconds. For a single search.

The calculation was simple and devastating. Exact nearest neighbor search over 50M vectors at 1536 dimensions required computing 50M cosine similarities. Each computation is roughly 2d2d floating-point operations, so 50M×3072=153.650M \times 3072 = 153.6 billion FLOPs per query. A modern CPU can do roughly 100 GFLOPS. That's 1.5 seconds theoretical minimum, and in practice with memory bandwidth constraints, more like 3-5 seconds.

At 100 queries per second (their target), they'd need 300-500 CPU cores dedicated to similarity computation. The infrastructure cost was prohibitive. The latency was unusable.

The solution was approximate nearest neighbor indexing - trading a small amount of retrieval accuracy (the approximate results differ from exact results by a few percent) for a 1000x reduction in query time. The first HNSW index they built brought that 3.2-second query to 8ms with 95% recall@10 versus exact search. They shipped.

This lesson explains how ANN algorithms work, how to configure them, and how to choose between them for your latency-recall-memory budget.

Why This Exists: The Curse of Dimensionality

Nearest neighbor search is easy in 2 or 3 dimensions - you build a spatial tree (KD-tree, ball tree) and search in O(logn)O(\log n) time. But these tree-based structures break down as dimensionality increases. The "curse of dimensionality" means that in high dimensions, the distance to the nearest neighbor and the distance to the farthest neighbor converge - all points look approximately equidistant. Tree partitioning stops being informative.

At 1536 dimensions (OpenAI embedding size), any spatial tree degrades to near-linear scan performance. You need fundamentally different approaches.

The ANN research community spent the 2000s and 2010s developing approaches that work in high dimensions. The major contributions:

  • LSH (Locality-Sensitive Hashing): hash similar vectors to the same bucket. Approximate, but hash collision rates are hard to control.
  • IVF (Inverted File Index, ~2003): cluster vectors, search only nearby clusters.
  • HNSW (2016, Malkov & Yashunin): navigable small-world graph, greedy graph search.
  • ScaNN (2020, Google): anisotropic quantization tuned for inner product search.
  • DiskANN (2019, Microsoft): disk-resident graph for billion-scale datasets.

Exact Nearest Neighbor: The Baseline

Before discussing approximations, understand what exact search looks like and why it fails.

Linear scan: For each query vector qq, compute sim(q,vi)\text{sim}(q, v_i) for all nn vectors viv_i. Return the top kk.

Complexity: O(nd)O(n \cdot d) per query.

When exact search is viable:

  • Under 100K vectors: linear scan takes roughly 10-50ms depending on dimension
  • You can't tolerate approximation error (rare in practice - 99% recall is usually fine)
  • Using FAISS flat index with GPU acceleration (GPU does \sim100 TFLOPS, can handle 1M vectors in <10ms)
import faiss
import numpy as np

# Generate synthetic data
n_vectors = 100_000
dim = 1536
np.random.seed(42)
vectors = np.random.randn(n_vectors, dim).astype(np.float32)
# Normalize for cosine similarity (converts to IP search)
faiss.normalize_L2(vectors)

# Exact search (IndexFlatIP = flat inner product, works with normalized vectors)
index_flat = faiss.IndexFlatIP(dim)
index_flat.add(vectors)

# Query
query = np.random.randn(1, dim).astype(np.float32)
faiss.normalize_L2(query)

import time
t0 = time.time()
scores, indices = index_flat.search(query, k=10)
print(f"Exact search on {n_vectors} vectors: {(time.time()-t0)*1000:.1f}ms")
# ~50-200ms depending on hardware

HNSW: The Dominant Algorithm

Hierarchical Navigable Small World (Malkov & Yashunin, 2016) is the algorithm behind most production vector databases. Understanding it is essential for tuning retrieval.

The Small-World Graph Intuition

Imagine you need to find someone in a social network of 100M people. You don't check all 100M - you start with people you know who have broad connections (e.g., well-connected professionals), ask them to refer you toward your target, and navigate through 6 degrees of separation in roughly log(n)\log(n) hops.

HNSW builds exactly this kind of navigable graph. The "small-world" property means any two nodes can be reached via a short path. The "hierarchical" part means there's a multi-layer structure: top layers are sparse with long-range connections (like knowing international connectors), lower layers are dense with local connections (like knowing your immediate neighbors).

Construction

Building an HNSW index:

  1. Start with an empty graph
  2. For each new vector vv: a. Determine its insertion layer \ell by sampling from an exponential distribution (P()eP(\ell) \propto e^{-\ell}) b. Starting from the entry point, greedily navigate from the top layer down to layer +1\ell+1 c. At layer \ell down to layer 0, find the nearest neighbors of vv using beam search with beam width ef_construction d. Bidirectionally connect vv to its MM nearest neighbors at each layer (pruning to keep at most MM connections per node)

Key parameters:

  • M: Maximum connections per node per layer. Controls graph connectivity. Typical values: 8-64. Higher M = better recall, more memory, slower construction.
  • ef_construction: Beam width during construction. Controls how carefully we select neighbors. Typical values: 100-200. Higher = better graph quality, slower construction.
import faiss
import numpy as np
import time

n_vectors = 1_000_000
dim = 1536

vectors = np.random.randn(n_vectors, dim).astype(np.float32)
faiss.normalize_L2(vectors)

# HNSW index parameters
M = 32 # connections per node (memory/quality trade-off)
ef_construction = 128 # build quality (higher = better graph, slower build)

index_hnsw = faiss.IndexHNSWFlat(dim, M, faiss.METRIC_INNER_PRODUCT)
index_hnsw.hnsw.efConstruction = ef_construction

print(f"Building HNSW index on {n_vectors} vectors...")
t0 = time.time()
index_hnsw.add(vectors)
build_time = time.time() - t0
print(f"Build time: {build_time:.1f}s")

# Set ef (search beam width) - controls recall-latency trade-off at query time
index_hnsw.hnsw.efSearch = 64 # higher = more recall, slower queries

query = np.random.randn(1, dim).astype(np.float32)
faiss.normalize_L2(query)

t0 = time.time()
scores, indices = index_hnsw.search(query, k=10)
latency_ms = (time.time() - t0) * 1000
print(f"HNSW search: {latency_ms:.2f}ms")

# Measure recall@10 vs exact search
def measure_recall(approx_index, exact_index, queries, k=10):
_, exact_ids = exact_index.search(queries, k)
_, approx_ids = approx_index.search(queries, k)
recalls = []
for exact, approx in zip(exact_ids, approx_ids):
hits = len(set(exact) & set(approx))
recalls.append(hits / k)
return np.mean(recalls)

test_queries = np.random.randn(100, dim).astype(np.float32)
faiss.normalize_L2(test_queries)

index_flat = faiss.IndexFlatIP(dim)
index_flat.add(vectors)

recall = measure_recall(index_hnsw, index_flat, test_queries)
print(f"Recall@10 vs exact: {recall:.3f}")

The ef vs Recall/Latency Trade-Off

efSearch (or ef in some implementations) is the most important query-time parameter. It controls the beam width during graph search - how many candidates are tracked during navigation.

efSearchRecall@10Latency (1M vectors)
16~0.80~2ms
32~0.90~4ms
64~0.95~8ms
128~0.98~16ms
200~0.99~25ms

The right value depends on your SLA. For most RAG applications, Recall@10 of 0.95+ at under 10ms is achievable and sufficient.

IVF: Cluster-Based Retrieval

Inverted File Index partitions the vector space into nlistnlist clusters (using k-means). To search, it identifies the nprobenprobe closest clusters to the query and searches only within those clusters.

Intuition: If your query is about financial regulation, the most relevant documents are probably clustered near other financial documents. You don't need to check the cooking recipe cluster.

Strengths: Very memory-efficient. Training the cluster centroids is a one-time cost. Easy to understand and debug.

Limitations: Boundary artifacts - vectors near cluster boundaries may be in the wrong cluster. Low nprobe = fast but poor recall; high nprobe = good recall but approaches linear scan.

import faiss
import numpy as np

n_vectors = 1_000_000
dim = 1536
nlist = 1024 # number of clusters - rule of thumb: sqrt(n) to 4*sqrt(n)

vectors = np.random.randn(n_vectors, dim).astype(np.float32)
faiss.normalize_L2(vectors)

# IVF requires training on a sample to learn cluster centroids
quantizer = faiss.IndexFlatIP(dim) # exact search for centroid assignment
index_ivf = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)

# Train on a sample (usually 10-100x nlist samples)
train_size = min(100 * nlist, n_vectors)
index_ivf.train(vectors[:train_size])
index_ivf.add(vectors)

# nprobe: how many clusters to search - the recall/latency dial
index_ivf.nprobe = 32 # check 32/1024 = 3% of clusters

query = np.random.randn(1, dim).astype(np.float32)
faiss.normalize_L2(query)
scores, indices = index_ivf.search(query, k=10)

nlist selection:

  • Under 1M vectors: nlist = int(4 * sqrt(n_vectors)) - roughly 4000 for 1M vectors
  • Over 1M vectors: nlist = 65536 (256 IVF centers with sub-quantization)

nprobe selection: Start at nprobe = nlist / 32 (check ~3% of clusters). Measure recall. Increase until you hit recall target or latency budget.

IVF-PQ: Memory Compression with Product Quantization

At scale (100M+ vectors), storing full 1536-dimensional float32 vectors requires enormous memory. 100M × 1536 × 4 bytes = 614 GB. This doesn't fit in RAM.

Product Quantization (PQ) compresses vectors by dividing the dd-dimensional space into mm subspaces of d/md/m dimensions each, and encoding each subspace as a code from a codebook of kk^* centroids (typically k=256k^* = 256). Each subspace is stored as 1 byte (the centroid index). A 1536-dim vector is compressed from 6144 bytes to 96 bytes (at m=96m=96) - a 64x compression ratio.

import faiss
import numpy as np

n_vectors = 10_000_000 # 10M vectors
dim = 1536

vectors = np.random.randn(n_vectors, dim).astype(np.float32)
faiss.normalize_L2(vectors)

# IVF-PQ: cluster-based search with compressed vector storage
nlist = 8192 # clusters
M = 96 # number of PQ subspaces (dim must be divisible by M)
# 1536 / 96 = 16 dims per subspace
# Each subspace uses 1 byte (256 centroids)
# Total per vector: 96 bytes (vs 6144 bytes for float32)

quantizer = faiss.IndexFlatIP(dim)
index_ivfpq = faiss.IndexIVFPQ(quantizer, dim, nlist, M, 8)
# 8 = bits per code (256 centroids per subspace)

# Training requires sufficient data to learn both IVF and PQ centroids
train_vectors = vectors[:500_000]
print("Training IVF-PQ index...")
index_ivfpq.train(train_vectors)
index_ivfpq.add(vectors)

print(f"Memory with full vectors: {n_vectors * dim * 4 / 1e9:.1f} GB")
# IVF-PQ memory: much smaller
# n_vectors * M bytes + nlist * dim * 4 bytes (centroids)
compressed_gb = (n_vectors * M + nlist * dim * 4) / 1e9
print(f"Memory with IVF-PQ: {compressed_gb:.1f} GB")

index_ivfpq.nprobe = 64
query = np.random.randn(1, dim).astype(np.float32)
faiss.normalize_L2(query)
scores, indices = index_ivfpq.search(query, k=10)

The compression-recall trade-off: PQ introduces quantization error. At m=96m=96 (96 bytes per 1536-dim vector), recall@10 typically drops from ~0.99 (exact) to ~0.85. At m=192m=192 (192 bytes), recall is ~0.92. The right mm depends on your quality requirements.

ScaNN: Anisotropic Quantization

Scalable Nearest Neighbors (ScaNN) was developed at Google (2020) and is used in production at Google Search scale. The key insight: when you're looking for the top-kk neighbors, you care most about quantizing the vectors that have the highest dot product with the query - not the "average" vector. Standard PQ treats all vectors equally; ScaNN applies anisotropic quantization that prioritizes accuracy for high-similarity vectors.

ScaNN achieves 10-20% better recall at the same memory budget compared to standard IVF-PQ on most benchmarks. It's available as the scann Python package:

import scann
import numpy as np

n_vectors = 1_000_000
dim = 1536
vectors = np.random.randn(n_vectors, dim).astype(np.float32)
# ScaNN uses MIPS (Maximum Inner Product Search) - normalize for cosine
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

searcher = (
scann.scann_ops_pybind.builder(vectors, 10, "dot_product")
.tree(num_leaves=2000, num_leaves_to_search=100, training_sample_size=250000)
.score_ah(2, anisotropic_quantization_threshold=0.2) # key ScaNN innovation
.reorder(100) # rerank top-100 with exact scores
.build()
)

query = np.random.randn(1, dim).astype(np.float32)
query = query / np.linalg.norm(query)
neighbors, distances = searcher.search_batched(query)

ScaNN is the right choice when you need maximum performance at fixed memory budget and don't mind the Google-ecosystem dependency.

DiskANN (Microsoft, 2019) solves the problem that even IVF-PQ can't fully address: datasets that are too large to fit their index in RAM at all.

DiskANN builds an HNSW-like graph but stores vectors on SSD (NVMe). The graph structure is carefully designed to minimize random disk accesses during search - it batches disk reads for graph neighbors, exploiting NVMe's high random-read IOPS.

At 1 billion vectors at 1536 dimensions:

  • Full float32: 6 TB RAM - impossible
  • IVF-PQ (96 bytes): 96 GB RAM - expensive
  • DiskANN: ~100 GB SSD + ~10 GB RAM (for compressed graph in RAM) - feasible

DiskANN is available as diskann-py or through Microsoft's AzureSearch. It's the production solution at Microsoft-scale document retrieval.

The ann-benchmarks.com Reference

ann-benchmarks.com is the canonical reference for ANN algorithm comparisons. It benchmarks algorithms across multiple real-world datasets at various recall levels, reporting queries-per-second (QPS) vs recall@10.

How to read the benchmark:

  • Each point on the QPS vs recall curve represents a specific parameter setting (e.g., different efSearch values for HNSW)
  • You want algorithms that are furthest to the upper-right (high QPS at high recall)
  • HNSW consistently dominates: at recall@10 = 0.95, HNSW achieves 10-100x higher QPS than IVF-Flat on most datasets

The key insight from benchmarks: The right algorithm depends on your index size and latency requirements:

ScenarioRecommended Algorithm
Under 500K vectorsHNSW-Flat
500K - 50M, memory constrainedIVF-PQ
50M - 1B, memory not constrainedHNSW-PQ or ScaNN
1B+, won't fit in RAMDiskANN
Maximum accuracy, GPU availableFlat + GPU

FAISS HNSW vs IVF Comparison

import faiss
import numpy as np
import time

def benchmark_index(index, vectors, queries, k=10, name=""):
"""Measure recall@k and QPS vs exact search."""
# Build flat index for ground truth
flat = faiss.IndexFlatIP(vectors.shape[1])
flat.add(vectors)
_, exact_ids = flat.search(queries, k)

# Query the approximate index
t0 = time.time()
n_warmup = 10
for q in queries[:n_warmup]:
index.search(q.reshape(1, -1), k)
warmup_time = time.time() - t0

t0 = time.time()
_, approx_ids = index.search(queries, k)
total_time = time.time() - t0

# Compute recall@k
recalls = [
len(set(e) & set(a)) / k
for e, a in zip(exact_ids, approx_ids)
]
mean_recall = np.mean(recalls)
qps = len(queries) / total_time

print(f"{name}: Recall@{k}={mean_recall:.3f}, QPS={qps:.0f}, "
f"Latency={(1000/qps):.1f}ms")

n = 500_000
d = 128 # use lower dim for fast demo
rng = np.random.default_rng(42)
data = rng.standard_normal((n, d)).astype(np.float32)
faiss.normalize_L2(data)
queries = rng.standard_normal((100, d)).astype(np.float32)
faiss.normalize_L2(queries)

# HNSW
hnsw = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
hnsw.hnsw.efConstruction = 128
hnsw.hnsw.efSearch = 64
hnsw.add(data)
benchmark_index(hnsw, data, queries, name="HNSW(M=32, ef=64)")

# IVF-Flat
nlist = 512
quantizer = faiss.IndexFlatIP(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
ivf.train(data)
ivf.add(data)
ivf.nprobe = 32
benchmark_index(ivf, data, queries, name="IVF-Flat(nlist=512, nprobe=32)")

# IVF-PQ
ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, 16, 8)
ivfpq.train(data)
ivfpq.add(data)
ivfpq.nprobe = 32
benchmark_index(ivfpq, data, queries, name="IVF-PQ(nlist=512, m=16, nprobe=32)")

Parameter Tuning Guide

Production Engineering Notes

Index persistence: FAISS indexes don't auto-persist. Always save after building: faiss.write_index(index, "my_index.faiss"). For production, version your index files alongside your vector data.

Index rebuild vs incremental update: HNSW supports efficient dynamic insertion - adding new vectors doesn't require a full rebuild. However, FAISS's HNSW implementation requires a full rebuild for deletions (it doesn't support true deletes). IVF-based indexes also require retraining if your data distribution changes significantly. Plan a periodic rebuild schedule.

Multi-GPU scaling: FAISS supports GPU acceleration and multi-GPU with faiss.StandardGpuResources(). A single A100 GPU can search 1M 1536-dim vectors in under 5ms, versus 50-200ms on CPU. For latency-critical applications with millions of vectors, GPU FAISS is transformative.

Thread safety: FAISS indexes are not thread-safe for concurrent adds. For concurrent reads (queries), they're safe. Use read-write locking or index swapping (build new index offline, swap atomically) for production update patterns.

Common Mistakes

:::danger Building HNSW Without Normalization for Cosine Similarity FAISS HNSW with METRIC_INNER_PRODUCT correctly computes cosine similarity only on L2-normalized vectors. On unnormalized vectors, it computes raw dot product, which correlates with vector magnitude, not semantic similarity. Always normalize before adding to the index and normalize query vectors at search time. Use faiss.normalize_L2(vectors) in-place. :::

:::danger Using ef_construction = ef_Search Value ef_construction controls build quality - how carefully neighbors are selected during index construction. efSearch controls search quality - how many candidates are explored during query. Setting efSearch > ef_construction is wasteful: you can't recover graph quality at query time that wasn't built in during construction. Always build with high ef_construction (128-200). Tune efSearch downward at query time to hit your latency budget. :::

:::warning Not Measuring Recall on Your Actual Data Algorithm benchmarks from ann-benchmarks.com use specific public datasets. Your data distribution may differ significantly. The correct approach: sample 1000+ representative queries from your actual query log with known relevant documents. Measure recall@10 for your specific index parameters. A 5% recall difference you ignored in benchmarks may be a 5% user satisfaction gap in production. :::

Interview Questions and Answers

Q: Explain how HNSW works at a high level. Why does it beat tree-based approaches in high dimensions?

A: HNSW builds a hierarchical graph where each node (vector) has connections at multiple layers. Top layers have long-range connections to globally similar vectors; lower layers have dense local connections. Search starts at the top layer, greedily moves toward the query, drops to the next layer, and repeats. This exploits the "small-world" property - any two vectors can be reached in O(logn)O(\log n) hops. Tree-based approaches (KD-trees, ball trees) fail in high dimensions because axis-aligned splits become uninformative - in 1536 dimensions, orthogonal projections don't separate similar from dissimilar vectors. HNSW avoids this by learning connections directly in the original space rather than projecting. The key parameters are M (connections per node, controlling memory and recall) and efSearch (search beam width, controlling the recall-latency trade-off).

Q: What is product quantization and what does it trade for memory savings?

A: PQ divides the dd-dimensional vector space into mm subspaces of d/md/m dimensions each. Each subspace is independently quantized using a codebook of kk^* centroids (typically 256). A vector is encoded as mm codebook indices - requiring mm bytes instead of d×4d \times 4 bytes for float32. The compression ratio is 4d/m4d / m - for d=1536d=1536, m=96m=96: 64x compression. The trade-off: quantization error. The reconstructed vector is not identical to the original; it's the concatenation of the nearest codebook centroids in each subspace. Similarity computed on quantized vectors has approximation error relative to exact similarity. At m=d/16m = d/16 (e.g., 96 for 1536-dim), typical recall@10 drops by 5-10% compared to exact. At m=d/8m = d/8, the drop is 1-3%. PQ is always used with IVF (IVF-PQ), where the coarse cluster selection provides additional precision.

Q: What's the difference between nlist and nprobe in an IVF index?

A: nlist is set at index build time: the number of k-means clusters the vector space is partitioned into. A good starting point is nlist = int(4 * sqrt(n)) for nn vectors. nprobe is set at query time: the number of clusters to search. With nlist=1024 and nprobe=32, each query searches 32/1024 = 3% of the vector space. Higher nprobe = higher recall + higher latency. This is the key recall-latency knob for IVF indexes. If nprobe = nlist, IVF degrades to a linear scan with centroid lookup overhead. In practice: set nlist to achieve cluster sizes of 1000-10000 vectors, then tune nprobe until you hit your recall target within your latency budget.

Q: When would you choose IVF-PQ over HNSW?

A: IVF-PQ is preferable when: (1) memory is the binding constraint - IVF-PQ compresses vectors 10-100x (60x for 1536-dim at 96 bytes/vector vs 6144 bytes float32), while HNSW stores full vectors; (2) you have 50M+ vectors that don't fit in RAM as an HNSW index; (3) build time matters - HNSW construction is O(nlogn)O(n \log n) with significant constant factors; IVF training is a one-time k-means, and adding vectors is fast. Choose HNSW when memory is not constrained and you want maximum recall per millisecond - HNSW consistently outperforms IVF-PQ at the same recall level in terms of QPS on ann-benchmarks. The practical rule: HNSW for under 50M vectors with adequate RAM; IVF-PQ for larger scale.

Q: Your ANN index achieves 95% Recall@10. What does this mean and is it good enough for production RAG?

A: Recall@10 of 0.95 means that 95% of the time, all 10 of the true nearest neighbors appear in the returned top-10 results. The other 5% of the time, at least one of the true top-10 is replaced by a slightly worse candidate. For RAG, this is generally excellent. Consider: if your embedding model correctly orders the relevant document in the top 3 for 90% of queries, and your ANN has 95% recall, the combined miss rate is roughly 5-10% of queries where ANN error causes a relevant document to be missed. This is typically far smaller than other error sources (embedding model limitations, chunking artifacts, query-document mismatch). The rare cases where the exact vs approximate difference matters can be caught by adding a reranking step, which reorders candidates using a more expensive model. In practice, tuning for 95-99% Recall@10 is the standard production target for retrieval-critical applications.

Building a FAISS Index Manager for Production

A complete production wrapper around FAISS with persistence, versioning, and recall measurement:

import faiss
import numpy as np
import pickle
import os
from pathlib import Path
from typing import List, Tuple, Optional
import time

class FAISSIndexManager:
"""
Production FAISS index manager with persistence, versioning,
and recall measurement capabilities.
"""

def __init__(
self,
dim: int = 1536,
index_type: str = "hnsw", # "hnsw", "ivf", "ivfpq", "flat"
m: int = 32, # HNSW: connections per layer
ef_construction: int = 128, # HNSW: build quality
nlist: int = 1024, # IVF: number of clusters
m_pq: int = 96, # IVF-PQ: number of PQ subspaces
storage_path: str = "./faiss_index",
):
self.dim = dim
self.index_type = index_type
self.storage_path = Path(storage_path)
self.storage_path.mkdir(exist_ok=True)
self.metadata: List[dict] = [] # parallel array of metadata

# Build the appropriate index
if index_type == "flat":
self.index = faiss.IndexFlatIP(dim)
elif index_type == "hnsw":
self.index = faiss.IndexHNSWFlat(dim, m, faiss.METRIC_INNER_PRODUCT)
self.index.hnsw.efConstruction = ef_construction
elif index_type == "ivf":
quantizer = faiss.IndexFlatIP(dim)
self.index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
elif index_type == "ivfpq":
quantizer = faiss.IndexFlatIP(dim)
self.index = faiss.IndexIVFPQ(quantizer, dim, nlist, m_pq, 8)
else:
raise ValueError(f"Unknown index type: {index_type}")

# Query-time parameters
if index_type == "hnsw":
self.index.hnsw.efSearch = 64
elif index_type in ("ivf", "ivfpq"):
self._nprobe = 32

def train(self, vectors: np.ndarray):
"""Train IVF-based indexes. Not needed for HNSW or Flat."""
if hasattr(self.index, 'train') and not self.index.is_trained:
print(f"Training {self.index_type} index on {len(vectors)} vectors...")
t0 = time.time()
self.index.train(vectors)
print(f"Training complete in {time.time()-t0:.1f}s")
if self.index_type in ("ivf", "ivfpq"):
self.index.nprobe = self._nprobe

def add(self, vectors: np.ndarray, metadata: Optional[List[dict]] = None):
"""Add normalized vectors to the index."""
assert vectors.shape[1] == self.dim, f"Expected dim {self.dim}, got {vectors.shape[1]}"
# Ensure normalized
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
vectors = vectors / np.clip(norms, 1e-8, None)
self.index.add(vectors)
if metadata:
self.metadata.extend(metadata)
else:
self.metadata.extend([{}] * len(vectors))

def search(
self,
query: np.ndarray,
k: int = 10,
) -> Tuple[np.ndarray, np.ndarray]:
"""Search for k nearest neighbors. Returns (scores, indices)."""
if query.ndim == 1:
query = query.reshape(1, -1)
# Normalize query
norms = np.linalg.norm(query, axis=1, keepdims=True)
query = query / np.clip(norms, 1e-8, None)
return self.index.search(query, k)

def set_ef_search(self, ef: int):
"""Adjust HNSW search quality at query time."""
if self.index_type == "hnsw":
self.index.hnsw.efSearch = ef

def set_nprobe(self, nprobe: int):
"""Adjust IVF search breadth at query time."""
if self.index_type in ("ivf", "ivfpq"):
self.index.nprobe = nprobe

def measure_recall(
self,
test_vectors: np.ndarray,
k: int = 10,
n_samples: int = 100,
) -> float:
"""Measure Recall@k vs exact search on a sample of vectors."""
flat = faiss.IndexFlatIP(self.dim)
flat.add(self.index.reconstruct_n(0, self.index.ntotal)
if hasattr(self.index, 'reconstruct_n') else test_vectors)

sample_idx = np.random.choice(len(test_vectors), min(n_samples, len(test_vectors)), replace=False)
samples = test_vectors[sample_idx]

_, exact_ids = flat.search(samples, k)
_, approx_ids = self.search(samples, k)

recalls = [
len(set(e) & set(a)) / k
for e, a in zip(exact_ids, approx_ids)
]
return float(np.mean(recalls))

def save(self, name: str = "index"):
"""Persist index and metadata to disk."""
faiss.write_index(self.index, str(self.storage_path / f"{name}.faiss"))
with open(self.storage_path / f"{name}_metadata.pkl", "wb") as f:
pickle.dump(self.metadata, f)
print(f"Saved {self.index.ntotal} vectors to {self.storage_path}/{name}")

def load(self, name: str = "index"):
"""Load index from disk."""
self.index = faiss.read_index(str(self.storage_path / f"{name}.faiss"))
with open(self.storage_path / f"{name}_metadata.pkl", "rb") as f:
self.metadata = pickle.load(f)
if self.index_type in ("ivf", "ivfpq"):
self.index.nprobe = self._nprobe
print(f"Loaded {self.index.ntotal} vectors from {self.storage_path}/{name}")

@property
def size(self) -> int:
return self.index.ntotal

def info(self):
print(f"Index type: {self.index_type}")
print(f"Dimension: {self.dim}")
print(f"Vectors: {self.size}")
memory_estimate_mb = self.size * self.dim * 4 / 1e6
print(f"Memory estimate (float32): {memory_estimate_mb:.0f} MB")


# Example usage
manager = FAISSIndexManager(
dim=1536,
index_type="hnsw",
m=32,
ef_construction=128,
storage_path="./my_index"
)

# Generate and add vectors
rng = np.random.default_rng(42)
vectors = rng.standard_normal((10000, 1536)).astype(np.float32)
manager.add(vectors, metadata=[{"id": i} for i in range(10000)])

# Search
query = rng.standard_normal(1536).astype(np.float32)
scores, indices = manager.search(query, k=10)

# Tune ef for different recall/latency trade-offs
for ef in [16, 32, 64, 128]:
manager.set_ef_search(ef)
t0 = time.time()
scores, indices = manager.search(query, k=10)
latency = (time.time() - t0) * 1000
print(f"efSearch={ef}: {latency:.2f}ms")

manager.save("production_v1")
manager.info()

The Memory-Recall-Latency Cube

ANN index selection is a three-way trade-off. Visualizing it helps make decisions:

ConfigurationMemory (1M × 1536-dim)Recall@10Latency
Flat (exact)6 GB1.00200ms
HNSW M=86.5 GB0.882ms
HNSW M=327.5 GB0.975ms
HNSW M=649 GB0.999ms
IVF-Flat nprobe=166 GB0.828ms
IVF-Flat nprobe=646 GB0.9430ms
IVF-PQ m=96 nprobe=32200 MB0.835ms
IVF-PQ m=192 nprobe=64350 MB0.9110ms

For most RAG applications: HNSW with M=32 and efSearch=64 is the starting point. It offers excellent recall (97%) with fast query times and manageable memory. Only deviate from this when your constraints specifically require it.

Index Warm-Up and Cold Start Mitigation

HNSW indexes must be loaded into RAM to serve queries. After a service restart, the first few queries may be slow as the OS page cache warms. Mitigate this:

import faiss
import numpy as np
import time

class WarmableIndexManager:
"""HNSW index manager with warm-up on load."""

def __init__(self, index_path: str):
self.index_path = index_path
self.index = None

def load_and_warm(self, n_warmup_queries: int = 100):
"""Load index and run warm-up queries to populate OS page cache."""
print(f"Loading index from {self.index_path}...")
t0 = time.time()
self.index = faiss.read_index(self.index_path)
load_time = time.time() - t0
print(f"Loaded in {load_time:.1f}s ({self.index.ntotal} vectors)")

if self.index.ntotal == 0:
return

# Run warm-up queries with random vectors
print(f"Warming up with {n_warmup_queries} queries...")
dim = self.index.d
rng = np.random.default_rng(42)
warmup_queries = rng.standard_normal((n_warmup_queries, dim)).astype(np.float32)
faiss.normalize_L2(warmup_queries)

t0 = time.time()
self.index.search(warmup_queries, 10)
warmup_time = time.time() - t0
print(f"Warm-up complete in {warmup_time:.1f}s")
print(f"Expected query latency: {warmup_time*1000/n_warmup_queries:.1f}ms")

def search(self, query: np.ndarray, k: int = 10):
if query.ndim == 1:
query = query.reshape(1, -1)
norm = np.linalg.norm(query)
if norm > 0:
query = query / norm
return self.index.search(query, k)

ANN Index Monitoring in Production

Once deployed, monitor your ANN index health continuously:

import time
import statistics
from typing import List, Optional
from collections import deque
import numpy as np

class ANNIndexMonitor:
"""
Monitor ANN index query latency and recall in production.
Detects degradation and triggers alerts.
"""

def __init__(
self,
index,
exact_index, # Flat index for recall computation
window_size: int = 1000,
latency_alert_ms: float = 50.0,
recall_alert_threshold: float = 0.85,
):
self.index = index
self.exact_index = exact_index
self.latency_window = deque(maxlen=window_size)
self.recall_window = deque(maxlen=100)
self.latency_alert_ms = latency_alert_ms
self.recall_alert_threshold = recall_alert_threshold

def search_with_monitoring(
self,
query: np.ndarray,
k: int = 10,
check_recall: bool = False,
) -> tuple:
"""Search while collecting latency metrics."""
if query.ndim == 1:
query = query.reshape(1, -1)

t0 = time.perf_counter()
scores, indices = self.index.search(query, k)
latency_ms = (time.perf_counter() - t0) * 1000

self.latency_window.append(latency_ms)

# Periodically check recall (expensive - only 1% of queries)
if check_recall and len(self.latency_window) % 100 == 0:
_, exact_indices = self.exact_index.search(query, k)
recall = len(set(indices[0]) & set(exact_indices[0])) / k
self.recall_window.append(recall)

return scores, indices

def get_metrics(self) -> dict:
"""Compute current performance metrics."""
if not self.latency_window:
return {}

latencies = list(self.latency_window)
metrics = {
"p50_latency_ms": statistics.median(latencies),
"p95_latency_ms": np.percentile(latencies, 95),
"p99_latency_ms": np.percentile(latencies, 99),
"mean_latency_ms": statistics.mean(latencies),
"query_count": len(latencies),
}

if self.recall_window:
metrics["mean_recall_at_k"] = np.mean(list(self.recall_window))
metrics["recall_samples"] = len(self.recall_window)

# Alerts
metrics["latency_alert"] = metrics["p95_latency_ms"] > self.latency_alert_ms
if "mean_recall_at_k" in metrics:
metrics["recall_alert"] = metrics["mean_recall_at_k"] < self.recall_alert_threshold

return metrics

def report(self):
"""Print a formatted performance report."""
m = self.get_metrics()
print(f"ANN Index Performance ({m.get('query_count', 0)} queries):")
print(f" P50 latency: {m.get('p50_latency_ms', 0):.1f}ms")
print(f" P95 latency: {m.get('p95_latency_ms', 0):.1f}ms")
print(f" P99 latency: {m.get('p99_latency_ms', 0):.1f}ms")
if "mean_recall_at_k" in m:
print(f" Recall@10: {m['mean_recall_at_k']:.3f}")
if m.get("latency_alert"):
print(" ALERT: P95 latency exceeds threshold!")
if m.get("recall_alert"):
print(" ALERT: Recall falling below threshold!")

Summary: Choosing Your ANN Algorithm

The ANN algorithm decision tree for RAG systems is straightforward:

  1. Start with HNSW. It is the best-performing algorithm for most RAG workloads under 50M vectors with adequate RAM. Use M=32 and efConstruction=128 for building; tune efSearch between 32 and 128 to hit your recall/latency target.

  2. Switch to IVF-PQ only if memory is the binding constraint. At 100M+ vectors, HNSW may require more RAM than you have. IVF-PQ compresses 64x at modest recall cost.

  3. Use DiskANN for billion-scale datasets. If your corpus is too large to fit even compressed vectors in RAM, DiskANN enables NVMe-resident indexes with acceptable latency.

  4. Always measure recall@10 vs your exact-search baseline before deploying any ANN index to production. A 90% recall index means 10% of queries miss a relevant document that should have been retrieved.

  5. Monitor in production. Query latency and recall can degrade as the index accumulates updates. Schedule periodic index rebuilds during low-traffic windows.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the RAG Pipeline demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.