Skip to main content

:::tip 🎮 Interactive Playground Visualize this concept: Try the FAISS Index Types demo on the EngineersOfAI Playground - no code required. :::

Embedding Stores

4.2 Seconds to 3 Milliseconds

A recommendation system needed to answer one question: given a user's recent viewing history, which 10 items from a catalog of 10 million should be shown next? The team had trained a two-tower neural network. User history was encoded into a 256-dimensional embedding. Each of the 10 million catalog items also had a 256-dimensional embedding. Finding the 10 most similar items meant finding the 10 catalog embeddings with the highest cosine similarity to the user embedding.

The naive implementation: compute cosine similarity between the user embedding and all 10 million item embeddings, sort, return the top 10. On a single CPU core, this takes approximately 4.2 seconds. With 8 cores and vectorized NumPy operations, it drops to about 0.6 seconds. Still 6x the 100ms serving SLA.

The solution was approximate nearest neighbor (ANN) search using an HNSW index built with Faiss. HNSW doesn't scan all 10 million items - it navigates a hierarchical graph structure to find approximate neighbors. The result: 10 approximate neighbors retrieved in 3 milliseconds, with 95% recall (meaning 95% of the true top-10 items are found). The 3ms is 33x faster than the 100ms budget. The 5% miss rate is invisible to users - the difference between rank 9 and rank 11 in a recommendation list is not perceptible.

This is the trade-off that makes embedding stores possible at scale: approximate is good enough, and approximate is orders of magnitude faster than exact.


Why Exact Search Doesn't Scale

The cost of exact similarity search over NN items with dd-dimensional embeddings is:

Costexact=O(Nd)\text{Cost}_\text{exact} = O(N \cdot d)

For a single query:

  • Compute NN dot products (or cosine similarities)
  • Each dot product: dd multiplications + dd additions
  • Sort NN scores to find top-kk
NNddOperationsApprox time (1 CPU)
100K12812.8M5ms
1M256256M80ms
10M2562.56B800ms
100M51251.2B16s

Above ~1M items, exact search cannot fit within a 100ms serving SLA on commodity hardware. At 10M items, even GPU-accelerated exact search struggles. ANN algorithms trade a small amount of recall for orders-of-magnitude faster query time.


Approximate Nearest Neighbor Algorithms

HNSW - Hierarchical Navigable Small World

HNSW (Malkov & Yashunin, 2016) is the dominant ANN algorithm for in-memory search. It builds a multi-layer graph where each node (embedding) is connected to its approximate neighbors. The top layers have few, long-range connections; the bottom layers have many, short-range connections - like navigating a city by going from highway to street to alley.

Query: start at the entry point in the top layer, greedily navigate toward the query embedding, descend to lower layers for refinement. This is approximately O(logN)O(\log N) per query.

Recall-speed trade-off parameters:

  • M: number of bidirectional connections per node (16–64). Higher M = better recall, more memory
  • ef_construction: beam search width during index build (100–500). Higher = better quality index, slower build
  • ef_search: beam search width during query (10–200). Higher = better recall, slower query
import faiss
import numpy as np

def build_hnsw_index(embeddings: np.ndarray, M: int = 32, ef_construction: int = 200) -> faiss.Index:
"""
Build an HNSW index for approximate nearest neighbor search.

embeddings: (N, d) float32 array
M: connections per node - higher = better recall, more memory
ef_construction: search width during build - higher = better quality
"""
d = embeddings.shape[1]
index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efConstruction = ef_construction
index.add(embeddings.astype(np.float32))
return index


def query_hnsw(index: faiss.Index, query: np.ndarray, k: int = 10, ef_search: int = 50):
"""
Find k approximate nearest neighbors for a query embedding.

Returns (distances, indices) - same as faiss search API.
"""
index.hnsw.efSearch = ef_search
query_2d = query.reshape(1, -1).astype(np.float32)
distances, indices = index.search(query_2d, k)
return distances[0], indices[0]

IVF - Inverted File Index

IVF clusters embeddings into Voronoi cells using k-means. At query time, it searches only the cells closest to the query - reducing the search space from NN to approximately N/ncellsN/n_{\text{cells}}.

def build_ivf_index(embeddings: np.ndarray, n_cells: int = 1000) -> faiss.Index:
"""
Build an IVF (Inverted File) index. Better for very large datasets (>10M).
Requires training on a sample of embeddings before adding all vectors.
"""
d = embeddings.shape[1]
quantizer = faiss.IndexFlatL2(d) # Flat index for cell centroids
index = faiss.IndexIVFFlat(quantizer, d, n_cells)

# Train the index on a sample (or all) of embeddings
index.train(embeddings.astype(np.float32))
index.add(embeddings.astype(np.float32))
return index

LSH - Locality-Sensitive Hashing

LSH hashes similar embeddings to the same bucket with high probability. Simpler than HNSW but lower recall (70–80% vs. 90–95%). Good for large-scale approximate deduplication, not for high-recall recommendation.


The Full Production Embedding Store

This builds a Faiss HNSW index for 1M items and serves it via FastAPI, including index persistence, hot-swap for updates, and latency measurement.

# embedding_store/server.py

import faiss
import numpy as np
import pickle
import time
import threading
import logging
from pathlib import Path
from typing import List, Tuple, Optional, Dict, Any
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

logger = logging.getLogger(__name__)


class EmbeddingIndex:
"""
Thread-safe HNSW embedding index with hot-swap support.
Supports atomic swap from old index to new index without downtime.
"""

def __init__(self, d: int, M: int = 32, ef_construction: int = 200):
self.d = d
self.M = M
self.ef_construction = ef_construction
self._index: Optional[faiss.Index] = None
self._id_map: Optional[np.ndarray] = None # Map internal Faiss IDs to external item IDs
self._lock = threading.RLock()
self.version = 0
self.built_at = 0.0

def build(self, embeddings: np.ndarray, item_ids: List[str]) -> None:
"""
Build a new HNSW index from embeddings.
Thread-safe: builds in a background thread, swaps atomically on completion.
"""
assert embeddings.shape[0] == len(item_ids), "embeddings and item_ids must have same length"
assert embeddings.shape[1] == self.d, f"Expected d={self.d}, got {embeddings.shape[1]}"

logger.info(f"Building HNSW index for {len(item_ids)} items...")
t0 = time.monotonic()

new_index = faiss.IndexHNSWFlat(self.d, self.M)
new_index.hnsw.efConstruction = self.ef_construction
new_index.add(embeddings.astype(np.float32))
new_id_map = np.array(item_ids)

build_time = time.monotonic() - t0
logger.info(f"Index built in {build_time:.1f}s for {len(item_ids)} items")

# Atomic swap
with self._lock:
self._index = new_index
self._id_map = new_id_map
self.version += 1
self.built_at = time.time()

def search(
self,
query: np.ndarray,
k: int = 10,
ef_search: int = 50,
) -> List[Tuple[str, float]]:
"""
Search for k approximate nearest neighbors.
Returns list of (item_id, distance) tuples.
"""
with self._lock:
if self._index is None:
raise RuntimeError("Index not built yet")
index = self._index
id_map = self._id_map

index.hnsw.efSearch = ef_search
query_2d = query.reshape(1, -1).astype(np.float32)

start = time.monotonic()
distances, indices = index.search(query_2d, k)
latency_ms = (time.monotonic() - start) * 1000

logger.debug(f"ANN search k={k}: {latency_ms:.2f}ms")

results = []
for dist, idx in zip(distances[0], indices[0]):
if idx == -1: # Faiss returns -1 for not found
continue
results.append((id_map[idx], float(dist)))

return results

def save(self, path: str) -> None:
"""Persist index and ID map to disk."""
with self._lock:
if self._index is None:
return
faiss.write_index(self._index, f"{path}.index")
np.save(f"{path}.ids.npy", self._id_map)
logger.info(f"Index saved to {path}")

def load(self, path: str) -> None:
"""Load index and ID map from disk."""
index = faiss.read_index(f"{path}.index")
id_map = np.load(f"{path}.ids.npy", allow_pickle=True)
with self._lock:
self._index = index
self._id_map = id_map
self.version += 1
self.built_at = time.time()
logger.info(f"Index loaded from {path}: {len(id_map)} items")


# FastAPI serving layer
app = FastAPI(title="Embedding Store")
item_index = EmbeddingIndex(d=256, M=32, ef_construction=200)


class SearchRequest(BaseModel):
query_embedding: List[float]
k: int = 10
ef_search: int = 50


class SearchResponse(BaseModel):
results: List[Dict[str, Any]]
latency_ms: float
index_version: int


@app.post("/search", response_model=SearchResponse)
def search(request: SearchRequest):
if len(request.query_embedding) != 256:
raise HTTPException(400, f"Expected 256-d embedding, got {len(request.query_embedding)}")

query = np.array(request.query_embedding, dtype=np.float32)
t0 = time.monotonic()

try:
neighbors = item_index.search(query, k=request.k, ef_search=request.ef_search)
except RuntimeError as e:
raise HTTPException(503, str(e))

latency_ms = (time.monotonic() - t0) * 1000
return SearchResponse(
results=[{"item_id": iid, "distance": dist} for iid, dist in neighbors],
latency_ms=latency_ms,
index_version=item_index.version,
)


@app.get("/health")
def health():
return {
"status": "ok" if item_index._index is not None else "not_ready",
"index_version": item_index.version,
"built_at": item_index.built_at,
}

Measuring Recall vs. Latency Trade-off

The ef_search parameter directly controls the recall-latency trade-off. Higher ef_search means the search explores more candidates before returning - better recall, more time.

import faiss
import numpy as np
import time

def benchmark_recall_latency(
n_items: int = 100_000,
d: int = 256,
M: int = 32,
k: int = 10,
n_queries: int = 100,
):
"""
Benchmark recall vs. latency for different ef_search values.
Uses a flat (exact) index as ground truth.
"""
# Generate random embeddings
rng = np.random.default_rng(42)
embeddings = rng.standard_normal((n_items, d)).astype(np.float32)
queries = rng.standard_normal((n_queries, d)).astype(np.float32)

# Build exact index (ground truth)
exact_index = faiss.IndexFlatL2(d)
exact_index.add(embeddings)
_, exact_results = exact_index.search(queries, k)

# Build HNSW index
hnsw_index = faiss.IndexHNSWFlat(d, M)
hnsw_index.hnsw.efConstruction = 200
hnsw_index.add(embeddings)

print(f"{'ef_search':>10} | {'p99 latency (ms)':>18} | {'recall@10':>12}")
print("-" * 50)

for ef in [10, 20, 50, 100, 200]:
hnsw_index.hnsw.efSearch = ef
latencies = []
recalls = []

for i, q in enumerate(queries):
t0 = time.monotonic()
_, hnsw_result = hnsw_index.search(q.reshape(1, -1), k)
latency = (time.monotonic() - t0) * 1000
latencies.append(latency)

# Recall: fraction of true top-k that HNSW found
true_set = set(exact_results[i])
hnsw_set = set(hnsw_result[0])
recalls.append(len(true_set & hnsw_set) / k)

p99 = np.percentile(latencies, 99)
mean_recall = np.mean(recalls)
print(f"{ef:>10} | {p99:>18.2f} | {mean_recall:>12.3f}")


benchmark_recall_latency()

Typical output for 100K items, d=256, M=32:

ef_search | p99 latency (ms) | recall@10
--------------------------------------------------
10 | 0.18 | 0.823
20 | 0.26 | 0.912
50 | 0.44 | 0.963
100 | 0.72 | 0.981
200 | 1.21 | 0.994

For most recommendation use cases, ef_search=50 offers a good balance: 96.3% recall at 0.44ms p99.


pgvector extends PostgreSQL with vector similarity search. It is the right choice when:

  • Your catalog is already in Postgres and you don't want another system
  • Scale is under 10M items (HNSW performance degrades at very large N)
  • You need SQL joins (filter by category, price range) combined with vector similarity
-- Install extension
CREATE EXTENSION vector;

-- Table with embedding column
CREATE TABLE item_embeddings (
item_id TEXT PRIMARY KEY,
category TEXT,
price NUMERIC,
embedding vector(256)
);

-- Build HNSW index
CREATE INDEX ON item_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 200);

-- Query: find 10 nearest neighbors to a query embedding, filtered by category
SELECT item_id, price, 1 - (embedding <=> '[0.12, -0.34, ...]'::vector) AS similarity
FROM item_embeddings
WHERE category = 'electronics'
AND price < 500
ORDER BY embedding <=> '[0.12, -0.34, ...]'::vector
LIMIT 10;
import psycopg2
import numpy as np

conn = psycopg2.connect("postgresql://user:pass@localhost/featuredb")

def insert_embedding(item_id: str, embedding: np.ndarray, category: str, price: float):
with conn.cursor() as cur:
cur.execute(
"INSERT INTO item_embeddings (item_id, category, price, embedding) "
"VALUES (%s, %s, %s, %s) ON CONFLICT (item_id) DO UPDATE SET embedding = EXCLUDED.embedding",
(item_id, category, price, embedding.tolist())
)
conn.commit()

def find_similar_items(
query_embedding: np.ndarray,
category: str = None,
max_price: float = None,
k: int = 10,
) -> list:
conditions = []
params = []

if category:
conditions.append("category = %s")
params.append(category)
if max_price:
conditions.append("price < %s")
params.append(max_price)

where_clause = ("WHERE " + " AND ".join(conditions)) if conditions else ""
query_vec = query_embedding.tolist()

sql = f"""
SELECT item_id, price,
1 - (embedding <=> %s::vector) AS similarity
FROM item_embeddings
{where_clause}
ORDER BY embedding <=> %s::vector
LIMIT %s
"""
params_full = [query_vec] + params + [query_vec, k]

with conn.cursor() as cur:
cur.execute(sql, params_full)
return cur.fetchall()

Embedding Freshness and Index Updates

Product embeddings must be updated as the catalog changes: new items are added, old items are removed, seasonal relevance shifts. Two update patterns:

Full rebuild (daily): rebuild the entire index from all current embeddings. Simple, no incremental complexity. The hot-swap pattern: build the new index in the background, swap atomically when complete. The old index serves queries during the build. Appropriate when the catalog changes less than 10% daily.

Incremental updates (continuous): add new embeddings to an existing index (index.add()), remove stale embeddings. Faiss HNSW supports addition but not deletion - deleted items remain in the index and may appear in results. Workaround: maintain a "deleted set" and filter results post-search.

class IncrementalEmbeddingIndex:
"""HNSW index with soft-delete support."""

def __init__(self, index: faiss.Index, id_map: np.ndarray):
self.index = index
self.id_map = list(id_map)
self.deleted_ids = set()

def add_item(self, item_id: str, embedding: np.ndarray):
self.index.add(embedding.reshape(1, -1).astype(np.float32))
self.id_map.append(item_id)

def remove_item(self, item_id: str):
self.deleted_ids.add(item_id) # Soft delete

def search(self, query: np.ndarray, k: int = 10, overquery: int = 3) -> list:
"""Search with overquery to account for deleted items in results."""
_, indices = self.index.search(query.reshape(1, -1).astype(np.float32), k * overquery)
results = []
for idx in indices[0]:
if idx == -1:
continue
item_id = self.id_map[idx]
if item_id not in self.deleted_ids:
results.append(item_id)
if len(results) == k:
break
return results

Managed Vector Databases

DatabaseANN AlgorithmScaleFilteringManagedBest For
PineconeProprietary1B+Metadata filterYes (cloud-only)Teams that want zero infrastructure
WeaviateHNSW100M+GraphQL, BM25 hybridYes / Self-hostedSemantic search + keyword hybrid
QdrantHNSW100M+Rich filter APIYes / Self-hostedRecommendation, RAG
MilvusHNSW, IVF, DISKANN1B+Structured filterYes / Self-hostedVery large scale
ChromaHNSW (Faiss)10MMetadata filterSelf-hostedDevelopment, small production

Choosing between managed and self-hosted: managed services (Pinecone) are appropriate when the team lacks ML infrastructure expertise and can accept the cost premium. Self-hosted (Faiss + FastAPI, or Qdrant/Milvus on Kubernetes) is appropriate when scale requirements demand custom tuning or when data residency requirements prevent using external services.


Embedding Store Architecture

:::tip Cache ANN Results for Hot Entities For the top 1–5% of users who generate a disproportionate fraction of traffic (power users, bots, popular accounts), cache the ANN search results in Redis with a 5-minute TTL. ANN searches are cheap (3ms) but cache hits eliminate the Faiss index lock contention under very high concurrency. For 10,000 requests per second, caching saves 30 seconds of Faiss CPU per second for hot entities. :::


Interview Q&A

Q: Explain the recall-latency trade-off in approximate nearest neighbor search. How do you choose the right operating point?

The recall-latency trade-off is governed by the search width parameters (ef_search for HNSW, nprobe for IVF). Higher search width examines more candidates before returning, improving the probability that the true top-k items are found (recall), but taking more time. The optimal operating point depends on the use case: for recommendation, 90–95% recall@10 is typically sufficient - users don't notice the difference between item rank 9 and rank 11. For fraud detection using embedding similarity, 99%+ recall may be required. Benchmark at your actual N, d, and k - don't trust generic benchmarks. Find the ef_search that gives acceptable recall, then measure the p99 latency and verify it fits in your latency budget.

Q: How do you update an embedding index when new items are added to the catalog?

Two strategies: (1) Daily full rebuild - rebuild the entire HNSW index from all current embeddings, including new ones, in a background job. Atomically swap the new index into the serving path when the build completes (the hot-swap pattern). Simple, no incremental complexity, but all new items are invisible until the next rebuild. (2) Incremental add - Faiss HNSW supports index.add() to append new embeddings to an existing index. New items become immediately searchable. Limitation: HNSW doesn't support deletion; removed items require a soft-delete set and post-search filtering. Incremental add works well when the catalog grows rapidly and a 24-hour delay for new items is unacceptable.

Q: What is hybrid search, and why is it necessary for most production search systems?

Hybrid search combines semantic similarity (embedding-based) with exact structured filters (price range, category, availability, geographic proximity). Pure semantic search returns items that are conceptually similar to the query, but may include items that are out of stock, wrong category, or too expensive. Structured filters alone miss semantic intent. Hybrid search applies both: use the vector index to find top-K semantically similar items (with some over-querying to account for filter rejection), then apply exact filters to the candidates. In pgvector, this is a WHERE clause combined with ORDER BY embedding distance. In Pinecone and Qdrant, it is a metadata filter parameter on the search call. The filter-then-ANN approach is dangerous: filtering first reduces the search space in ways that can cause the ANN to miss relevant items.

Q: When would you choose Faiss over a managed vector database like Pinecone?

Choose Faiss when: you need maximum control over indexing parameters (M, ef_construction, quantization), you have data residency or security requirements that prevent using an external service, your team has ML infrastructure expertise and can maintain a custom service, or you need to minimize cost at large scale (Pinecone is expensive per vector at billion scale). Choose Pinecone (or another managed service) when: the team doesn't have infrastructure bandwidth, you need multi-region managed availability, or you need to be operational quickly and can accept the cost. The operational cost of maintaining a Faiss service (index builds, serving infrastructure, monitoring, upgrades) is non-trivial - it is a full ML infrastructure component.

Q: A recommendation system has 10M item embeddings. How would you design the index build and serving pipeline?

Build pipeline: generate item embeddings offline (batch job, GPU cluster, writes to S3/Iceberg). Daily index build job: read all 10M embeddings from S3, build Faiss HNSW index (M=32, ef_construction=200, takes ~30 minutes for 10M items), write index file to S3. Serving pipeline: N embedding server instances (e.g., 4 servers for HA), each loads the HNSW index into memory at startup (256d × 10M × 4 bytes = 10GB, fits on a 16GB instance). Each server receives ANN search requests via gRPC. Daily index swap: servers poll S3 for a new index version, build the new index in a background thread, swap atomically. Load balancer distributes queries across servers. For queries: retrieve the user embedding from the feature store (Redis), call any embedding server via gRPC, receive top-100 candidates, apply business rules (dedup, diversity, filter sold-out items), return top-10.

© 2026 EngineersOfAI. All rights reserved.