Metadata Filtering with Vector Search
The Multi-Tenant Disaster
A SaaS company builds their AI-powered document search on Pinecone. Their data model is simple: each customer's documents are tagged with a customer_id metadata field. Every search query includes a filter {"customer_id": {"$eq": "customer_abc"}} to ensure customers only see their own data.
For the first six months, it works perfectly. Then two problems emerge simultaneously.
First: a customer with 500,000 documents reports that search quality has dramatically degraded. Investigation reveals that as their document count grew, the metadata filter was eliminating 99.8% of the HNSW candidates retrieved during search, leaving only a handful of results that did not necessarily contain the most relevant documents. Recall had dropped from 0.91 to 0.43 without anyone noticing.
Second: a security audit identifies a theoretical risk - in a heavily-loaded scenario, the post-filtering step could return zero results, and a fallback to unfiltered search had been accidentally introduced in a code review six weeks prior. For a brief window, cross-tenant document leakage was possible.
Both problems stem from the same root cause: building multi-tenancy on top of metadata filtering without understanding how different databases implement filtering internally and what the performance and correctness guarantees actually are.
Why Filtering Is Non-Trivial
Adding a metadata filter to a vector query sounds like a minor feature. In practice, it fundamentally changes the search algorithm because HNSW graph traversal has no concept of "skip this node." When you traverse the graph toward the nearest neighbor, you visit intermediate nodes. If those nodes do not pass the filter, you cannot simply ignore them - the graph structure was built without regard for your filter.
There are three architectural approaches to solving this, and each has completely different performance characteristics.
Approach 1: Post-Filtering
Post-filtering is the simplest approach and the one most databases used initially.
- Run ANN search normally, retrieving the top $k'$ candidates (where $k' > k$)
- Apply the metadata filter to the $k'$ candidates
- Return the top $k$ after filtering
The problem: If the filter eliminates 95% of candidates, you need 200 candidates from ANN to get 10 results after filtering. Retrieving 200 candidates when you only need 10 requires the ANN algorithm to explore much more of the graph, which means significantly higher latency. Worse: if the filter eliminates 99.5% of candidates and you only retrieve 100 candidates from ANN, you can expect less than one match on average, far fewer than the requested $k$ results. Recall is not just degraded - it is uncontrollably dependent on the filter selectivity.
```python
def post_filter_search_demo(
    index,                        # ANN index
    query_vector: list,
    metadata_filter_fn,           # callable(doc_metadata) -> bool
    k: int = 10,
    oversample_factor: int = 10,  # retrieve k * this many candidates
) -> list:
    """
    Post-filtering: retrieve candidates, then filter.
    oversample_factor must increase as filter becomes more selective.
    """
    # Retrieve more candidates than needed to compensate for filtering
    n_candidates = k * oversample_factor
    candidates = index.search(query_vector, k=n_candidates)

    # Apply filter
    filtered = [c for c in candidates if metadata_filter_fn(c.metadata)]

    # Problem: if filter is very selective, we may still get fewer than k results
    if len(filtered) < k:
        print(f"WARNING: Only {len(filtered)} results after filtering (requested {k})")
        print("Recall is degraded. Increase oversample_factor or use pre-filtering.")

    return filtered[:k]
```
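The oversampling arithmetic behind post-filtering can be checked with a few lines. This is a rough sketch that assumes matching documents are spread evenly among the ANN candidates, which real data rarely guarantees; the helper names `candidates_needed` and `expected_survivors` are made up for illustration.

```python
def candidates_needed(k: int, matching_docs: int, total_docs: int) -> int:
    """Candidates to fetch so that, on average, k of them pass the filter."""
    if matching_docs <= 0:
        raise ValueError("no document passes the filter")
    # Integer ceiling of k * total_docs / matching_docs
    return -(-k * total_docs // matching_docs)


def expected_survivors(n_candidates: int, matching_docs: int, total_docs: int) -> float:
    """Average number of candidates left after the filter is applied."""
    return n_candidates * matching_docs / total_docs


# Filter keeps 5% of the collection (eliminates 95%), k = 10
print(candidates_needed(10, matching_docs=50, total_docs=1000))    # 200

# Filter keeps 0.5% of the collection, only 100 candidates fetched
print(expected_survivors(100, matching_docs=5, total_docs=1000))   # 0.5
```

These are averages only: a query whose true neighbors mostly belong to other tenants can do much worse than the expected value.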
Approach 2: Pre-Filtering (Brute Force on Filtered Subset)
Pre-filtering inverts the order: first identify all documents that pass the filter, then run vector search only within that subset.
- Retrieve all document IDs that pass the metadata filter (using an inverted index on metadata)
- Run ANN search restricted to those IDs
The problem with naive pre-filtering: if the filtered subset is large (e.g., 50% of the collection passes the filter), you are running vector search over a large set, which is expensive. If the subset is small (1% of collection), you must run exact search over that subset - there are not enough vectors to build a meaningful ANN index.
Pre-filtering works well for namespace-based isolation (each customer has their own namespace): there is a dedicated index per namespace with no filtering overhead. This is how Pinecone's original multi-tenancy model works.
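A minimal sketch of naive pre-filtering, assuming an in-memory NumPy matrix of vectors and a plain dictionary as the metadata inverted index. The names `build_inverted_index` and `pre_filter_search` are illustrative, and exact (brute-force) distance computation stands in for the restricted vector search step.

```python
from collections import defaultdict

import numpy as np


def build_inverted_index(metadata: list[dict], field: str = "tenant_id") -> dict:
    """Map each metadata value to the list of document IDs that carry it."""
    index = defaultdict(list)
    for doc_id, meta in enumerate(metadata):
        index[meta[field]].append(doc_id)
    return index


def pre_filter_search(vectors, inverted_index, tenant_id, query, k=10) -> list[int]:
    """Step 1: resolve the filter to IDs. Step 2: exact search inside that subset."""
    ids = np.asarray(inverted_index.get(tenant_id, []), dtype=int)
    if ids.size == 0:
        return []  # nothing passes the filter; never fall back to unfiltered search
    dists = np.linalg.norm(vectors[ids] - query, axis=1)
    return ids[np.argsort(dists)[:k]].tolist()


vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
metadata = [{"tenant_id": t} for t in ["a", "b", "a", "b"]]
inv = build_inverted_index(metadata)
print(pre_filter_search(vectors, inv, "a", np.array([0.9, 0.0]), k=2))
```

The cost of the second step grows linearly with the size of the filtered subset, which is the trade-off described above.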
Approach 3: Integrated Filtering (ACORN)
ACORN (Patel et al., 2024, "ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data") integrates the filter directly into the HNSW traversal. Instead of treating every visited node as a candidate, ACORN only admits nodes that pass the filter while still maintaining graph connectivity for navigation.
The key insight: even if intermediate nodes in the HNSW graph do not pass the filter, they can still be used as navigation waypoints. ACORN uses a two-hop neighborhood: when evaluating node $v$, it checks $v$ and all of $v$'s neighbors for filter compliance, using non-compliant neighbors only for navigation (to reach their own neighbors), not as result candidates.
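This is designed to maintain recall even at 99.9% filter selectivity, with modest overhead compared to unfiltered search. Qdrant implements filter-aware HNSW natively (its documentation calls the approach filterable HNSW) and reports strong results in its own filtered-search benchmarks; note that ANN-Benchmarks itself only covers unfiltered search.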
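The two-hop idea can be illustrated on a toy graph. The sketch below is a deliberately simplified best-first search, not the ACORN paper's algorithm and not Qdrant's implementation: it assumes a single-layer graph given as adjacency lists, and the names `filtered_neighbors` and `filtered_graph_search` are invented for this example.

```python
import heapq

import numpy as np


def filtered_neighbors(node, neighbors, passes):
    """Passing neighbors, plus passing neighbors-of-neighbors reached via non-passing nodes."""
    out = []
    for n in neighbors[node]:
        if passes[n]:
            out.append(n)
        else:
            # Non-passing node: never scored or returned, only used as a waypoint
            out.extend(m for m in neighbors[n] if passes[m])
    return out


def filtered_graph_search(vectors, neighbors, passes, query, entry, k=2, ef=4):
    def dist(i):
        return float(np.linalg.norm(vectors[i] - query))

    visited = {entry}
    frontier = [(dist(entry), entry)]  # min-heap of nodes still to expand
    best = []                          # max-heap (negated distance), passing nodes only
    if passes[entry]:
        best.append((-dist(entry), entry))
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(best) >= ef and d > -best[0][0]:
            break  # closest unexpanded node is worse than the worst kept result
        for m in filtered_neighbors(node, neighbors, passes):
            if m in visited:
                continue
            visited.add(m)
            dm = dist(m)
            if len(best) < ef or dm < -best[0][0]:
                heapq.heappush(frontier, (dm, m))
                heapq.heappush(best, (-dm, m))
                if len(best) > ef:
                    heapq.heappop(best)
    ranked = sorted((-neg_d, n) for neg_d, n in best)
    return [n for _, n in ranked[:k]]


# Ten points on a line, chained 0-1-2-...-9; only even IDs pass the filter
vectors = np.arange(10, dtype=float).reshape(-1, 1)
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
passes = [i % 2 == 0 for i in range(10)]
print(filtered_graph_search(vectors, neighbors, passes, np.array([7.2]), entry=0))
```

In this chain every passing node is separated from the next by a non-passing one, so a search that simply skipped non-passing neighbors would stall at the entry point; hopping through them lets the traversal reach nodes 8 and 6.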
Multi-Tenancy Architecture Patterns
Pattern 1: Separate Collection Per Tenant
Every tenant gets their own collection (or Pinecone index). Filtering is zero-overhead because the index only contains that tenant's documents. This is the cleanest isolation model.
Pros: Perfect isolation, no cross-tenant data risk, each tenant can have different index parameters. Cons: Collection creation overhead, management complexity at 1000+ tenants, cold-start issues for small tenants (ANN index needs minimum document count to be effective).
Right choice when: you have a small number of large tenants (under 100 tenants, each with 100K+ documents).
Pattern 2: Shared Index with Namespace/Partition Filtering
All tenants share one index. Each document has a tenant_id payload field. Every query includes a filter on tenant_id.
Pros: Simple to add new tenants, no per-tenant management overhead. Cons: Filtering performance degrades as number of tenants grows (each query filters out more of the index), requires ACORN-style filtering for good recall.
Right choice when: you have many tenants (1000+) with small per-tenant document counts where separate collections would be wasteful.
Pattern 3: Sharded Index by Tenant Segment
Group tenants into buckets (e.g., by organization size or region). Each bucket gets a collection containing all documents for all tenants in that bucket. Queries include the tenant ID filter within the bucket.
This is a compromise: you have fewer collections than pattern 1 but better filter selectivity than pattern 2.
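One way to route a tenant to its bucket is sketched below. It assumes buckets are assigned by hashing the tenant ID, which is a swapped-in alternative to grouping by organization size or region; the function name, the bucket count, and the `tenants_bucket_` prefix are all hypothetical.

```python
import hashlib


def bucket_collection(
    tenant_id: str,
    n_buckets: int = 16,
    prefix: str = "tenants_bucket_",
) -> str:
    """Deterministically map a tenant to one of n_buckets collection names."""
    # A stable hash is required: Python's built-in hash() is salted per process
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % n_buckets
    return f"{prefix}{bucket:02d}"


collection = bucket_collection("customer_abc")
print(collection)
# The query against this collection must still carry the tenant_id filter.
```

Changing `n_buckets` later remaps most tenants to different collections, so a lookup table from tenant to bucket may be the more practical choice if buckets are expected to be rebalanced.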
```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue,
    PayloadSchemaType,
)
import numpy as np


class MultiTenantVectorStore:
    """
    Multi-tenant vector store using Qdrant with payload filtering.
    Uses ACORN-style filtering for good recall even with selective filters.
    """

    def __init__(self, qdrant_url: str = "http://localhost:6333"):
        self.client = QdrantClient(url=qdrant_url)

    def setup_collection(
        self,
        collection_name: str,
        vector_dim: int = 768,
    ) -> None:
        """Create collection with indexed payload fields for efficient filtering."""
        # NOTE: recreate_collection drops any existing collection with this name.
        # Newer qdrant-client releases deprecate it in favor of
        # collection_exists() + create_collection().
        self.client.recreate_collection(
            collection_name=collection_name,
            vectors_config=VectorParams(
                size=vector_dim,
                distance=Distance.COSINE,
            ),
        )

        # Create payload index on tenant_id for fast pre-filtering
        # This is critical - without it, Qdrant must scan all payloads
        self.client.create_payload_index(
            collection_name=collection_name,
            field_name="tenant_id",
            field_schema=PayloadSchemaType.KEYWORD,
        )
        self.client.create_payload_index(
            collection_name=collection_name,
            field_name="doc_type",
            field_schema=PayloadSchemaType.KEYWORD,
        )
        self.client.create_payload_index(
            collection_name=collection_name,
            field_name="created_at",
            field_schema=PayloadSchemaType.INTEGER,  # unix timestamp for range queries
        )

    def upsert_document(
        self,
        collection_name: str,
        doc_id: int,
        vector: np.ndarray,
        tenant_id: str,
        metadata: dict,
    ) -> None:
        """Insert or update a document with full payload."""
        self.client.upsert(
            collection_name=collection_name,
            points=[
                PointStruct(
                    id=doc_id,
                    vector=vector.tolist(),
                    payload={
                        **metadata,
                        # Set last so a stray "tenant_id" key in metadata cannot override it
                        "tenant_id": tenant_id,
                    },
                )
            ],
            wait=True,  # block until the write has been applied
        )

    def search(
        self,
        collection_name: str,
        query_vector: np.ndarray,
        tenant_id: str,
        doc_type_filter: str | None = None,
        k: int = 10,
    ) -> list:
        """
        Filtered vector search. All queries are tenant-scoped for data isolation.
        """
        must_conditions = [
            FieldCondition(
                key="tenant_id",
                match=MatchValue(value=tenant_id),
            )
        ]
        if doc_type_filter:
            must_conditions.append(
                FieldCondition(
                    key="doc_type",
                    match=MatchValue(value=doc_type_filter),
                )
            )

        # NOTE: newer qdrant-client releases deprecate search() in favor of query_points()
        return self.client.search(
            collection_name=collection_name,
            query_vector=query_vector.tolist(),
            query_filter=Filter(must=must_conditions),
            limit=k,
            with_payload=True,
        )
```
Designing Metadata Schemas for Search Performance
The metadata schema determines which filters are efficient and which cause full collection scans. Design it with your query patterns in mind.
Payload Index Types
| Data Type | Index Type | Use For |
|---|---|---|
| KEYWORD | Hash index | Exact match on categorical fields (tenant_id, doc_type, status) |
| INTEGER | Range index | Numeric comparisons (created_at timestamp, price, score) |
| FLOAT | Range index | Floating point ranges |
| TEXT | Full-text index | Substring/full-text search on content fields |
| GEO | Geospatial index | Geographic bounding box queries |
Critical rule: Always create a payload index on any field you filter by. Without an index, Qdrant (and other vector DBs) must scan all payloads in memory to apply the filter. With an index, filter candidates are resolved by an index lookup instead of a scan: roughly O(log n) for range indexes, and close to constant time for hash-based keyword lookups.
```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue


def analyze_filter_selectivity(
    client: QdrantClient,
    collection_name: str,
    filter_field: str,
    filter_value: str,
) -> dict:
    """
    Measure how selective a filter is before deploying it to production.
    If selectivity > 0.90 (filters out 90%+), validate recall under filtering.
    """
    total_count = client.count(collection_name).count
    filtered_count = client.count(
        collection_name,
        count_filter=Filter(
            must=[FieldCondition(
                key=filter_field,
                match=MatchValue(value=filter_value),
            )]
        ),
    ).count

    # Guard against an empty collection to avoid dividing by zero
    selectivity = 1 - (filtered_count / total_count) if total_count else 0.0

    return {
        "total_documents": total_count,
        "passing_filter": filtered_count,
        "eliminated_pct": selectivity * 100,
        "warning": selectivity > 0.90,
        "recommendation": (
            "HIGH selectivity: validate recall@10 under this filter before production"
            if selectivity > 0.90
            else "Normal selectivity: standard ANN filtering should maintain recall"
        ),
    }
```
Performance Impact of Filter Cardinality
Filter selectivity has a counterintuitive performance characteristic: both very high and very low selectivity can cause problems.
Very high selectivity (>99% filtered out): Only a tiny fraction of vectors are valid candidates. Post-filter approach returns almost nothing. ACORN approach must navigate through many invalid nodes to find valid ones. Both approaches require extra work.
Very low selectivity (<1% filtered out): Almost all vectors pass the filter. The filter provides almost no benefit. The overhead of evaluating the filter on every candidate is wasted.
Sweet spot: Filters that keep 5–50% of the collection. These are effective at reducing the search space while leaving enough candidates for HNSW to navigate to good results.
For multi-tenant applications where per-tenant data is 0.1–10% of the collection, use ACORN-style filtering and validate recall at your actual tenant size distribution.
Production Engineering Notes
Test recall under your actual filter distribution. The correct test is not "what is recall@10 without filters?" but "what is recall@10 for each of my top 10 tenants by document count?" A customer with 1% of total documents will see very different recall than a customer with 30% of total documents.
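A per-tenant recall check can be sketched as follows, assuming the vectors fit in memory so that exact filtered search can serve as ground truth. `exact_filtered_top_k` and `recall_at_k` are illustrative helper names, and `approx_ids` stands for whatever IDs your vector database returned for the same filtered query.

```python
import numpy as np


def exact_filtered_top_k(vectors, passes, query, k):
    """Ground truth: brute-force nearest neighbors among documents that pass the filter."""
    ids = np.flatnonzero(passes)
    dists = np.linalg.norm(vectors[ids] - query, axis=1)
    return ids[np.argsort(dists)[:k]].tolist()


def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true filtered top-k that the approximate search returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 1.0  # nothing to find, so nothing was missed
    return len(truth & set(approx_ids[:k])) / len(truth)


vectors = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
passes = np.array([True, False, True, False, True])  # e.g. tenant_id == "customer_abc"
truth = exact_filtered_top_k(vectors, passes, np.array([1.9]), k=2)

approx_ids = [2, 4]  # hypothetical output of the filtered ANN query
print(truth, recall_at_k(approx_ids, truth, k=2))
```

Running this over a sample of real queries for each large tenant, and for a few small ones, gives the per-tenant recall figures the note above asks for.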
Monitor per-tenant result counts. Alert when any tenant's queries consistently return fewer than the requested results. This is the most reliable early warning of filter-related recall degradation. One missed result is a UX issue; consistently getting 3 results when 10 are requested is a system failure.
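As a starting point, short-result tracking might look like the sketch below. The class name, the rolling window, and both thresholds are assumptions chosen for illustration; in production the counters would normally live in your metrics system rather than in process memory.

```python
from collections import defaultdict, deque


class ShortResultMonitor:
    """Tracks, per tenant, how often a query returned fewer results than requested."""

    def __init__(self, window: int = 100):
        # One rolling window of booleans per tenant: True means "came back short"
        self._history = defaultdict(lambda: deque(maxlen=window))

    def record(self, tenant_id: str, requested: int, returned: int) -> None:
        self._history[tenant_id].append(returned < requested)

    def tenants_to_alert(self, min_queries: int = 20, max_short_ratio: float = 0.2) -> list:
        """Tenants with enough traffic whose short-result ratio exceeds the threshold."""
        flagged = []
        for tenant_id, history in self._history.items():
            if len(history) >= min_queries and sum(history) / len(history) > max_short_ratio:
                flagged.append(tenant_id)
        return sorted(flagged)


monitor = ShortResultMonitor()
for _ in range(20):
    monitor.record("customer_abc", requested=10, returned=3)   # consistently short
    monitor.record("customer_xyz", requested=10, returned=10)  # healthy
print(monitor.tenants_to_alert())
```

The `min_queries` floor keeps a single short response from paging anyone, which matches the distinction above between one missed result and a consistent shortfall.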
Warm payload indexes after large ingestion. After bulk-inserting tens of thousands of documents, payload indexes may need time to optimize. Some vector databases (including Qdrant) do background index optimization. Check index status via the API before declaring the collection ready for production traffic.
Common Mistakes
:::danger Assuming metadata filtering has no performance cost
Every metadata filter adds overhead. A filter on an unindexed field turns into a full collection scan. A highly selective filter on an indexed field still requires ACORN-style traversal to maintain recall. Always create payload indexes on filtered fields, and always benchmark recall and latency with production-realistic filters, not filters on test data.
:::
:::danger Building multi-tenancy without data isolation guarantees
Using a single shared index with metadata filtering for multi-tenancy creates a security dependency on the filter being correctly applied on every query. A single code bug where the tenant_id filter is accidentally omitted exposes all tenants' data to any authenticated user. Defense in depth: (a) apply filters at the database level AND at the application level, (b) add integration tests that verify cross-tenant queries return zero results, (c) consider separate collections for high-security requirements.
:::
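The application-level half of that defense can be sketched as a guard that runs on every result set before it leaves the service. It assumes each hit exposes a `payload` dictionary (as Qdrant's scored points do); `CrossTenantLeak` and `enforce_tenant_scope` are hypothetical names.

```python
from types import SimpleNamespace


class CrossTenantLeak(RuntimeError):
    """Raised when a result belongs to a different tenant than the caller."""


def enforce_tenant_scope(hits, tenant_id: str) -> list:
    """Fail closed: refuse unscoped calls and any hit whose tenant_id does not match."""
    if not tenant_id:
        raise ValueError("tenant_id is required; refusing to return unscoped results")
    for hit in hits:
        payload = getattr(hit, "payload", None) or {}
        if payload.get("tenant_id") != tenant_id:
            raise CrossTenantLeak(f"hit from tenant {payload.get('tenant_id')!r} in results")
    return list(hits)


# Stand-ins for search hits; real hits would come from the vector database client
own = SimpleNamespace(payload={"tenant_id": "customer_abc", "title": "Q3 report"})
foreign = SimpleNamespace(payload={"tenant_id": "customer_xyz", "title": "Payroll"})

print(len(enforce_tenant_scope([own], "customer_abc")))
try:
    enforce_tenant_scope([own, foreign], "customer_abc")
except CrossTenantLeak as exc:
    print("blocked:", exc)
```

An empty result set passes the guard unchanged, so a filtered search that finds nothing returns nothing rather than triggering any fallback to an unfiltered query.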
:::warning Using high-cardinality fields as filters without indexes
Filtering on a field like user_id with 1 million distinct values without a keyword payload index causes a full collection scan on every query. Always check that payload indexes exist for all frequently filtered fields. Use client.get_collection() to inspect index status.
:::
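A small startup check along these lines can catch a missing index before it reaches production. The sketch assumes the object returned by `client.get_collection()` exposes a `payload_schema` mapping keyed by indexed field name, as qdrant-client's collection info does; `missing_payload_indexes` is an illustrative helper, and the stub client exists only so the example runs without a server.

```python
from types import SimpleNamespace


def missing_payload_indexes(client, collection_name: str, required_fields) -> list:
    """Return the required filter fields that have no payload index on the collection."""
    info = client.get_collection(collection_name)
    indexed = set(info.payload_schema or {})
    return sorted(set(required_fields) - indexed)


# Stub standing in for a real client: only tenant_id is indexed here
stub_client = SimpleNamespace(
    get_collection=lambda name: SimpleNamespace(payload_schema={"tenant_id": object()})
)
print(missing_payload_indexes(stub_client, "docs", ["tenant_id", "user_id"]))
```

Failing deployment when the returned list is non-empty is one way to enforce the rule that every filtered field is indexed.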
:::tip Use namespace sharding for the largest tenants
For a SaaS application where 80% of data belongs to your top 10 enterprise customers, give those customers dedicated collections and use the shared filtered approach for the long tail. This gives large customers maximum performance (zero filtering overhead) while keeping management complexity manageable for small tenants.
:::
Interview Questions
Q1: Explain the difference between pre-filtering and post-filtering in vector search and when each is appropriate.
Post-filtering retrieves ANN candidates first, then applies the metadata filter. It is simple to implement and adds no overhead to the ANN algorithm, but recall degrades proportionally to filter selectivity - a 99% selective filter requires retrieving 100× more candidates than needed. Pre-filtering identifies valid documents via metadata indexes first, then runs vector search only within that subset. It maintains recall but is expensive when the valid subset is large (requires searching a large space) and problematic when the subset is very small (not enough documents for ANN to work effectively). Pre-filtering works best for namespace isolation where the entire collection IS the valid set for that tenant.
Q2: What is the ACORN algorithm and why does it matter for production vector databases?
ACORN integrates metadata filtering directly into HNSW graph traversal. Standard HNSW traversal is unaware of the filter: it visits and scores nodes whether or not they pass it. ACORN uses a two-hop neighborhood strategy: when evaluating a node, it checks the node and its graph neighbors for filter compliance. Non-compliant nodes are used only for navigation (to reach other parts of the graph) but not returned as results. This is designed to maintain high recall even when filters are extremely selective (99%+ eliminated), with modest overhead compared to unfiltered HNSW. It matters because without ACORN-style filtering, production systems with multi-tenant data isolation see dramatic recall degradation as tenant dataset size decreases relative to the total collection size.
Q3: You are building a SaaS application with 10,000 customers and need vector search with per-customer data isolation. What architecture do you choose?
For 10,000 customers, separate collections per customer is too expensive to manage. Use a shared collection with payload-indexed customer_id field and ACORN-style filtering (Qdrant). Create a payload index on customer_id as a keyword field - this enables O(log n) filter candidate resolution rather than full scan. Every query must include the customer_id filter at the database query level, plus validation at the application level (defense in depth). Monitor per-customer recall@10 weekly and alert when any customer's average drops below 0.88. For customers with extremely small document counts (under 1000 documents), supplement with exact search as fallback since ANN does not provide meaningful speedup at that scale anyway.
Q4: A customer reports that their search quality degraded significantly after their document count grew from 50K to 500K. What happened?
The tenant's share of the collection, and with it the filter's selectivity, shifted while the search parameters stayed fixed. At 50K customer documents in a collection of, say, 2M, the customer holds 2.5% of the collection, so the filter eliminates 97.5%. At 500K documents in a collection that has grown to 10M, the customer holds 5%, so the filter still eliminates 95% and the problem has not gone away. If using post-filtering: at a 2.5% pass rate the system must retrieve about 40K candidates to get 1,000 that pass the filter, which is extremely slow, and the HNSW oversampling factor was never scaled accordingly. If using pre-filtering without ACORN: the "valid subset" of 500K vectors requires its own high-quality ANN graph, which may not exist. The fix: switch to ACORN-style integrated filtering, validate recall@10 for this customer's document count range, and set up alerts on result counts per tenant.
Q5: What payload fields should always be indexed in a production vector database, and which should not be?
Always index: fields used in must or should filter conditions in every query (e.g., tenant_id, doc_type, status), fields used in range queries (timestamps, prices), fields used in sorting. Never index: fields that are never queried directly (full text content, raw HTML, binary data), fields with very low cardinality where full scan is just as fast (boolean flags where 50% of documents have each value), fields used only in the response payload but not for filtering. Fields with very high cardinality (e.g., embedding IDs) can be indexed if you query them by equality, but typically are looked up by vector ID directly rather than via payload filter.
