Design: Visual Search - Embedding Models and Nearest Neighbor Search

Reading time: ~22 min | Interview relevance: High | Roles: MLE

The Real Interview Moment

"Design a visual search system where users take a photo and find similar products." You describe extracting features with a CNN. The interviewer asks: "You have 100M product images. How do you search through them in under 100ms? A brute-force comparison over 100M embeddings takes 10 seconds."

Visual search tests your understanding of embedding spaces, approximate nearest neighbor (ANN) algorithms, and the trade-off between search accuracy and latency at scale.

What You Will Master

Image embedding models for visual similarity
ANN algorithms: HNSW, IVF, product quantization
Cross-modal search: image → text, text → image
Indexing and serving at 100M+ scale
Relevance feedback and online learning

The Complete Design

Step 1: Requirements (5 min)

Functional requirements:

Upload a photo → find visually similar products
Text query → find matching product images
Filter results by category, price, availability
100M product images, 10M searches/day

Non-functional requirements:

Latency: <200ms end-to-end (including image processing)
Relevance: Top-5 results are relevant 80%+ of the time
Freshness: New products searchable within 1 hour
Index size: Must fit in memory (or close to it) for speed
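
A quick back-of-envelope check helps turn these requirements into numbers. The sketch below is illustrative only: the 3x peak factor and the 256-dimensional float32 embeddings are assumptions, not measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(searches_per_day: int) -> float:
    """Average queries per second if traffic were spread evenly over the day."""
    return searches_per_day / SECONDS_PER_DAY


def raw_index_gb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Size of uncompressed embeddings in GB (1 GB = 1e9 bytes)."""
    return num_vectors * dim * bytes_per_value / 1e9


avg_qps = average_qps(10_000_000)  # about 116 queries per second on average

# Assumption: a 3x peak-to-average ratio; replace with measured traffic.
peak_qps = 3 * avg_qps

# Assumption: 256-dimensional float32 embeddings.
index_gb = raw_index_gb(100_000_000, 256)  # about 102 GB uncompressed
```

The takeaway for the interview: query volume is modest, but the uncompressed index is too large for a single commodity machine, so memory is the constraint that drives the index choice.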

Step 2: Problem Formulation (5 min)

ML problem type: Metric learning + approximate nearest neighbor search.

The core idea: Map images (and optionally text) into a shared embedding space where similar items are close together. Then, given a query image, find the nearest neighbors in that space.

Visual Search Pipeline - Query Image → Encoder → Embedding → ANN Index → Top K Results

Step 3: Embedding Model (8 min)

Model Options

Model	Approach	Embedding Dim	Best For
ResNet + Contrastive Loss	Train on product image pairs	256-512	Visual similarity only
CLIP	Pre-trained image-text alignment	512-768	Cross-modal (image ↔ text)
DINOv2	Self-supervised vision transformer	768	General visual features
Fine-tuned CLIP	CLIP + fine-tune on product data	512	Best for product search

Recommendation: Start with CLIP (zero-shot cross-modal search), then fine-tune on your product catalog with triplet or contrastive loss.

Training for Visual Similarity

Triplet loss: Given an anchor image, a positive (similar product), and a negative (different product):

Loss = max(0, d(anchor, positive) - d(anchor, negative) + margin)
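
As a minimal sketch, the formula above can be written in plain Python. Euclidean distance for d and a margin of 0.2 are assumptions chosen for illustration; a real training loop would use a framework's batched loss.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]        # distance 1.0 from the anchor
far_negative = [3.0, 4.0]    # distance 5.0: already well separated
near_negative = [0.0, 1.1]   # distance 1.1: inside the margin

easy_loss = triplet_loss(anchor, positive, far_negative)
hard_loss = triplet_loss(anchor, positive, near_negative)
```

The far negative contributes zero loss (and so zero gradient), while the near negative produces a positive loss. That is the motivation for the mining strategy described next.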

Hard negative mining: The most important training technique - select negatives that are close to the anchor but from a different category. A black dress vs. a black jacket is a harder negative than a black dress vs. a red car.

Common Trap

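Don't rely on random negatives for training - most are trivially different from the anchor, so the triplet loss is already zero and the model learns little from them. Mine hard or semi-hard negatives instead. Semi-hard negatives (farther from the anchor than the positive, but still inside the margin) usually give the most useful training signal. In the interview, mentioning hard negative mining shows you understand metric learning beyond the textbook.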
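
One common way to make "hard" and "semi-hard" precise is by comparing distances within a triplet. The sketch below assumes Euclidean distance and a margin of 0.2; the function names are hypothetical and only illustrate the selection rule.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_negative(d_ap, d_an, margin):
    """Label a negative by its distance relative to the positive."""
    if d_an <= d_ap:
        return "hard"        # negative is at least as close as the positive
    if d_an < d_ap + margin:
        return "semi-hard"   # farther than the positive, but inside the margin
    return "easy"            # loss is already zero for this triplet


def pick_semi_hard(anchor, positive, candidates, margin=0.2):
    """Return the closest semi-hard negative, or None if there is none."""
    d_ap = euclidean(anchor, positive)
    scored = [(euclidean(anchor, n), n) for n in candidates]
    semi_hard = [(d, n) for d, n in scored if d_ap < d < d_ap + margin]
    if not semi_hard:
        return None
    return min(semi_hard, key=lambda pair: pair[0])[1]


anchor, positive = [0.0, 0.0], [1.0, 0.0]
candidates = [[0.5, 0.0], [1.1, 0.0], [1.15, 0.0], [5.0, 0.0]]
chosen = pick_semi_hard(anchor, positive, candidates)
```

In practice the candidates would come from the current batch or from an ANN lookup over the catalog, restricted to items from a different category than the anchor.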

Step 4: ANN Indexing (8 min)

Why Not Brute Force?

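100M vectors × 256 dimensions × 4 bytes (float32) ≈ 102 GB of raw vectors. Brute-force cosine similarity costs O(N × D) per query: at 100M vectors that is about 25.6 billion multiply-adds, on the order of 10 seconds per query. The ANN stage needs to answer in under 10 ms.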
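
For reference, exact (brute-force) search is only a few lines. The sketch below uses plain Python on a toy dataset and assumes non-zero vectors. It is the baseline that ANN results are measured against, not something to run at 100M scale.

```python
import heapq
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def brute_force_top_k(query, vectors, k):
    """Exact top-k by cosine similarity: touches every vector, O(N x D)."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(v))), idx)
        for idx, v in enumerate(vectors)
    )
    return heapq.nlargest(k, scored)  # list of (similarity, index)


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
top = brute_force_top_k([1.0, 0.1], vectors, k=2)
top_ids = [idx for _, idx in top]

# Work per query at the scale in this design: N x D multiply-adds.
multiply_adds = 100_000_000 * 256
```

Keeping a brute-force implementation around is still useful: it provides the ground truth needed to measure the recall of any approximate index.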

ANN Algorithms

Algorithm	How It Works	Recall@10	Latency	Memory (vs. raw float32 vectors)
HNSW	Hierarchical graph-based search	99%	1ms	100%+ (full vectors plus graph links)
IVF-PQ	Cluster + product quantization	90-95%	0.5ms	10-25% of vectors
ScaNN	Anisotropic quantization	95%	0.3ms	15% of vectors
FAISS (IVF-HNSW-PQ)	Hybrid approach	97%	1ms	20% of vectors

Key trade-off: Recall vs. latency vs. memory.

HNSW: Best recall, highest memory
IVF-PQ: Most memory-efficient, lower recall
Hybrid: Best balance
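
The memory column follows from simple arithmetic on product quantization code sizes. The sketch below assumes 256-dimensional float32 vectors and 8-bit codes. The sub-vector counts are hypothetical settings, and the estimate ignores codebooks and inverted-list overhead.

```python
def raw_bytes(dim: int, bytes_per_value: int = 4) -> int:
    """Bytes for one uncompressed float32 vector."""
    return dim * bytes_per_value


def pq_bytes(num_subvectors: int, bits_per_code: int = 8) -> float:
    """Bytes for one PQ-encoded vector: one code per sub-vector."""
    return num_subvectors * bits_per_code / 8


def compression_ratio(dim: int, num_subvectors: int, bits_per_code: int = 8) -> float:
    """Compressed size as a fraction of the raw vector size."""
    return pq_bytes(num_subvectors, bits_per_code) / raw_bytes(dim)


# Hypothetical settings: split each vector into 64, 128, or 256 sub-vectors.
ratios = {m: compression_ratio(256, m) for m in (64, 128, 256)}

# Code storage for 100M vectors at 128 sub-vectors, in GB.
codes_gb = 100_000_000 * pq_bytes(128) / 1e9
```

Fewer sub-vectors means less memory but coarser distance estimates, which is where the recall loss in the table comes from.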

Recommendation: HNSW for <10M vectors (fits in memory). IVF-PQ or ScaNN for 100M+ vectors.
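
To make the "cluster" half of IVF-PQ concrete, here is a toy inverted-file index in plain Python. It is a sketch under simplifying assumptions: centroids are random samples rather than k-means centers, vectors are stored uncompressed (no product quantization), and the class name TinyIVF is made up for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class TinyIVF:
    def __init__(self, vectors, nlist, seed=0):
        self.vectors = vectors
        # Assumption: sample centroids at random; real IVF trains them with k-means.
        self.centroids = random.Random(seed).sample(vectors, nlist)
        self.lists = [[] for _ in range(nlist)]
        for idx, v in enumerate(vectors):
            self.lists[self._nearest_centroids(v, 1)[0]].append(idx)

    def _nearest_centroids(self, v, n):
        order = sorted(
            range(len(self.centroids)),
            key=lambda c: sq_dist(v, self.centroids[c]),
        )
        return order[:n]

    def search(self, query, k, nprobe):
        """Scan only the nprobe closest clusters instead of every vector."""
        candidates = [
            idx
            for c in self._nearest_centroids(query, nprobe)
            for idx in self.lists[c]
        ]
        candidates.sort(key=lambda idx: sq_dist(query, self.vectors[idx]))
        return candidates[:k]


rng = random.Random(42)
data = [[rng.random() for _ in range(8)] for _ in range(200)]
index = TinyIVF(data, nlist=10)
query = [rng.random() for _ in range(8)]

approx = index.search(query, k=5, nprobe=2)    # fast, may miss neighbors
exhaustive = index.search(query, k=5, nprobe=10)  # probes every cluster
```

The nprobe parameter is the recall/latency dial: probing more clusters scans more vectors and recovers more true neighbors, and probing all of them degenerates to exact search.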

Vector Database Options

Database	Strengths	Deployment
FAISS	Fastest, most flexible, open-source	Self-hosted (library, not a service)
Qdrant	Filtering support, easy deployment	Self-hosted or cloud
Pinecone	Fully managed, easy to use	Cloud only
Milvus	Distributed, large-scale	Self-hosted (managed option: Zilliz Cloud)

Step 5: Serving (8 min)

Visual Search Serving - Upload → Preprocess (5ms) → Encode (20ms) → ANN Search (5ms) → Filter → Re-Rank → Results
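
The Filter and Re-Rank stages of this pipeline could be sketched as follows. This assumes post-filtering: the ANN index is asked for more candidates than needed, metadata filters are applied afterwards, and survivors are re-ordered by an exact score. The metadata fields and function name are hypothetical.

```python
def filter_and_rerank(candidate_ids, metadata, exact_score, k,
                      category=None, max_price=None, in_stock_only=True):
    """Apply metadata filters to ANN candidates, then re-rank by exact score."""
    kept = []
    for pid in candidate_ids:
        item = metadata[pid]
        if category is not None and item["category"] != category:
            continue
        if max_price is not None and item["price"] > max_price:
            continue
        if in_stock_only and not item["in_stock"]:
            continue
        kept.append(pid)
    kept.sort(key=exact_score, reverse=True)  # highest score first
    return kept[:k]


# Hypothetical catalog metadata and exact similarity scores.
metadata = {
    "p1": {"category": "dress", "price": 40.0, "in_stock": True},
    "p2": {"category": "dress", "price": 95.0, "in_stock": True},
    "p3": {"category": "jacket", "price": 60.0, "in_stock": True},
    "p4": {"category": "dress", "price": 35.0, "in_stock": False},
    "p5": {"category": "dress", "price": 55.0, "in_stock": True},
}
scores = {"p1": 0.81, "p2": 0.93, "p3": 0.90, "p4": 0.88, "p5": 0.86}

results = filter_and_rerank(
    ["p1", "p2", "p3", "p4", "p5"], metadata, scores.get,
    k=2, category="dress", max_price=60.0,
)
```

Post-filtering is simple but can return fewer than k results when filters are selective, which is why over-fetching from the ANN index (or using a database with native filtering) matters.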

Cross-Modal Search

With CLIP, you get image-to-text and text-to-image search for free:

Image query: Encode image → search image embedding index
Text query: Encode text → search same image embedding index (shared space)
Hybrid query: "Red dress like this" (text + image) → combine embeddings
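
For the hybrid case, one simple way to combine embeddings is a weighted average of the normalized image and text vectors, re-normalized afterwards. This is an assumption for illustration, not the only option. The text_weight parameter is hypothetical and would need tuning.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def combine_embeddings(image_emb, text_emb, text_weight=0.5):
    """Blend image and text embeddings from the same shared space.

    Note: exactly opposite vectors with equal weights would cancel to zero
    and fail to normalize; that edge case is not handled in this sketch.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    mixed = [(1 - text_weight) * a + text_weight * b for a, b in zip(img, txt)]
    return l2_normalize(mixed)


# Toy 2-D stand-ins for real encoder outputs.
image_emb = [2.0, 0.0]   # "like this" photo
text_emb = [0.0, 3.0]    # "red dress"
hybrid = combine_embeddings(image_emb, text_emb, text_weight=0.5)
image_only = combine_embeddings(image_emb, text_emb, text_weight=0.0)
```

The combined vector is then used as the query against the same image embedding index, so no additional index is needed for hybrid queries.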

Index Updates

Challenge	Solution
New products	Add to index in real-time (HNSW supports incremental inserts)
Updated images	Re-embed and update vector
Deleted products	Remove from index + filter at query time
Full re-index	Nightly batch rebuild with latest model
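
The update strategies in this table can be sketched with a small wrapper that tracks deletions as tombstones and compacts them during a rebuild. The brute-force dot-product search here is a stand-in for a real ANN index, and the class and method names are hypothetical.

```python
class UpdatableIndex:
    def __init__(self):
        self.vectors = {}        # product_id -> embedding
        self.tombstones = set()  # deleted ids, filtered out at query time

    def upsert(self, product_id, embedding):
        """Add a new product or replace the vector of an updated one."""
        self.vectors[product_id] = embedding
        self.tombstones.discard(product_id)

    def delete(self, product_id):
        """Soft delete: cheap now, physically removed at the next rebuild."""
        self.tombstones.add(product_id)

    def search(self, query, k):
        scored = [
            (sum(q * x for q, x in zip(query, vec)), pid)
            for pid, vec in self.vectors.items()
            if pid not in self.tombstones
        ]
        scored.sort(reverse=True)
        return [pid for _, pid in scored[:k]]

    def rebuild(self):
        """Batch compaction: drop tombstoned vectors for good."""
        for pid in self.tombstones:
            self.vectors.pop(pid, None)
        self.tombstones.clear()


index = UpdatableIndex()
index.upsert("a", [1.0, 0.0])
index.upsert("b", [0.9, 0.1])
index.upsert("c", [0.0, 1.0])
index.delete("b")
hits = index.search([1.0, 0.0], k=2)
index.rebuild()
```

Soft deletes keep query results correct immediately, while the periodic rebuild reclaims memory and is also the natural point to re-embed the catalog with a newer model.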

Step 6: Evaluation (5 min)

Offline Metrics

Metric	What It Measures
Recall@K	% of all relevant items that appear in the top K results
Precision@K	% of top K results that are relevant
MRR	Mean over queries of 1/rank of the first relevant result
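
These three metrics are short enough to write out directly. The sketch below assumes each query comes with a ranked list of result IDs and a non-empty set of relevant IDs; the sample data is made up.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return len(set(ranked[:k]) & relevant) / k


def mean_reciprocal_rank(queries):
    """Mean of 1/rank of the first relevant result; 0 if none is found."""
    total = 0.0
    for ranked, relevant in queries:
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)


ranked = ["a", "b", "c", "d"]
relevant = {"b", "d", "z"}

r_at_2 = recall_at_k(ranked, relevant, 2)
p_at_2 = precision_at_k(ranked, relevant, 2)
mrr = mean_reciprocal_rank([(ranked, relevant), (["x", "y"], {"x"})])
```

Note that the same functions measure two different things depending on the ground truth: against human relevance labels they score the embedding model, and against brute-force neighbors they score the ANN index.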

Online Metrics

Click-through rate: Do users click on search results?
Purchase-through rate: Do users buy from visual search?
Query abandonment: Do users give up after seeing results?

Practice Problems

Problem 1: Duplicate Product Detection

Direction

Sellers on your marketplace upload the same product with different photos. How do you detect duplicates?

Key Insight

Near-duplicate detection: Use image embeddings + a similarity threshold. If two products have embedding cosine similarity > 0.95, flag them as potential duplicates. But photos of the same product taken from different angles can fall below that threshold, so image similarity alone misses duplicates - combine it with text similarity (title, description) and structured features (price, brand). Use a classifier: given two products, predict P(duplicate) from image similarity, text similarity, and feature overlap.
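
A minimal sketch of such a pairwise classifier is a logistic model over the similarity features. The weights and bias below are made-up placeholders for illustration; in practice they would be learned from labeled duplicate and non-duplicate pairs, and prices are assumed to be positive.

```python
import math

# Hypothetical weights for illustration only; a real model would learn these.
WEIGHTS = {"image_sim": 6.0, "text_sim": 4.0, "same_brand": 1.5, "price_closeness": 1.0}
BIAS = -8.0


def price_closeness(price_a, price_b):
    """1.0 for identical prices, approaching 0 as they diverge."""
    return min(price_a, price_b) / max(price_a, price_b)


def duplicate_probability(image_sim, text_sim, same_brand, price_a, price_b):
    """P(duplicate) from image, text, and structured-feature agreement."""
    z = (
        BIAS
        + WEIGHTS["image_sim"] * image_sim
        + WEIGHTS["text_sim"] * text_sim
        + WEIGHTS["same_brand"] * (1.0 if same_brand else 0.0)
        + WEIGHTS["price_closeness"] * price_closeness(price_a, price_b)
    )
    return 1.0 / (1.0 + math.exp(-z))


# Same image similarity, very different supporting evidence.
likely_dup = duplicate_probability(0.97, 0.90, True, 20.0, 21.0)
unlikely_dup = duplicate_probability(0.97, 0.10, False, 20.0, 80.0)
```

The ANN index still does the heavy lifting: it proposes candidate pairs cheaply, and the classifier only scores those candidates rather than all pairs in the catalog.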

Problem 2: Fine-Grained Fashion Search

Direction

A user uploads a photo of a red floral dress and wants to find similar dresses - same style but different colors/patterns are fine. How do you handle attributes?

Key Insight

Disentangled embeddings: Train the model to separate style from color/pattern. Use attribute-aware contrastive learning: same-style-different-color pairs are positives. At search time, match on the style component of the embedding while allowing variation in color/pattern. Alternatively: extract attributes (category, style, sleeve length, neckline) and allow users to filter results by specific attributes.
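
If training does produce a disentangled embedding, search can compare only the style part. The sketch below assumes, purely for illustration, that the first 4 dimensions encode style and the rest encode color/pattern; the layout and vectors are hypothetical.

```python
import math

# Hypothetical layout: first 4 dims = style, remaining dims = color/pattern.
STYLE_DIMS = 4


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def style_similarity(a, b, style_dims=STYLE_DIMS):
    """Compare only the style sub-vector, ignoring color/pattern dims."""
    return cosine(a[:style_dims], b[:style_dims])


# Same cut and silhouette, different color/pattern components.
red_floral_dress = [0.9, 0.1, 0.3, 0.2, 1.0, 0.0]
blue_plain_dress = [0.9, 0.1, 0.3, 0.2, 0.0, 1.0]

full_sim = cosine(red_floral_dress, blue_plain_dress)
style_sim = style_similarity(red_floral_dress, blue_plain_dress)
```

Here the full-vector similarity is dragged down by the color dimensions, while the style-only similarity stays at its maximum, which is the behavior the user in this problem wants.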

Interview Cheat Sheet

Question Pattern	Framework	Key Phrases
"Design visual search"	Embed → index → search	"CLIP embeddings, HNSW/FAISS index, sub-10ms ANN search"
"How do you scale to 100M?"	ANN algorithms	"Product quantization reduces memory 4-10x, HNSW gives 99% recall at 1ms"
"Image + text search?"	Cross-modal embeddings	"CLIP maps images and text to shared space - search either modality"

Spaced Repetition Checkpoints

Day 0: Explain the visual search pipeline from memory. What's the embedding → ANN → results flow?
Day 3: Compare HNSW vs. IVF-PQ. When would you use each?
Day 7: Design visual search for a furniture marketplace in 45 minutes.
Day 14: Explain hard negative mining and why it matters for metric learning.
Day 21: Mock interview with follow-ups on cross-modal search and scaling to 1B images.

What's Next

Anomaly Detection - Unsupervised methods for detecting unusual patterns
Machine Translation - Sequence-to-sequence at scale
