Design: Visual Search - Embedding Models and Nearest Neighbor Search

Reading time: ~22 min | Interview relevance: High | Roles: MLE

The Real Interview Moment

"Design a visual search system where users take a photo and find similar products." You describe extracting features with a CNN. The interviewer asks: "You have 100M product images. How do you search through them in under 100ms? A brute-force comparison over 100M embeddings takes 10 seconds."

Visual search tests your understanding of embedding spaces, approximate nearest neighbor (ANN) algorithms, and the trade-off between search accuracy and latency at scale.

What You Will Master

  • Image embedding models for visual similarity
  • ANN algorithms: HNSW, IVF, product quantization
  • Cross-modal search: image → text, text → image
  • Indexing and serving at 100M+ scale
  • Relevance feedback and online learning

The Complete Design

Step 1: Requirements (5 min)

Functional requirements:

  • Upload a photo → find visually similar products
  • Text query → find matching product images
  • Filter results by category, price, availability
  • 100M product images, 10M searches/day

Non-functional requirements:

  • Latency: <200ms end-to-end (including image processing)
  • Relevance: Top-5 results are relevant 80%+ of the time
  • Freshness: New products searchable within 1 hour
  • Index size: Must fit in memory (or close to it) for speed

Step 2: Problem Formulation (5 min)

ML problem type: Metric learning + approximate nearest neighbor search.

The core idea: Map images (and optionally text) into a shared embedding space where similar items are close together. Then, given a query image, find the nearest neighbors in that space.

Visual Search Pipeline - Query Image → Encoder → Embedding → ANN Index → Top K Results
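
As a baseline, the pipeline can be sketched with an exact (brute-force) scan standing in for the ANN index. This is a minimal sketch under stated assumptions: the three-dimensional embeddings and product IDs below are made up for illustration, and in a real system the vectors would come from the encoder.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_similar(query_emb, catalog, k=2):
    """Exact search: score every catalog embedding, keep the k best IDs."""
    return heapq.nlargest(k, catalog, key=lambda pid: cosine(query_emb, catalog[pid]))


# Hypothetical catalog: product_id -> embedding (toy 3-d vectors).
catalog = {
    "red_dress": [0.9, 0.1, 0.0],
    "pink_dress": [0.8, 0.3, 0.0],
    "sofa": [0.0, 0.1, 0.9],
}

query_embedding = [1.0, 0.1, 0.0]  # stands in for encoder(query_image)
results = top_k_similar(query_embedding, catalog, k=2)
print(results)
```

The exact scan is the reference that an ANN index approximates: it returns the true nearest neighbors but touches every vector on every query.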

Step 3: Embedding Model (8 min)

Model Options

Model | Approach | Embedding Dim | Best For
ResNet + Contrastive Loss | Train on product image pairs | 256-512 | Visual similarity only
CLIP | Pre-trained image-text alignment | 512-768 | Cross-modal (image ↔ text)
DINO v2 | Self-supervised vision transformer | 768 | General visual features
Fine-tuned CLIP | CLIP + fine-tune on product data | 512 | Best for product search

Recommendation: Start with CLIP (zero-shot cross-modal search), then fine-tune on your product catalog with triplet or contrastive loss.

Training for Visual Similarity

Triplet loss: Given an anchor image, a positive (similar product), and a negative (different product):

Loss = max(0, d(anchor, positive) - d(anchor, negative) + margin)
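
The formula can be checked on a toy triplet. This is a minimal sketch, assuming Euclidean distance for d and a margin of 0.2; both are illustrative choices, and the 2-d vectors are invented.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin) for one triplet."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]
positive = [0.3, 0.4]       # distance 0.5 from the anchor
easy_negative = [3.0, 4.0]  # distance 5.0: far outside the margin
hard_negative = [0.0, 0.6]  # distance 0.6: inside the margin

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
print(easy, round(hard, 3))
```

The easy negative yields zero loss (and therefore zero gradient), while the hard negative yields a positive loss, which is why the choice of negatives matters.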

Hard negative mining: The most important training technique - select negatives that are close to the anchor but from a different category. A black dress vs. a black jacket is a harder negative than a black dress vs. a red car.

Common Trap

Don't rely on random negatives for training: most are trivially different from the anchor, already satisfy the margin, and contribute zero loss. Semi-hard negatives (farther from the anchor than the positive, but still inside the margin) typically give the most useful training signal. In the interview, mentioning hard negative mining shows you understand metric learning beyond the textbook.
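
One way to select negatives is sketched below. It assumes the usual definition of a semi-hard negative (farther from the anchor than the positive, but by less than the margin) and falls back to the closest negative when none qualifies. The function name and the fallback rule are illustrative choices, not a prescribed recipe.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pick_semi_hard_negative(anchor, positive, negatives, margin=0.2):
    """Return the index of the closest semi-hard negative.

    Semi-hard means d(a, p) < d(a, n) < d(a, p) + margin.
    If no candidate qualifies, fall back to the closest negative overall.
    """
    d_pos = euclidean(anchor, positive)
    dists = [euclidean(anchor, n) for n in negatives]
    semi_hard = [i for i, d in enumerate(dists) if d_pos < d < d_pos + margin]
    pool = semi_hard if semi_hard else range(len(negatives))
    return min(pool, key=lambda i: dists[i])


anchor = [0.0, 0.0]
positive = [0.5, 0.0]  # d(a, p) = 0.5
negatives = [
    [5.0, 0.0],  # easy: far beyond the margin
    [0.6, 0.0],  # semi-hard: between 0.5 and 0.7
    [0.3, 0.0],  # closer than the positive
]
chosen = pick_semi_hard_negative(anchor, positive, negatives)
fallback = pick_semi_hard_negative(anchor, positive, [[5.0, 0.0], [3.0, 0.0]])
print(chosen, fallback)
```

In practice the candidate negatives usually come from the current training batch, so the selection is cheap relative to the forward pass.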

Step 4: ANN Indexing (8 min)

Why Not Brute Force?

100M vectors × 256 dimensions × 4 bytes (float32) ≈ 102 GB of raw embeddings. Brute-force cosine similarity costs O(N × D) per query, which at 100M vectors works out to roughly 10 seconds per query. The ANN stage needs to answer in <10ms.
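
The arithmetic behind those numbers, as a back-of-the-envelope sketch. The 10 GB/s effective memory bandwidth is an assumption chosen to illustrate the order of magnitude; real hardware varies widely.

```python
NUM_VECTORS = 100_000_000
DIM = 256
BYTES_PER_FLOAT32 = 4

# Raw size of the embedding matrix.
index_bytes = NUM_VECTORS * DIM * BYTES_PER_FLOAT32
index_gb = index_bytes / 1e9

# Assumption: a full scan is limited by ~10 GB/s of effective memory bandwidth.
ASSUMED_BANDWIDTH_GB_PER_S = 10
scan_seconds = index_gb / ASSUMED_BANDWIDTH_GB_PER_S

print(f"{index_gb:.1f} GB of vectors, ~{scan_seconds:.1f} s per brute-force query")
```

Because every query has to stream the whole matrix, the scan time scales linearly with catalog size, which is what ANN indexes avoid.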

ANN Algorithms

Algorithm | How It Works | Recall@10 | Latency | Memory (relative to raw float32 vectors)
HNSW | Hierarchical graph-based search | 99% | 1ms | 100%+ (full vectors plus graph links)
IVF-PQ | Cluster + product quantization | 90-95% | 0.5ms | 10-25%
ScaNN | Anisotropic quantization | 95% | 0.3ms | 15%
FAISS (IVF-HNSW-PQ) | Hybrid approach | 97% | 1ms | 20%

Key trade-off: Recall vs. latency vs. memory.

  • HNSW: Best recall, highest memory
  • IVF-PQ: Most memory-efficient, lower recall
  • Hybrid: Best balance

Recommendation: HNSW for <10M vectors (fits in memory). IVF-PQ or ScaNN for 100M+ vectors.
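
The IVF half of IVF-PQ can be illustrated in a few lines. This is a toy sketch, not how FAISS or ScaNN implement it: the centroids are hand-picked here (a real index learns them with k-means), product quantization is omitted, and `nprobe` is the number of clusters scanned per query.

```python
import heapq


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def build_ivf(vectors, centroids):
    """Assign each vector ID to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        lists[nearest].append(vec_id)
    return lists


def ivf_search(query, vectors, centroids, lists, k=3, nprobe=1):
    """Scan only the nprobe closest clusters, then rank those candidates exactly."""
    probed = heapq.nsmallest(
        nprobe, range(len(centroids)), key=lambda c: sq_dist(query, centroids[c])
    )
    candidates = [vec_id for c in probed for vec_id in lists[c]]
    return heapq.nsmallest(k, candidates, key=lambda i: sq_dist(query, vectors[i]))


vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids = [[0.3, 0.3], [10.3, 10.3]]  # hand-picked for the example
lists = build_ivf(vectors, centroids)

query = [5.4, 5.0]  # sits between the two clusters
fast = ivf_search(query, vectors, centroids, lists, k=3, nprobe=1)
exact = ivf_search(query, vectors, centroids, lists, k=3, nprobe=2)
print(fast, exact)
```

With `nprobe=1` the search misses vector 3, which lives in the unprobed cluster; probing both clusters recovers it at the cost of scanning everything. That is the recall vs. latency trade-off in miniature.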

Vector Database Options

Database | Strengths | Deployment
FAISS | Fastest, most flexible, open-source | Self-hosted (library, not a service)
Qdrant | Filtering support, easy deployment | Self-hosted or cloud
Pinecone | Fully managed, easy to use | Cloud only
Milvus | Distributed, large-scale | Self-hosted

Step 5: Serving (8 min)

Visual Search Serving - Upload → Preprocess (5ms) → Encode (20ms) → ANN Search (5ms) → Filter → Re-Rank → Results

With CLIP, you get image-to-text and text-to-image search for free:

  • Image query: Encode image → search image embedding index
  • Text query: Encode text → search same image embedding index (shared space)
  • Hybrid query: "Red dress like this" (text + image) → combine the text and image embeddings (for example, a weighted average) and search with the result
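
The text does not specify how the embeddings are combined for a hybrid query. One common heuristic, sketched here as an assumption, is a weighted average of the unit-normalized image and text embeddings, re-normalized before searching. The `text_weight` parameter is hypothetical and would need tuning.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (raises ZeroDivisionError for a zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def combine_embeddings(image_emb, text_emb, text_weight=0.5):
    """Weighted average of the two unit vectors, re-normalized for cosine search."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    mixed = [(1 - text_weight) * i + text_weight * t for i, t in zip(img, txt)]
    return l2_normalize(mixed)


# Toy 2-d embeddings standing in for the image and text encoder outputs.
hybrid_query = combine_embeddings([2.0, 0.0], [0.0, 3.0], text_weight=0.5)
image_only = combine_embeddings([2.0, 0.0], [0.0, 3.0], text_weight=0.0)
print(hybrid_query, image_only)
```

Normalizing before mixing keeps one modality from dominating just because its raw vectors have a larger norm.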

Index Updates

Challenge | Solution
New products | Add to index in real-time (HNSW supports incremental inserts)
Updated images | Re-embed and update vector
Deleted products | Remove from index + filter at query time
Full re-index | Nightly batch rebuild with latest model
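
The "filter at query time" approach to deletions can be sketched as a tombstone set applied to over-fetched ANN results. The class name and the idea of over-fetching more than k hits are illustrative assumptions; the ANN call itself is represented by a pre-sorted list of hits.

```python
class TombstoneFilter:
    """Tracks deleted product IDs so stale index entries never reach users."""

    def __init__(self):
        self.deleted_ids = set()

    def delete(self, product_id):
        """Mark a product as deleted; the vector is physically removed at the next rebuild."""
        self.deleted_ids.add(product_id)

    def filter_hits(self, hits, k):
        """hits: (product_id, score) pairs sorted best-first, over-fetched beyond k."""
        live = [hit for hit in hits if hit[0] not in self.deleted_ids]
        return live[:k]


tombstones = TombstoneFilter()
tombstones.delete("p2")

# Pretend the ANN index returned 4 hits for a request that needs k=2.
ann_hits = [("p1", 0.95), ("p2", 0.93), ("p3", 0.90), ("p4", 0.88)]
top_hits = tombstones.filter_hits(ann_hits, k=2)
print(top_hits)
```

Over-fetching matters because filtering after the search can otherwise leave fewer than k results; the nightly rebuild then clears the tombstones.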

Step 6: Evaluation (5 min)

Offline Metrics

Metric | What It Measures
Recall@K | % of all relevant items that appear in the top K results
Precision@K | % of top K results that are relevant
MRR | Mean, over queries, of the reciprocal rank of the first relevant result
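
These three metrics are short enough to write out. A minimal sketch, assuming each query comes with a ranked list of result IDs and a set of relevant IDs; the example IDs are made up.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)


def mean_reciprocal_rank(all_ranked, all_relevant):
    """Average over queries of 1/rank of the first relevant result (0 if none)."""
    total = 0.0
    for ranked, relevant in zip(all_ranked, all_relevant):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(all_ranked)


ranked = ["a", "b", "c", "d"]
relevant = {"b", "d", "z"}  # "z" was never retrieved

p3 = precision_at_k(ranked, relevant, k=3)
r3 = recall_at_k(ranked, relevant, k=3)
mrr = mean_reciprocal_rank([ranked, ["x", "y"]], [relevant, {"x"}])
print(p3, r3, mrr)
```

Here precision@3 and recall@3 are both 1/3, but for different reasons: one of the three returned items is relevant, and one of the three relevant items was returned.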

Online Metrics

  • Click-through rate: Do users click on search results?
  • Purchase-through rate: Do users buy from visual search?
  • Query abandonment: Do users give up after seeing results?

Practice Problems

Problem 1: Duplicate Product Detection

Direction

Sellers on your marketplace upload the same product with different photos. How do you detect duplicates?

Key Insight

Near-duplicate detection: Use image embeddings + a similarity threshold. If two products have embedding similarity > 0.95, flag as potential duplicates. But images can be the same product from different angles - combine with text similarity (title, description) and structured features (price, brand). Use a classifier: given two products, predict P(duplicate) using image similarity, text similarity, and feature overlap.
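
The pairwise classifier can be sketched as a logistic score over the three signals. Everything numeric here is a placeholder: the weights and bias are invented for illustration and would be learned from labeled duplicate pairs, and Jaccard overlap on title tokens stands in for whatever text similarity is actually used.

```python
import math


def jaccard(title_a, title_b):
    """Token-set overlap between two titles, in [0, 1]."""
    tokens_a = set(title_a.lower().split())
    tokens_b = set(title_b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def duplicate_probability(image_sim, title_a, title_b, same_brand,
                          weights=(6.0, 3.0, 1.5), bias=-7.0):
    """P(duplicate) from image similarity, title overlap, and brand match.

    The weights and bias are hypothetical placeholders, not trained values.
    """
    z = (bias
         + weights[0] * image_sim
         + weights[1] * jaccard(title_a, title_b)
         + weights[2] * (1.0 if same_brand else 0.0))
    return 1.0 / (1.0 + math.exp(-z))


p_dup = duplicate_probability(0.97, "Acme Red Floral Dress", "acme floral dress red", True)
p_diff = duplicate_probability(0.96, "Acme Red Floral Dress", "Zeta Black Leather Jacket", False)
print(round(p_dup, 3), round(p_diff, 3))
```

Both pairs clear the 0.95 image-similarity threshold, but only the first scores as a likely duplicate once text and brand are taken into account.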

Problem 2: Attribute-Aware Similarity

Direction

A user uploads a photo of a red floral dress and wants to find similar dresses - same style but different colors/patterns are fine. How do you handle attributes?

Key Insight

Disentangled embeddings: Train the model to separate style from color/pattern. Use attribute-aware contrastive learning: same-style-different-color pairs are positives. At search time, match on the style component of the embedding while allowing variation in color/pattern. Alternatively: extract attributes (category, style, sleeve length, neckline) and allow users to filter results by specific attributes.
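
Matching on the style component can be sketched as cosine similarity over a slice of the embedding. This assumes a hypothetical layout in which the first three dimensions encode style and the rest encode color/pattern; a real disentangled model would define its own partition (or separate heads).

```python
import math

STYLE_DIMS = slice(0, 3)  # hypothetical: dims 0-2 are style, the rest are color/pattern


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def style_similarity(query_emb, item_emb):
    """Compare only the style sub-vector, ignoring color/pattern dimensions."""
    return cosine(query_emb[STYLE_DIMS], item_emb[STYLE_DIMS])


# Toy 5-d embeddings: [style(3 dims) | color(2 dims)].
red_floral_dress = [1.0, 0.0, 0.0, 1.0, 0.0]
blue_same_style = [1.0, 0.0, 0.0, 0.0, 1.0]
red_other_style = [0.0, 1.0, 0.0, 1.0, 0.0]

full_same_style = cosine(red_floral_dress, blue_same_style)
full_other_style = cosine(red_floral_dress, red_other_style)
style_same = style_similarity(red_floral_dress, blue_same_style)
style_other = style_similarity(red_floral_dress, red_other_style)
print(full_same_style, full_other_style, style_same, style_other)
```

On the full embedding the two candidates look equally similar to the query; restricting the comparison to the style dimensions separates them.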

Interview Cheat Sheet

Question Pattern | Framework | Key Phrases
"Design visual search" | Embed → index → search | "CLIP embeddings, HNSW/FAISS index, sub-10ms ANN search"
"How do you scale to 100M?" | ANN algorithms | "Product quantization reduces memory 4-8x, HNSW gives 99% recall at 1ms"
"Image + text search?" | Cross-modal embeddings | "CLIP maps images and text to shared space - search either modality"

Spaced Repetition Checkpoints

  • Day 0: Explain the visual search pipeline from memory. What's the embedding → ANN → results flow?
  • Day 3: Compare HNSW vs. IVF-PQ. When would you use each?
  • Day 7: Design visual search for a furniture marketplace in 45 minutes.
  • Day 14: Explain hard negative mining and why it matters for metric learning.
  • Day 21: Mock interview with follow-ups on cross-modal search and scaling to 1B images.

© 2026 EngineersOfAI. All rights reserved.