Design: Visual Search - Embedding Models and Nearest Neighbor Search
Reading time: ~22 min | Interview relevance: High | Roles: MLE
The Real Interview Moment
"Design a visual search system where users take a photo and find similar products." You describe extracting features with a CNN. The interviewer asks: "You have 100M product images. How do you search through them in under 100ms? A brute-force comparison over 100M embeddings takes 10 seconds."
Visual search tests your understanding of embedding spaces, approximate nearest neighbor (ANN) algorithms, and the trade-off between search accuracy and latency at scale.
What You Will Master
- Image embedding models for visual similarity
- ANN algorithms: HNSW, IVF, product quantization
- Cross-modal search: image → text, text → image
- Indexing and serving at 100M+ scale
- Relevance feedback and online learning
The Complete Design
Step 1: Requirements (5 min)
Functional requirements:
- Upload a photo → find visually similar products
- Text query → find matching product images
- Filter results by category, price, availability
- Scale: 100M product images, 10M searches/day
Non-functional requirements:
- Latency: <200ms end-to-end (including image processing)
- Relevance: Top-5 results are relevant 80%+ of the time
- Freshness: New products searchable within 1 hour
- Index size: Must fit in memory (or close to it) for speed
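A quick back-of-envelope turns the daily search volume into a per-second load. This is a sketch only: the peak factor is an assumption for illustration and is not one of the stated requirements.

```python
# Back-of-envelope load estimate from the stated requirement of 10M searches/day.
SEARCHES_PER_DAY = 10_000_000
SECONDS_PER_DAY = 24 * 60 * 60
PEAK_FACTOR = 3  # assumption: peak traffic is about 3x the daily average

avg_qps = SEARCHES_PER_DAY / SECONDS_PER_DAY
peak_qps = avg_qps * PEAK_FACTOR

print(f"average QPS: {avg_qps:.0f}")
print(f"peak QPS (assumed {PEAK_FACTOR}x): {peak_qps:.0f}")
```

Roughly a hundred queries per second on average is modest; the hard part of this design is the 100M-vector index, not the request rate.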
Step 2: Problem Formulation (5 min)
ML problem type: Metric learning + approximate nearest neighbor search.
The core idea: Map images (and optionally text) into a shared embedding space where similar items are close together. Then, given a query image, find the nearest neighbors in that space.
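To make the formulation concrete, here is a minimal brute-force sketch of "nearest neighbors in embedding space" using cosine similarity. The 3-dimensional vectors and product names are made up for illustration; real embeddings come from the model and have hundreds of dimensions.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query, catalog, k):
    """Return the k (product_id, cosine_similarity) pairs closest to the query."""
    q = normalize(query)
    scored = [
        (pid, sum(a * b for a, b in zip(q, normalize(vec))))
        for pid, vec in catalog.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings, for illustration only.
catalog = {
    "black_dress": [0.9, 0.1, 0.0],
    "black_jacket": [0.7, 0.6, 0.1],
    "red_car": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.2, 0.0], catalog, k=2)
print(results)
```

This exhaustive scan is the baseline that ANN indexes approximate; it is exact but scales linearly with catalog size.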
Step 3: Embedding Model (8 min)
Model Options
| Model | Approach | Embedding Dim | Best For |
|---|---|---|---|
| ResNet + Contrastive Loss | Train on product image pairs | 256-512 | Visual similarity only |
| CLIP | Pre-trained image-text alignment | 512-768 | Cross-modal (image ↔ text) |
| DINOv2 | Self-supervised vision transformer | 384-1536 (768 for ViT-B) | General visual features |
| Fine-tuned CLIP | CLIP + fine-tune on product data | 512 | Best for product search |
Recommendation: Start with CLIP (zero-shot cross-modal search), then fine-tune on your product catalog with triplet or contrastive loss.
Training for Visual Similarity
Triplet loss: Given an anchor image, a positive (similar product), and a negative (different product):
Loss = max(0, d(anchor, positive) - d(anchor, negative) + margin)
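A minimal sketch of the formula above, assuming Euclidean distance for d and a margin of 0.2 (both are illustrative choices; the 2-dimensional vectors are made up):

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin) with Euclidean d."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

anchor = [0.0, 0.0]
positive = [0.1, 0.0]       # distance 0.1 from the anchor
easy_negative = [1.0, 0.0]  # distance 1.0: already far beyond the margin
hard_negative = [0.2, 0.0]  # distance 0.2: inside the margin

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
print(easy, hard)
```

The easy negative produces zero loss and therefore no gradient, while the close negative produces a positive loss.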
Hard negative mining: The most important training technique - select negatives that are close to the anchor but from a different category. A black dress vs. a black jacket is a harder negative than a black dress vs. a red car.
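Avoid purely random negatives: most are trivially different from the anchor, already satisfy the margin, and contribute zero loss, so the model learns little from them. Semi-hard negatives (farther from the anchor than the positive, but still within the margin) tend to give the most stable training signal, while the very hardest negatives are more likely to be label noise and can destabilize early training. In the interview, mentioning hard negative mining shows you understand metric learning beyond the textbook.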
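One common way to operationalize this is to pick, per anchor, a negative that is farther away than the positive but still inside the margin. The sketch below assumes Euclidean distance and a small in-memory candidate pool; the function name, fallback rule, and toy vectors are illustrative rather than a prescribed recipe.

```python
import math

def pick_negative(anchor, positive, candidates, margin=0.2):
    """Pick a semi-hard negative: d(a, p) < d(a, n) < d(a, p) + margin.

    `candidates` maps a label to the embedding of a known different product.
    Falls back to the closest candidate when no semi-hard one exists.
    """
    d_pos = math.dist(anchor, positive)
    distances = {name: math.dist(anchor, vec) for name, vec in candidates.items()}
    semi_hard = {n: d for n, d in distances.items() if d_pos < d < d_pos + margin}
    pool = semi_hard or distances
    # Among the eligible negatives, take the one closest to the anchor.
    return min(pool, key=pool.get)

anchor, positive = [0.0, 0.0], [0.3, 0.0]
candidates = {
    "black_jacket": [0.4, 0.0],  # semi-hard: between 0.3 and 0.5
    "red_car": [2.0, 0.0],       # easy: far outside the margin
}
chosen = pick_negative(anchor, positive, candidates)
print(chosen)
```

In practice this selection usually runs within each training batch, using the current model's embeddings.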
Step 4: ANN Indexing (8 min)
Why Not Brute Force?
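100M vectors × 256 dimensions × 4 bytes (float32) ≈ 100 GB. Brute-force cosine similarity costs O(N × D) per query, which at 100M vectors is roughly 10 seconds per query on a single machine. The ANN lookup itself needs to take <10ms so that image preprocessing, embedding inference, and filtering still fit in the 200ms end-to-end budget.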
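The arithmetic can be checked in a few lines. The memory figure follows directly from the numbers above; the scan throughput is an assumed round number for a single core streaming from RAM, used only to show where an estimate of about 10 seconds can come from.

```python
NUM_VECTORS = 100_000_000
DIM = 256
BYTES_PER_FLOAT32 = 4

index_bytes = NUM_VECTORS * DIM * BYTES_PER_FLOAT32
index_gb = index_bytes / 1e9

# Assumption: a brute-force scan is memory-bound at about 10 GB/s on one core.
ASSUMED_SCAN_GB_PER_S = 10
scan_seconds = index_gb / ASSUMED_SCAN_GB_PER_S

print(f"raw vectors: {index_gb:.1f} GB")
print(f"full scan at {ASSUMED_SCAN_GB_PER_S} GB/s: {scan_seconds:.1f} s per query")
```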
ANN Algorithms
| Algorithm | How It Works | Recall@10 (typical) | Latency per query (typical) | Memory (relative to raw float32 vectors) |
|---|---|---|---|---|
| HNSW | Hierarchical graph-based search | 99% | 1ms | 100%+ (full vectors plus graph links) |
| IVF-PQ | Cluster + product quantization | 90-95% | 0.5ms | 10-25% |
| ScaNN | Anisotropic quantization | 95% | 0.3ms | 15% |
| IVF-HNSW-PQ (available in FAISS) | Hybrid: IVF with an HNSW coarse quantizer + PQ codes | 97% | 1ms | 20% |
Key trade-off: Recall vs. latency vs. memory.
- HNSW: Best recall, highest memory
- IVF-PQ: Most memory-efficient, lower recall
- Hybrid: Best balance
Recommendation: HNSW for <10M vectors (fits in memory). IVF-PQ or ScaNN for 100M+ vectors.
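To show where the recall vs. latency trade-off comes from, here is a toy inverted-file (IVF) search in pure Python. It is a simplified sketch: centroids are sampled from the data instead of trained with k-means, vectors are stored uncompressed (no product quantization), and the sizes are tiny. The function names are illustrative and are not FAISS APIs.

```python
import math
import random

def nearest(vec, points):
    """Index of the point closest to vec (Euclidean distance)."""
    return min(range(len(points)), key=lambda i: math.dist(vec, points[i]))

def build_ivf(vectors, nlist, seed=0):
    """Assign every vector to its nearest centroid (centroids sampled from the data)."""
    centroids = random.Random(seed).sample(vectors, nlist)
    lists = [[] for _ in range(nlist)]
    for idx, vec in enumerate(vectors):
        lists[nearest(vec, centroids)].append(idx)
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe inverted lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [idx for c in order[:nprobe] for idx in lists[c]]
    candidates.sort(key=lambda idx: math.dist(query, vectors[idx]))
    return candidates[:k]

rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(2000)]
query = [rng.random() for _ in range(8)]
centroids, lists = build_ivf(vectors, nlist=20)

# Probing every list is equivalent to brute force; probing fewer trades recall for speed.
exact = ivf_search(query, vectors, centroids, lists, k=10, nprobe=20)
approx = ivf_search(query, vectors, centroids, lists, k=10, nprobe=3)
recall = len(set(exact) & set(approx)) / len(exact)
print(f"recall@10 with nprobe=3: {recall:.2f}")
```

Raising `nprobe` scans more of the index and pushes recall toward 100%, at the cost of latency. Production libraries add trained centroids and compressed codes on top of the same idea.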
Vector Database Options
| Database | Strengths | Deployment |
|---|---|---|
| FAISS | Fastest, most flexible, open-source | Self-hosted (library, not a service) |
| Qdrant | Filtering support, easy deployment | Self-hosted or cloud |
| Pinecone | Fully managed, easy to use | Cloud only |
| Milvus | Distributed, large-scale | Self-hosted or managed (Zilliz Cloud) |
Step 5: Serving (8 min)
Cross-Modal Search
With CLIP, you get image-to-text and text-to-image search for free:
- Image query: Encode image → search image embedding index
- Text query: Encode text → search same image embedding index (shared space)
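- Hybrid query: "Red dress like this" (text + image) → combine embeddings (e.g., a weighted average of the normalized image and text embeddings, re-normalized before search)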
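One simple way to combine the two embeddings for a hybrid query is a weighted average of the unit-length vectors, re-normalized afterwards. This is a sketch of that one option, not the only approach; the function name, the default weight, and the toy 2-dimensional vectors are assumptions for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def combine_query(image_emb, text_emb, text_weight=0.5):
    """Weighted average of unit-length image and text embeddings, re-normalized.

    Assumes both embeddings come from the same shared space (e.g. the same
    CLIP model) and that the weighted sum is not the zero vector.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    mixed = [(1 - text_weight) * i + text_weight * t for i, t in zip(img, txt)]
    return l2_normalize(mixed)

# Toy 2-dimensional stand-ins for an image embedding and a text embedding.
hybrid = combine_query([2.0, 0.0], [0.0, 3.0], text_weight=0.5)
print(hybrid)
```

The weight controls how much the text modifier ("red") pulls the query away from the photo; it would normally be tuned against relevance judgments.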
Index Updates
| Challenge | Solution |
|---|---|
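| New products | Add to the index incrementally (HNSW and IVF indexes both accept inserts without a rebuild) |
| Updated images | Re-embed and update vector |
| Deleted products | Mark as deleted and filter at query time; physically remove at the next rebuild |
| Full re-index | Nightly batch rebuild with latest model; after a model change every image must be re-embedded, because embeddings from different model versions are not comparable |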
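
The update strategies in the table can be sketched with a toy in-memory index that uses tombstones for deletions. The class and method names are hypothetical, and the brute-force search stands in for a real ANN index.

```python
import math

class SoftDeleteIndex:
    """Toy index: brute-force search plus a tombstone set for deleted products."""

    def __init__(self):
        self.vectors = {}     # product_id -> embedding
        self.deleted = set()  # tombstones, purged at the next rebuild

    def upsert(self, product_id, embedding):
        """Insert a new product or overwrite the vector of an updated one."""
        self.vectors[product_id] = embedding
        self.deleted.discard(product_id)

    def delete(self, product_id):
        """Soft delete: keep the vector but hide it from search results."""
        self.deleted.add(product_id)

    def search(self, query, k):
        """Return the k nearest live product ids (tombstoned ids are filtered out)."""
        live = [pid for pid in self.vectors if pid not in self.deleted]
        live.sort(key=lambda pid: math.dist(query, self.vectors[pid]))
        return live[:k]

    def rebuild(self):
        """Batch rebuild: physically drop tombstoned vectors."""
        for pid in self.deleted:
            self.vectors.pop(pid, None)
        self.deleted.clear()

index = SoftDeleteIndex()
index.upsert("a", [0.0, 0.0])
index.upsert("b", [0.1, 0.0])
index.upsert("c", [5.0, 5.0])
index.delete("a")
hits = index.search([0.0, 0.0], k=2)
print(hits)
```

Filtering tombstones at query time keeps deletes cheap; the periodic rebuild reclaims the memory.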
Step 6: Evaluation (5 min)
Offline Metrics
| Metric | What It Measures |
|---|---|
| Recall@K | % of all relevant items that appear in the top K results |
| Precision@K | % of the top K results that are relevant |
| MRR | Mean over queries of 1 / rank of the first relevant result |
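
The three offline metrics can be sketched directly from their definitions. The product ids and relevance labels below are made up for illustration.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k results."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is returned."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked_results, relevant_set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

ranked = ["p7", "p2", "p9", "p4", "p1"]
relevant = {"p2", "p4", "p8"}
r5 = recall_at_k(ranked, relevant, k=5)
p5 = precision_at_k(ranked, relevant, k=5)
mrr = mean_reciprocal_rank([(ranked, relevant), (["p3", "p5"], {"p3"})])
print(r5, p5, mrr)
```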
Online Metrics
- Click-through rate: Do users click on search results?
- Purchase-through rate: Do users buy from visual search?
- Query abandonment: Do users give up after seeing results?
Practice Problems
Problem 1: Duplicate Product Detection
Direction
Sellers on your marketplace upload the same product with different photos. How do you detect duplicates?
Key Insight
Near-duplicate detection: Use image embeddings + a similarity threshold. If two products have cosine similarity > 0.95 (a starting point to tune on labeled pairs), flag them as potential duplicates. Image similarity alone is not enough: the same product photographed from different angles can fall below the threshold, and different products can look alike. Combine it with text similarity (title, description) and structured features (price, brand). Use a classifier: given two products, predict P(duplicate) using image similarity, text similarity, and feature overlap.
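A minimal sketch of the candidate-generation step and the pairwise features such a classifier could consume. The field names, the Jaccard title similarity, and the toy products are assumptions for illustration; the all-pairs loop is only viable for small sets, and at catalog scale candidates would come from an ANN lookup instead.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def jaccard(title_a, title_b):
    """Word-level Jaccard overlap between two titles (case-insensitive)."""
    a, b = set(title_a.lower().split()), set(title_b.lower().split())
    return len(a & b) / len(a | b)

def pair_features(p, q):
    """Features a duplicate classifier could take as input for one product pair."""
    return {
        "image_sim": cosine(p["embedding"], q["embedding"]),
        "title_sim": jaccard(p["title"], q["title"]),
        "same_brand": float(p["brand"] == q["brand"]),
        "price_gap": abs(p["price"] - q["price"]) / max(p["price"], q["price"]),
    }

def candidate_pairs(products, image_threshold=0.95):
    """Flag pairs whose image similarity exceeds the threshold, with their features."""
    flagged = []
    for p, q in combinations(products, 2):
        feats = pair_features(p, q)
        if feats["image_sim"] > image_threshold:
            flagged.append((p["id"], q["id"], feats))
    return flagged

# Hypothetical listings with toy 2-dimensional embeddings.
products = [
    {"id": "a", "embedding": [1.0, 0.0], "title": "Acme Steel Water Bottle 500ml",
     "brand": "Acme", "price": 20.0},
    {"id": "b", "embedding": [0.99, 0.05], "title": "acme water bottle steel 500ml",
     "brand": "Acme", "price": 19.0},
    {"id": "c", "embedding": [0.0, 1.0], "title": "Oak Coffee Table",
     "brand": "Nord", "price": 150.0},
]
flagged = candidate_pairs(products)
print(flagged)
```

The flagged pairs and their feature dictionaries would then be scored by the P(duplicate) classifier described above.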
Problem 2: Fine-Grained Fashion Search
Direction
A user uploads a photo of a red floral dress and wants to find similar dresses - same style but different colors/patterns are fine. How do you handle attributes?
Key Insight
Disentangled embeddings: Train the model to separate style from color/pattern. Use attribute-aware contrastive learning: same-style-different-color pairs are positives. At search time, match on the style component of the embedding while allowing variation in color/pattern. Alternatively: extract attributes (category, style, sleeve length, neckline) and allow users to filter results by specific attributes.
Interview Cheat Sheet
| Question Pattern | Framework | Key Phrases |
|---|---|---|
| "Design visual search" | Embed → index → search | "CLIP embeddings, HNSW/FAISS index, sub-10ms ANN search" |
| "How do you scale to 100M?" | ANN algorithms | "Product quantization cuts index memory to 10-25% of the raw vectors, HNSW gives 99% recall at 1ms" |
| "Image + text search?" | Cross-modal embeddings | "CLIP maps images and text to shared space - search either modality" |
Spaced Repetition Checkpoints
- Day 0: Explain the visual search pipeline from memory. What's the embedding → ANN → results flow?
- Day 3: Compare HNSW vs. IVF-PQ. When would you use each?
- Day 7: Design visual search for a furniture marketplace in 45 minutes.
- Day 14: Explain hard negative mining and why it matters for metric learning.
- Day 21: Mock interview with follow-ups on cross-modal search and scaling to 1B images.
What's Next
- Anomaly Detection - Unsupervised methods for detecting unusual patterns
- Machine Translation - Sequence-to-sequence at scale
