
Visual Search and Product Discovery

The Photo That Started a Purchase

A customer is at a rooftop bar in Barcelona. The host is wearing a pair of white linen trousers that are exactly what they have been looking for. They take a discreet photo. They open Pinterest, tap the camera icon, and search with the photo. Within 2 seconds they have 40 results matching the style, fabric, and cut - linked directly to purchase pages on ASOS, Zara, and Net-a-Porter.

This is visual search in 2024. And the underlying technology - teaching a machine to understand visual similarity across millions of product images in real time - is one of the most practically useful applications of deep learning in retail.

The impact is measurable. Retailers that deploy visual search see 48% higher average order values from users who engage with it (Slyce, 2022). Pinterest reports that "Shop the Look" features have driven 200 million+ monthly visual searches. The conversion rate from visual search clicks is 2-3x higher than from text search clicks. Why? Because the user knows exactly what they want - they can see it.

The technical problem is harder than it first appears. You are not just matching pixels. You need to match semantic visual concepts across images taken under completely different conditions: different lighting, different backgrounds, different angles, different camera qualities. A dress worn by a model in a studio with perfect lighting needs to match the same dress worn casually in a street photo. The geometry differs, the colors shift with lighting, the background is noise. A model that memorizes pixel patterns fails immediately.


Why This Exists

Text search has a fundamental limitation for fashion, home decor, and lifestyle products: the vocabulary is imprecise. How do you search for "that kind of earthy terracotta mid-century modern lamp I saw at my friend's house"? You cannot. Text forces users to translate visual concepts into words, which they often cannot do accurately for design and aesthetic attributes.

This translation problem causes search to fail at exactly the moments of highest purchase intent. Users who cannot articulate what they want in words abandon the search. The product exists in the catalog. The user wants it. But text search cannot bridge the gap.

Visual search inverts this. Users provide an image (their own photo, a screenshot, a saved image) and the system finds semantically similar items. The user's intent is encoded in the image itself - no vocabulary needed.

Additionally, visual search enables serendipitous discovery that keyword search cannot. Text search requires knowing what to search for. A browsing user who sees a product they like while scrolling can instantly find similar options without knowing the brand, style name, or material. This is the digital equivalent of a retail associate saying "if you like this, you might love these."


Historical Context

Visual search predates modern deep learning, but the early systems were crude.

Content-Based Image Retrieval (CBIR) systems from the 1990s used handcrafted features: color histograms, edge detectors (Sobel, Canny), texture descriptors (Gabor filters, LBP). IBM's QBIC system (1993) was one of the first commercial visual search products. Results were poor by modern standards - the system matched color distributions but could not understand semantic content.

The ImageNet revolution (2012) changed everything. AlexNet's breakthrough on ImageNet classification showed that deep convolutional neural networks could learn visual features far superior to handcrafted approaches. Researchers immediately saw the implication for retrieval: use the penultimate layer of a trained CNN as an image embedding, then do similarity search in that embedding space.

The first generation of visual search systems (2014-2016) fine-tuned VGG or ResNet on product images and used the resulting activations as embeddings. Pinterest launched visual search in 2015, followed by its camera-based Lens in 2017. Amazon launched Product Photo Search. The quality was already dramatically better than CBIR.

The second generation (2018-2021) brought metric learning - training specifically for similarity rather than classification. Triplet loss and contrastive loss optimized embeddings directly for retrieval: similar items pulled close in embedding space, dissimilar items pushed apart.

CLIP (Contrastive Language-Image Pretraining, OpenAI 2021) was the paradigm shift for retail specifically. CLIP trains on 400 million image-text pairs scraped from the internet, learning a joint embedding space where images and their text descriptions are close. This enables zero-shot visual search with text queries ("show me red evening dresses") and enables combining image and text queries naturally. CLIP embeddings transfer exceptionally well to retail product images with minimal fine-tuning.


Core Concepts

Image Embeddings for Similarity

An image embedding is a dense vector representation of an image such that visually similar images have nearby vectors in embedding space. The key property is not that similar images have identical embeddings - it is that the distance between embeddings correlates with visual similarity as a human would judge it.

The choice of similarity metric matters:

  • Cosine similarity: dot product of L2-normalized vectors. Invariant to vector magnitude. Standard for retrieval.
  • L2 (Euclidean) distance: sensitive to magnitude differences. Less common for embeddings.
  • Inner product: equivalent to cosine if vectors are normalized.
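
A quick numeric check of the relationship between these metrics, using nothing beyond NumPy:

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity from raw vectors
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product of L2-normalized vectors gives the same value
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
inner = np.dot(a_n, b_n)

print(cosine, inner)  # both 0.96 - normalize once, then use fast inner-product search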

For CLIP, the embedding is produced by:

  1. Pass the image through the CLIP Vision Encoder (ViT-B/32 or ViT-L/14)
  2. Get the CLS token output (global image representation)
  3. Project through a linear layer to get a 512 or 768-dimensional embedding
  4. L2-normalize the embedding

The critical property of CLIP embeddings for retail: because CLIP was trained on image-text pairs describing objects and their attributes, the embedding space captures semantic attributes - color, style, shape, material - not just surface texture.

Product Attribute Extraction

Beyond pure similarity search, retail applications often need explicit product attributes: category (dress, pants, shoes), color, style (bohemian, minimalist, vintage), material (cotton, silk, denim), pattern (solid, striped, floral), fit (slim, relaxed, oversized).

Two approaches:

Classification-based: Train separate classifiers for each attribute. Pros: interpretable, well-understood. Cons: requires labeled data for each attribute, does not generalize to unseen attributes.

CLIP-based zero-shot: For each attribute value, create a text prompt ("a photo of a red dress", "a photo of a blue dress") and compute the CLIP text embedding. Assign the attribute whose text embedding is closest to the image embedding. Pros: no labeled training data needed for new attributes. Cons: less accurate than supervised for known attributes.

In production, the combination works best: supervised classifiers for high-traffic attributes (category, color) where labeled data is abundant, CLIP zero-shot for long-tail attributes (aesthetic style, occasion type).

CLIP in Detail

CLIP (Radford et al., 2021) jointly trains two encoders:

  • Image encoder: ViT (Vision Transformer) or ResNet
  • Text encoder: Transformer

Training objective: given a batch of (image, text) pairs, maximize the cosine similarity between matching pairs while minimizing similarity between non-matching pairs. This is contrastive learning at scale.
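
A minimal sketch of that symmetric contrastive objective in PyTorch - illustrative rather than OpenAI's actual training code. Embeddings are assumed to be L2-normalized, and the diagonal of the N x N similarity matrix holds the matching pairs:

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched (image, text) pairs."""
    logits = image_emb @ text_emb.t() / temperature               # (N, N) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # positives on diagonal
    loss_i2t = F.cross_entropy(logits, targets)                   # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)               # text -> matching image
    return (loss_i2t + loss_t2i) / 2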

The result: a shared embedding space where "a red evening gown" (text) and an image of a red evening gown are geometrically close. This enables:

  1. Text-to-image search: encode query text, find nearest images
  2. Image-to-image search: encode query image, find nearest images
  3. Combined (multimodal) search: combine text and image embeddings

For retail, multimodal search is particularly powerful: "find something like this [image] but in navy blue [text]". The user provides visual context and text modification simultaneously.

Approximate Nearest Neighbor at Scale

A product catalog of 1 million items, each represented as a 512-dimensional CLIP embedding, requires efficient similarity search.

FAISS (Facebook AI Similarity Search): The standard library for ANN in production. Key index types:

  • IndexFlatL2 / IndexFlatIP: Exact search. Use for catalogs under 100K items where latency allows.
  • IndexIVFFlat: Inverted file index. Clusters embeddings into k cells. At query time, searches only the nearest nprobe cells. 100x faster than exact, 95%+ recall with proper nprobe.
  • IndexHNSW: Hierarchical Navigable Small World graph. Generally faster than IVF at the same recall level. Better for online insertion of new items.
  • IndexIVFPQ: Adds Product Quantization (PQ) compression. Stores embeddings in compressed form (32x smaller than float32). Essential for billion-scale catalogs.
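
A sketch of constructing an IVF-PQ index with FAISS. The sizes here are illustrative, not tuned values; with m=64 sub-quantizers at 8 bits each, every 512-dim float32 vector (2048 bytes) compresses to 64 bytes - the 32x figure above:

import faiss
import numpy as np

d, nlist, m, nbits = 512, 1024, 64, 8            # 64 codes x 8 bits = 64 bytes/vector
quantizer = faiss.IndexFlatL2(d)                 # coarse quantizer defining IVF cells
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

# For L2-normalized embeddings, L2 ranking is equivalent to cosine ranking
train_vecs = np.random.rand(100_000, d).astype(np.float32)  # placeholder embeddings
faiss.normalize_L2(train_vecs)
index.train(train_vecs)                          # learns centroids and PQ codebooks
index.add(train_vecs)
index.nprobe = 32                                # cells visited per query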

ScaNN (Google): An alternative to FAISS built on anisotropic vector quantization, used in production at Google.

The retrieval quality metric is Recall@K: what fraction of true nearest neighbors appear in the top-K returned by the ANN index? Production systems target Recall@100 > 90%.
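
A sketch of measuring Recall@K by comparing an ANN index against exact search on the same embeddings (index and query names are placeholders):

import numpy as np

def recall_at_k(ann_index, exact_index, queries: np.ndarray, k: int = 100) -> float:
    """Fraction of the exact top-k neighbors that the ANN index also returns."""
    _, ann_ids = ann_index.search(queries, k)      # queries: float32, shape (N, d)
    _, true_ids = exact_index.search(queries, k)
    hits = sum(len(set(a) & set(t)) for a, t in zip(ann_ids, true_ids))
    return hits / (len(queries) * k)

Sweep nprobe (or HNSW's efSearch) and re-measure to trace the recall-latency curve before committing to an operating point.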

Composing Query Signals

The most sophisticated retail visual search systems combine signals:

Image + text modifier: User provides a query image plus a text description of desired modifications. CLIP enables a simple composition: e_query = e_image + α · e_text, where α weights the text component. More sophisticated: train a Composed Image Retrieval (CIR) model that learns to combine image and text embeddings for this specific task.

Multiple query images: "I want something that looks like this dress [image 1] but with this pattern [image 2]." Encode both images and interpolate embeddings.
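
A sketch of that interpolation, assuming embeddings come from a CLIP embedder like the one in the implementation section below; the weight is a design choice to tune:

import numpy as np

def blend_image_queries(emb_a: np.ndarray, emb_b: np.ndarray,
                        weight_a: float = 0.6) -> np.ndarray:
    """Interpolate two L2-normalized image embeddings into one query vector."""
    combined = weight_a * emb_a + (1 - weight_a) * emb_b
    return combined / np.linalg.norm(combined)    # re-normalize for cosine search

# e.g. silhouette from the dress photo, pattern emphasis from the second image:
# query = blend_image_queries(embedder.embed_image(img1), embedder.embed_image(img2))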

Category-aware search: "Find shoes similar to this [image]." Restrict retrieval to items in the shoes category. Combine ANN retrieval with metadata filtering.


Practical Implementation

import torch
import numpy as np
import pandas as pd
from PIL import Image
import requests
from io import BytesIO
import faiss
from typing import List, Optional

# ============================================================
# 1. CLIP Embedding Extraction
# ============================================================

class CLIPEmbedder:
    """
    Extract CLIP embeddings for images and text.
    Uses OpenAI's CLIP model via the official clip package.
    """

    def __init__(self, model_name: str = 'ViT-B/32', device: str = None):
        try:
            import clip
            self.clip = clip
        except ImportError:
            raise ImportError(
                "Install with: pip install git+https://github.com/openai/CLIP.git"
            )

        self.device = device or ('cuda' if torch.cuda.is_available() else 'cpu')
        self.model, self.preprocess = self.clip.load(model_name, device=self.device)
        self.model.eval()

        # Get embedding dimension with a dummy forward pass
        dummy = torch.zeros(1, 3, 224, 224).to(self.device)
        with torch.no_grad():
            self.embedding_dim = self.model.encode_image(dummy).shape[-1]

        print(f"CLIP model: {model_name}, embedding dim: {self.embedding_dim}, device: {self.device}")

    def embed_image(self, image: Image.Image) -> np.ndarray:
        """Embed a single PIL image. Returns an L2-normalized embedding."""
        image_tensor = self.preprocess(image).unsqueeze(0).to(self.device)
        with torch.no_grad():
            embedding = self.model.encode_image(image_tensor)
            embedding = embedding / embedding.norm(dim=-1, keepdim=True)
        return embedding.cpu().numpy().squeeze()

    def embed_images_batch(
        self,
        images: List[Image.Image],
        batch_size: int = 64
    ) -> np.ndarray:
        """
        Embed a list of PIL images in batches.
        Returns an array of shape (N, embedding_dim).
        """
        all_embeddings = []
        for i in range(0, len(images), batch_size):
            batch = images[i:i + batch_size]
            tensors = torch.stack([self.preprocess(img) for img in batch]).to(self.device)
            with torch.no_grad():
                embeddings = self.model.encode_image(tensors)
                embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
            all_embeddings.append(embeddings.cpu().numpy())
        return np.vstack(all_embeddings)

    def embed_text(self, text: str) -> np.ndarray:
        """Embed a text query. Returns an L2-normalized embedding."""
        tokens = self.clip.tokenize([text]).to(self.device)
        with torch.no_grad():
            embedding = self.model.encode_text(tokens)
            embedding = embedding / embedding.norm(dim=-1, keepdim=True)
        return embedding.cpu().numpy().squeeze()

    def embed_texts_batch(self, texts: List[str]) -> np.ndarray:
        """Embed a list of text queries."""
        tokens = self.clip.tokenize(texts).to(self.device)
        with torch.no_grad():
            embeddings = self.model.encode_text(tokens)
            embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
        return embeddings.cpu().numpy()

    def zero_shot_classify(
        self,
        image: Image.Image,
        class_texts: List[str]
    ) -> dict:
        """
        Zero-shot classification of an image into one of several text classes.
        Example: class_texts = ["a red garment", "a blue garment", "a green garment"]
        """
        image_embedding = self.embed_image(image)
        text_embeddings = self.embed_texts_batch(class_texts)

        # Cosine similarities, scaled by CLIP's learned logit scale (~100) so
        # the softmax is not near-uniform; subtract the max for stability
        logits = 100.0 * np.dot(image_embedding, text_embeddings.T)
        logits = logits - logits.max()
        probs = np.exp(logits) / np.exp(logits).sum()

        return {text: float(prob) for text, prob in zip(class_texts, probs)}


# ============================================================
# 2. Product Catalog Indexing
# ============================================================

class ProductVisualIndex:
    """
    Builds and manages a FAISS index of product image embeddings.
    Supports image-to-image, text-to-image, and multimodal search.
    """

    def __init__(
        self,
        embedder: CLIPEmbedder,
        embedding_dim: int = 512,
        index_type: str = 'ivf_flat',
        n_clusters: int = 256,
        nprobe: int = 16
    ):
        self.embedder = embedder
        self.dim = embedding_dim
        self.item_ids = []
        self.item_metadata = {}

        # Build FAISS index
        if index_type == 'flat':
            self.index = faiss.IndexFlatIP(embedding_dim)
        elif index_type == 'ivf_flat':
            quantizer = faiss.IndexFlatIP(embedding_dim)
            self.index = faiss.IndexIVFFlat(
                quantizer, embedding_dim, n_clusters, faiss.METRIC_INNER_PRODUCT
            )
            self.index.nprobe = nprobe
        elif index_type == 'hnsw':
            self.index = faiss.IndexHNSWFlat(embedding_dim, 32, faiss.METRIC_INNER_PRODUCT)
        else:
            raise ValueError(f"Unknown index_type: {index_type}")

        # Flat and HNSW indexes need no training step; IVF does
        self._is_trained = index_type in ('flat', 'hnsw')

    def load_product_image(self, url: str) -> Optional[Image.Image]:
        """Load a product image from a URL with error handling."""
        try:
            resp = requests.get(url, timeout=5)
            resp.raise_for_status()
            return Image.open(BytesIO(resp.content)).convert('RGB')
        except Exception as e:
            print(f"Failed to load image from {url}: {e}")
            return None

    def index_catalog(
        self,
        catalog_df: pd.DataFrame,
        image_url_col: str = 'image_url',
        item_id_col: str = 'item_id',
        batch_size: int = 64
    ):
        """
        Build the FAISS index from a product catalog dataframe.
        catalog_df must have: item_id, image_url, and optional metadata columns.
        """
        embeddings_list = []
        valid_item_ids = []

        print(f"Indexing {len(catalog_df)} products...")

        for start in range(0, len(catalog_df), batch_size):
            batch = catalog_df.iloc[start:start + batch_size]
            images = []
            batch_ids = []

            for _, row in batch.iterrows():
                img = self.load_product_image(row[image_url_col])
                if img is not None:
                    item_id = str(row[item_id_col])
                    images.append(img)
                    batch_ids.append(item_id)
                    # Store metadata only for rows whose image loaded, so
                    # metadata stays aligned with the indexed embeddings
                    self.item_metadata[item_id] = row.to_dict()

            if images:
                batch_embeddings = self.embedder.embed_images_batch(images)
                embeddings_list.append(batch_embeddings)
                valid_item_ids.extend(batch_ids)

            if (start // batch_size) % 10 == 0:
                print(f"  Processed {min(start + batch_size, len(catalog_df))}/{len(catalog_df)}")

        if not embeddings_list:
            raise ValueError("No product images could be loaded and embedded.")

        all_embeddings = np.vstack(embeddings_list).astype(np.float32)
        self.item_ids = valid_item_ids

        # Train the IVF index if needed
        if not self._is_trained:
            print("Training FAISS index...")
            self.index.train(all_embeddings)
            self._is_trained = True

        self.index.add(all_embeddings)
        print(f"Indexed {len(self.item_ids)} products successfully.")

    def search_by_image(
        self,
        query_image: Image.Image,
        top_k: int = 20,
        category_filter: Optional[str] = None
    ) -> pd.DataFrame:
        """Find visually similar products given a query image."""
        query_embedding = self.embedder.embed_image(query_image)
        return self._search(query_embedding, top_k, category_filter)

    def search_by_text(
        self,
        query_text: str,
        top_k: int = 20,
        category_filter: Optional[str] = None
    ) -> pd.DataFrame:
        """Find products matching a text description."""
        query_embedding = self.embedder.embed_text(query_text)
        return self._search(query_embedding, top_k, category_filter)

    def search_multimodal(
        self,
        query_image: Image.Image,
        text_modifier: str,
        image_weight: float = 0.7,
        top_k: int = 20
    ) -> pd.DataFrame:
        """
        Combine image and text for composed search.
        Example: query_image shows blue jeans, text_modifier="in black"
        """
        image_embedding = self.embedder.embed_image(query_image)
        text_embedding = self.embedder.embed_text(text_modifier)

        # Weighted combination in embedding space
        combined = image_weight * image_embedding + (1 - image_weight) * text_embedding
        # Re-normalize so inner-product search behaves like cosine similarity
        combined = combined / np.linalg.norm(combined)

        return self._search(combined, top_k)

    def _search(
        self,
        query_embedding: np.ndarray,
        top_k: int,
        category_filter: Optional[str] = None
    ) -> pd.DataFrame:
        """Internal search method."""
        query = query_embedding.astype(np.float32).reshape(1, -1)
        faiss.normalize_L2(query)

        # Over-fetch candidates if filtering by category afterwards
        k_search = top_k * 5 if category_filter else top_k
        scores, indices = self.index.search(query, k_search)

        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx == -1:  # FAISS pads with -1 when fewer results exist
                continue
            item_id = self.item_ids[idx]
            metadata = self.item_metadata.get(item_id, {})
            results.append({
                'item_id': item_id,
                'similarity_score': float(score),
                **metadata
            })

        df = pd.DataFrame(results)

        if category_filter and 'category' in df.columns:
            df = df[df['category'] == category_filter]

        return df.head(top_k)


# ============================================================
# 3. Product Attribute Extraction
# ============================================================

class RetailAttributeExtractor:
    """
    Extract product attributes from images using CLIP zero-shot classification.
    """

    # Standard retail attribute taxonomies
    ATTRIBUTE_TAXONOMY = {
        'color': [
            'black', 'white', 'navy blue', 'light blue', 'red', 'pink',
            'green', 'yellow', 'orange', 'purple', 'grey', 'beige', 'brown',
            'multicolor', 'pattern'
        ],
        'style': [
            'casual', 'formal', 'sporty', 'bohemian', 'minimalist',
            'vintage', 'streetwear', 'classic', 'romantic', 'edgy'
        ],
        'pattern': [
            'solid', 'striped', 'plaid', 'floral', 'geometric',
            'animal print', 'abstract', 'polka dots'
        ],
        'season': ['spring/summer', 'fall/winter', 'all season'],
        'fit': ['slim fit', 'regular fit', 'relaxed fit', 'oversized'],
    }

    def __init__(self, embedder: CLIPEmbedder):
        self.embedder = embedder
        # Pre-compute text embeddings for each attribute value
        self._attribute_embeddings = self._precompute_attribute_embeddings()

    def _precompute_attribute_embeddings(self) -> dict:
        """Pre-compute text embeddings for all attribute values once."""
        attr_embeddings = {}
        for attr, values in self.ATTRIBUTE_TAXONOMY.items():
            prompts = [f"a {value} product" for value in values]
            attr_embeddings[attr] = {
                'values': values,
                'embeddings': self.embedder.embed_texts_batch(prompts)
            }
        return attr_embeddings

    @staticmethod
    def _softmax(similarities: np.ndarray) -> np.ndarray:
        """Softmax over CLIP similarities at logit scale 100, subtracting
        the max first for numerical stability."""
        logits = similarities * 100
        logits = logits - logits.max()
        exp = np.exp(logits)
        return exp / exp.sum()

    def extract_attributes(self, image: Image.Image) -> dict:
        """
        Extract all product attributes from an image.
        Returns dict of {attribute: {predicted, confidence, all_probs}}.
        """
        image_embedding = self.embedder.embed_image(image)
        attributes = {}

        for attr, data in self._attribute_embeddings.items():
            values = data['values']
            text_embeddings = data['embeddings']

            # Cosine similarities between the image and each attribute prompt
            similarities = np.dot(image_embedding, text_embeddings.T)
            probs = self._softmax(similarities)

            # Report the top predicted value and its probability
            best_idx = np.argmax(probs)
            attributes[attr] = {
                'predicted': values[best_idx],
                'confidence': float(probs[best_idx]),
                'all_probs': {v: float(p) for v, p in zip(values, probs)}
            }

        return attributes

    def extract_attributes_batch(
        self,
        images: List[Image.Image]
    ) -> List[dict]:
        """Extract attributes for a batch of images."""
        image_embeddings = self.embedder.embed_images_batch(images)
        results = []

        for img_emb in image_embeddings:
            attrs = {}
            for attr, data in self._attribute_embeddings.items():
                similarities = np.dot(img_emb, data['embeddings'].T)
                probs = self._softmax(similarities)
                best_idx = np.argmax(probs)
                attrs[attr] = {
                    'predicted': data['values'][best_idx],
                    'confidence': float(probs[best_idx])
                }
            results.append(attrs)

        return results


# ============================================================
# 4. Shop-the-Look Pipeline
# ============================================================

class ShopTheLookPipeline:
    """
    Given a "look" image (outfit, room scene), identify individual
    products and find purchasable similar items.
    """

    def __init__(
        self,
        embedder: CLIPEmbedder,
        visual_index: ProductVisualIndex
    ):
        self.embedder = embedder
        self.visual_index = visual_index

    def detect_regions(self, image: Image.Image) -> List[Image.Image]:
        """
        Detect product regions in a scene image.
        In production: use a detector (YOLO, Detectron2) trained on fashion/furniture.
        Here we simulate with simple grid crops for illustration.
        """
        w, h = image.size
        # Simulate region detection with overlapping crops
        regions = []
        crop_sizes = [(0.4, 0.4), (0.6, 0.6)]
        offsets = [(0.0, 0.0), (0.3, 0.3), (0.6, 0.0), (0.0, 0.6)]

        for (cw_frac, ch_frac) in crop_sizes:
            cw, ch = int(w * cw_frac), int(h * ch_frac)
            for (ox, oy) in offsets:
                left = int(ox * w)
                top = int(oy * h)
                right = min(left + cw, w)
                bottom = min(top + ch, h)
                if right > left and bottom > top:
                    regions.append(image.crop((left, top, right, bottom)))

        return regions

    def process_look(
        self,
        look_image: Image.Image,
        top_k_per_region: int = 5
    ) -> dict:
        """
        Process a look image: detect regions and find matching products.
        """
        regions = self.detect_regions(look_image)
        look_results = []

        for i, region in enumerate(regions):
            similar_products = self.visual_index.search_by_image(
                region,
                top_k=top_k_per_region
            )
            look_results.append({
                'region_id': i,
                'similar_products': similar_products.to_dict('records')
            })

        return {
            'num_regions_detected': len(regions),
            'product_matches': look_results
        }

Architecture Diagrams

(Diagrams not reproduced in this text version: Visual Search System Architecture; Multimodal Search Flow.)


Production Engineering Notes

Image Quality Challenges

Product images in a retailer's catalog are carefully curated: white backgrounds, professional lighting, multiple angles. User-provided query images are the opposite: taken on phones, in dim restaurants, at awkward angles, with cluttered backgrounds.

Background removal: Before embedding query images, remove the background using a segmentation model (SAM, rembg, or a lighter custom model). This prevents the background from polluting the embedding - a product photo in a green park and the same product on a white background should have similar embeddings.
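
A sketch of query-image cleanup using the open-source rembg package (one of several options; the compositing-onto-white step assumes your catalog uses white backgrounds):

from rembg import remove  # pip install rembg
from PIL import Image

def clean_query_image(image: Image.Image) -> Image.Image:
    """Strip the background and composite the subject onto white,
    matching a white-background catalog style before embedding."""
    cutout = remove(image)  # returns an RGBA image with transparent background
    white = Image.new('RGBA', cutout.size, (255, 255, 255, 255))
    return Image.alpha_composite(white, cutout).convert('RGB')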

Angle normalization: For some product categories (furniture, shoes from the side), angle matters enormously. Train the index with multiple angles of each product when available. At query time, try multiple crops and take the best retrieval across all query crops.

Lighting correction: simple preprocessing such as histogram equalization or CLAHE (Contrast Limited Adaptive Histogram Equalization) before embedding reduces sensitivity to lighting conditions.
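
One way to apply CLAHE with OpenCV - operating on the lightness channel in LAB space so hue and saturation are left untouched:

import cv2
import numpy as np
from PIL import Image

def apply_clahe(image: Image.Image) -> Image.Image:
    """Equalize local contrast on the L channel only, preserving color."""
    lab = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2LAB)
    l_chan, a_chan, b_chan = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    merged = cv2.merge((clahe.apply(l_chan), a_chan, b_chan))
    return Image.fromarray(cv2.cvtColor(merged, cv2.COLOR_LAB2RGB))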

Index Updates for New Products

New products are added to the catalog continuously - thousands per day at large retailers. The FAISS index must be updated.

HNSW for online insertion: IVF indexes will accept new vectors without retraining, but recall degrades as the catalog drifts away from the centroids learned at training time. HNSW supports dynamic insertion with no retraining and no such degradation. At the cost of somewhat higher memory usage, HNSW is the right choice when the catalog changes frequently.
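
A sketch of incremental insertion with an HNSW index - note there is no train() step, so new products can be appended as they arrive (the vectors below are random placeholders):

import faiss
import numpy as np

d = 512
index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)  # 32 links per node
index.hnsw.efConstruction = 200   # build-time quality/speed knob

# Append new product embeddings at any time - no retraining required
new_embeddings = np.random.rand(1000, d).astype(np.float32)
faiss.normalize_L2(new_embeddings)
index.add(new_embeddings)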

Versioned indexes: Maintain two index versions. The "live" index serves production traffic. An "update" index receives new products and is rebuilt nightly. Atomic swap at a low-traffic time window. No downtime.

Embedding recomputation triggers: When the CLIP model is updated (fine-tuned on new data), all product embeddings must be recomputed. This is a batch job, not a real-time operation. Budget for it in your compute capacity planning.


Common Mistakes

:::danger Embedding Products with Lifestyle vs. Catalog Images Many retailers have two types of product images: clean catalog shots (product on white background) and lifestyle shots (product worn by a model in a scene). If you index lifestyle images and users search with catalog images (or vice versa), the embedding space shifts. The model learns the lifestyle context (beach, office, nightclub) as part of the product embedding, not just the product attributes. Build separate indexes for catalog and lifestyle images, or train a domain-adaptation layer that aligns the two distributions. :::

:::danger Ignoring Category Filtering in Retrieval A visual search for a red dress will return results that look visually similar - which might include red shoes, red bags, or red scarves if you do not filter by category. The CLIP embedding space is shared across all product types. Always apply a category pre-filter (if the user is on the dresses category page) or a post-filter (after retrieval, only return items in relevant categories). This is a product decision, not a model decision - but forgetting it destroys the user experience. :::

:::warning Evaluating Visual Search with Offline Metrics Only Standard retrieval metrics (Recall@K, NDCG) measure whether ground-truth similar items appear in results. But "ground truth similar items" is subjective and your labeled dataset may not reflect actual user preferences. Supplement offline evaluation with: (1) human evaluation using preference pairs ("which result is more similar to the query?"); (2) online A/B testing measuring click-through rate and conversion from visual search results. An improvement in Recall@10 does not always translate to better user experience. :::


Interview Questions and Answers

Q1: How does CLIP enable zero-shot visual search without training on your specific product catalog?

A: CLIP (Contrastive Language-Image Pretraining) was trained on 400 million image-text pairs from the internet, learning to align images and their descriptions in a shared 512-dimensional embedding space. The key property for retail: CLIP learned rich visual semantics - it understands "red dress," "leather boot," "floral pattern" - because those descriptions appeared paired with matching images in its training data. For zero-shot visual search, you embed a query image and find catalog items with nearby embeddings. Since CLIP embeddings encode semantic content (not just pixel patterns), a user photo of a dress in a restaurant will be close to catalog images of similar dresses, despite different backgrounds and lighting. "Zero-shot" means you do not need to fine-tune on your specific catalog - CLIP's general visual-semantic understanding transfers. In practice, fine-tuning CLIP on domain-specific data (fashion, furniture, electronics) with contrastive loss on (anchor, positive, negative) triplets improves retrieval precision by 10-20% over the base model.

Q2: Explain how FAISS IVF indexing works and the tradeoff between nprobe and retrieval latency.

A: FAISS IVF (Inverted File Index) approximates nearest neighbor search by partitioning the embedding space into k Voronoi cells using k-means clustering. Each item is assigned to its nearest cluster centroid and stored in that cluster's inverted list. At query time, instead of searching all items, FAISS identifies the nprobe nearest cluster centroids to the query, then searches only the items in those clusters. If you have 1 million items and 1024 clusters, each cluster holds about 1000 items. With nprobe=10, you search about 10,000 items instead of 1,000,000 - 100x speedup. The tradeoff: larger nprobe means more clusters searched, higher recall (more true nearest neighbors found), but higher latency. nprobe=1 is fastest but lowest quality; nprobe=1024 (all clusters) is equivalent to exact search. In production, tune nprobe on a held-out set of query-ground-truth pairs to find the point on the recall-latency curve that meets your SLA. Typically nprobe=16-32 achieves 90-95% recall at 5-15ms latency for 1M-item indexes.

Q3: What is the "semantic gap" problem in visual search and how do modern methods address it?

A: The semantic gap is the disconnect between low-level image features (pixel values, colors, textures) and high-level semantic concepts (product style, occasion, aesthetic). Early CBIR systems matched color histograms - a blue sky and blue jeans would be considered similar. The semantic gap caused completely irrelevant results despite low visual distance in feature space. Modern methods address the gap in two ways. First, representation learning: train embeddings on supervised tasks (classification, retrieval) rather than reconstructing pixels. A model trained to classify "formal dress vs casual dress" learns an embedding space where formal/casual is an axis of variation, bridging the semantic gap for that attribute. Second, contrastive learning with semantically rich training pairs: CLIP's training on image-text pairs uses text descriptions as semantic supervisors. Since text is high-level semantic language, the alignment objective forces the image encoder to produce embeddings that correspond to semantic meaning, not pixel patterns. The result: two images of the same product under radically different lighting conditions are close in CLIP embedding space because both are semantically described by the same caption.

Q4: Describe how you would evaluate the quality of a visual search system in production.

A: Evaluation requires both offline and online metrics, measuring different things. Offline: define a ground truth set of (query image, relevant result images) pairs - can be sourced from human annotators or from click data (if a user clicked result X after querying image Y, that's a weak positive). Compute Recall@K (fraction of relevant results appearing in top-K), Precision@K, and NDCG. Offline metrics are cheap and fast but suffer from label quality issues. Online: A/B test against baseline (typically keyword search or browsing). Primary metrics: click-through rate from visual search results, add-to-cart rate, conversion rate, revenue per visual search session. Also measure: session depth (do users explore more items after visual search?), discovery rate (do they find items they would not have found via text search?). Qualitative: user studies with "find this style" tasks, measuring task completion time and user satisfaction. A common mistake is over-optimizing offline metrics at the expense of diversity - a system that shows 20 nearly-identical products has great Recall@20 but terrible user experience.

Q5: How would you build the "Shop the Look" feature where a user uploads an outfit photo and the system finds all individual purchasable items?

A: Shop the Look requires several components working in sequence. First, region detection: use an object detector fine-tuned on fashion items (or home goods for interior design use cases) to detect individual products in the scene image. YOLO or Detectron2 fine-tuned on DeepFashion2 or similar datasets work well. Each detected bounding box is a candidate product region. Second, region classification: classify each region into a product category (top, bottom, shoes, bag, accessory) to enable category-specific retrieval. Third, region embedding: crop each detected region, background-remove it, and embed with CLIP. Fourth, per-region retrieval: search each region embedding against the product catalog, filtered to the detected category. Return top-K results per region. Fifth, deduplication and ranking: if multiple regions return the same item, boost its score. Apply price and availability filters. The hard engineering challenge is the region detector - it requires labeled training data (bounding boxes around individual products in scene images). Use semi-automatic labeling: start with an existing detector on easy scenes, manually correct the hard cases, use active learning to prioritize the most informative images for annotation.
