Multimodal Embeddings
Reading time: 25 min | Relevance: AI Engineer, ML Engineer, Research Engineer
The Search That Changed Everything
January 2021. OpenAI releases CLIP (Contrastive Language-Image Pre-Training). Within weeks, the ML community is building things nobody expected: you can search a photo library with natural language ("golden retriever playing in snow"), classify images into categories the model was never trained on by comparing them against prompts ("a photo of a cat" vs "a photo of a dog"), and score how well any candidate caption fits an image.
The trick: CLIP embeds both images and text in the same vector space. An image of a golden retriever and the text "a photo of a golden retriever" land near each other in this space. An image of a cat and the text "a photo of a dog" land far apart. This shared embedding space enables retrieval across modalities without any task-specific training.
CLIP wasn't just a clever research paper - it fundamentally changed how we think about embeddings. Before CLIP, embeddings were unimodal: text models embedded text, image models embedded images, and never the twain shall meet. After CLIP, the question became: can we put everything in the same space? Audio, video, 3D shapes, medical scans, sensor data? If yes, we can search across all of them with natural language.
This lesson covers the key multimodal embedding architectures, their training approaches, practical applications, and how to use them in production systems.
Historical Context
January 2021 - OpenAI publishes CLIP ("Learning Transferable Visual Models From Natural Language Supervision," Radford et al.). Trains on 400 million image-text pairs from the web. Achieves remarkable zero-shot performance on ImageNet.
2021-2022 - OpenCLIP (LAION-AI) reproduces and extends CLIP using open datasets. LAION-5B provides 5 billion image-text pairs for training.
2023 - Meta publishes ImageBind ("ImageBind: One Embedding Space To Bind Them All," Girdhar et al.), extending the shared embedding space to 6 modalities: text, image, audio, depth, thermal, and IMU data.
2023 - Google publishes SigLIP ("Sigmoid Loss for Language Image Pre-Training," Zhai et al.), replacing CLIP's softmax contrastive loss with sigmoid loss. Outperforms CLIP at smaller batch sizes and is better for retrieval tasks.
2024 - ColPali ("Efficient Document Retrieval with Vision Language Models") emerges as a breakthrough for document retrieval, embedding document pages as images rather than extracting text - dramatically improving retrieval for documents with charts, tables, and mixed layouts.
CLIP: The Breakthrough Architecture
CLIP (Radford et al. 2021) is conceptually simple but historically important:
Architecture
Two encoders (image and text) produce embeddings in the same dimensional space. Training: contrastive loss that pulls matching image-text pairs together and pushes non-matching pairs apart.
The training data
CLIP was trained on 400 million (image, text) pairs scraped from the internet. Each pair consists of an image and its associated alt text, caption, or surrounding text. This "natural supervision" - real-world text that humans wrote to describe images - provides rich training signal without manual annotation.
The scale (400M pairs) is what makes CLIP work. Earlier attempts at cross-modal learning used smaller, cleaner datasets. CLIP showed that scale + noisy web data + simple contrastive objective beats curated datasets with complex objectives.
CLIP's training objective
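The contrastive loss for a batch of N image-text pairs:
$$\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ij}/\tau)} + \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ji}/\tau)} \right]$$
where $s_{ij}$ is cosine similarity between image embedding $\mathbf{x}_i$ and text embedding $\mathbf{y}_j$, and $\tau$ is a learned temperature parameter. This is symmetric InfoNCE - applied in both image→text and text→image directions.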
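To make the objective concrete, here is a minimal NumPy sketch of the symmetric loss. It is an illustration, not CLIP's training code: the function names are hypothetical, the embeddings are random stand-ins, and the temperature is held fixed at 0.07 (the value the CLIP paper uses to initialize it) instead of being learned.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(image_emb: np.ndarray, text_emb: np.ndarray,
                       temperature: float = 0.07) -> float:
    """Row i of image_emb and row i of text_emb are assumed to be a matching pair."""
    sims = l2_normalize(image_emb) @ l2_normalize(text_emb).T  # (N, N) cosine similarities
    logits = sims / temperature
    diag = np.arange(len(logits))
    # Image -> text: each image must pick its own caption among all captions (row-wise softmax).
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text -> image: each caption must pick its own image among all images (column-wise softmax).
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return float((loss_i2t + loss_t2i) / 2)


rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
loss_aligned = symmetric_info_nce(emb, emb)                       # perfectly matched pairs
loss_random = symmetric_info_nce(emb, rng.normal(size=(8, 32)))   # unrelated pairs
```

The loss is close to zero when every image embedding coincides with its caption embedding and much larger when the two sets are unrelated, which is the signal that drives the encoders toward a shared space.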
```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

# Using CLIP via Hugging Face transformers
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
model.eval()


def embed_images(images: list) -> torch.Tensor:
    """Embed PIL images using CLIP. Returns L2-normalized (n_images, dim) embeddings."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        image_features = model.get_image_features(**inputs)
    return F.normalize(image_features, dim=-1)


def embed_texts_clip(texts: list[str]) -> torch.Tensor:
    """Embed texts using CLIP text encoder. Returns L2-normalized (n_texts, dim) embeddings."""
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        text_features = model.get_text_features(**inputs)
    return F.normalize(text_features, dim=-1)


def image_text_similarity(image, text: str) -> float:
    """Compute cosine similarity between an image and a text description."""
    img_emb = embed_images([image])
    txt_emb = embed_texts_clip([text])
    return float((img_emb @ txt_emb.T)[0, 0])


# Zero-shot image classification
def zero_shot_classify(image, candidate_labels: list[str]) -> dict:
    """
    Classify an image using CLIP's zero-shot capabilities.
    No training required - just image-text similarity.
    """
    # Format labels as natural language descriptions
    label_texts = [f"a photo of a {label}" for label in candidate_labels]
    img_emb = embed_images([image])
    txt_embs = embed_texts_clip(label_texts)
    sims = (img_emb @ txt_embs.T)[0]  # (n_labels,) cosine similarities
    # Scale by CLIP's learned temperature before the softmax. Raw cosine
    # similarities lie in a narrow range, so softmax over them is nearly uniform.
    logit_scale = model.logit_scale.exp().item()
    probs = F.softmax(logit_scale * sims, dim=0)
    return {
        label: float(prob)
        for label, prob in zip(candidate_labels, probs)
    }
```
CLIP applications
Zero-shot classification: Compare image embedding to text descriptions of each class. No fine-tuning required. CLIP achieves 76.2% accuracy on ImageNet zero-shot - the same as ResNet-50 trained on ImageNet with full supervision.
Image retrieval with text queries: Embed a text query and search an image index by cosine similarity. "Show me photos of red flowers" returns images of red flowers.
Text retrieval with image queries: Embed an image and search a text corpus. Upload a photo of a product and find matching product descriptions.
Image captioning (partially): CLIP doesn't generate text, but its image embeddings can be combined with language models for captioning.
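Both retrieval directions described above reduce to the same operation: normalize, take dot products, keep the top results. The sketch below shows that step in isolation with NumPy. The function name `top_k_by_cosine` is hypothetical, and the random vectors are stand-ins for embeddings that would normally come from the image and text encoders.

```python
import numpy as np


def top_k_by_cosine(query_emb: np.ndarray, index_embs: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k index rows most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    m = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    scores = m @ q                    # cosine similarity to every indexed item
    order = np.argsort(-scores)[:k]   # highest similarity first
    return order, scores[order]


# Stand-in data: 100 "image" embeddings and a "text" query that lies close to image 42.
rng = np.random.default_rng(0)
image_index = rng.normal(size=(100, 16))
text_query = image_index[42] + 0.05 * rng.normal(size=16)

top_ids, top_scores = top_k_by_cosine(text_query, image_index, k=3)
```

Swapping the roles (an image embedding as the query, text embeddings as the index) gives text retrieval with image queries without any other change. For large collections the brute-force matrix product is usually replaced by an approximate nearest-neighbor index.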
OpenCLIP: Open-Source Reproduction
OpenCLIP (LAION-AI, 2022) is an open-source implementation of CLIP trained on publicly available datasets. OpenAI released CLIP's weights but not its training data, so OpenCLIP is what makes the full training recipe reproducible for research and commercial use.
Key datasets used:
- LAION-400M: 400M English image-text pairs (WebImageText scale replica)
- LAION-5B: 5 billion image-text pairs across multiple languages
- DataComp: A community benchmark for training image-text models
OpenCLIP models are available via Hugging Face and the open_clip library:
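```python
import open_clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load OpenCLIP model
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-g-14",  # Large ViT model; this is the name open_clip pairs with the tag below
    pretrained="laion2b_s34b_b88k",
)
tokenizer = open_clip.get_tokenizer("ViT-g-14")
model = model.to(device).eval()


def embed_image_openclip(image: Image.Image) -> torch.Tensor:
    """Return an L2-normalized (1, dim) image embedding."""
    image_tensor = preprocess(image).unsqueeze(0).to(device)
    # Mixed precision is only enabled when running on a GPU
    with torch.no_grad(), torch.autocast(device_type=device, enabled=(device == "cuda")):
        image_features = model.encode_image(image_tensor)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    return image_features


def embed_text_openclip(text: str) -> torch.Tensor:
    """Return an L2-normalized (1, dim) text embedding."""
    tokens = tokenizer([text]).to(device)
    with torch.no_grad(), torch.autocast(device_type=device, enabled=(device == "cuda")):
        text_features = model.encode_text(tokens)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    return text_features
```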
SigLIP: Better for Retrieval
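Zhai et al. (2023) published SigLIP ("Sigmoid Loss for Language Image Pre-Training"), which replaces CLIP's softmax contrastive loss with sigmoid binary cross-entropy applied independently to each image-text pair:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \log \sigma\big( z_{ij} \, (t \, \mathbf{x}_i \cdot \mathbf{y}_j + b) \big)$$
where $\mathbf{x}_i$ and $\mathbf{y}_j$ are the normalized image and text embeddings, $t$ and $b$ are a learned temperature and bias, and $z_{ij} = 1$ if $(i, j)$ is a matching pair and $z_{ij} = -1$ otherwise.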
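The pairwise sigmoid loss is short enough to write out directly. The following NumPy sketch is illustrative only: `siglip_loss` is a hypothetical name, the embeddings are random stand-ins, and `t = 10`, `b = -10` are fixed at the initial values described in the SigLIP paper instead of being learned.

```python
import numpy as np


def siglip_loss(image_emb: np.ndarray, text_emb: np.ndarray,
                t: float = 10.0, b: float = -10.0) -> float:
    """Pairwise sigmoid loss; row i of each input is assumed to be a matching pair."""
    x = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    y = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = t * (x @ y.T) + b               # (N, N), one logit per image-text pair
    z = 2.0 * np.eye(len(logits)) - 1.0      # +1 on the diagonal (matches), -1 elsewhere
    # log(sigmoid(u)) = -log(1 + exp(-u)), computed stably with logaddexp
    log_sigmoid = -np.logaddexp(0.0, -z * logits)
    return float(-log_sigmoid.sum() / len(logits))


rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
loss_aligned = siglip_loss(emb, emb)                       # matched pairs on the diagonal
loss_random = siglip_loss(emb, rng.normal(size=(8, 32)))   # unrelated pairs
```

Note that no term in the sum is normalized by the other pairs in the batch: each entry of the N x N grid is its own yes/no classification, which is the property that distinguishes this loss from the softmax version.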
Why sigmoid beats softmax for retrieval
CLIP's softmax: The loss normalizes across the entire batch. A pair's gradient depends on all other pairs in the batch. This requires very large batch sizes (CLIP used 32,768) for stable training - at smaller batches, there aren't enough negatives.
SigLIP's sigmoid: Each pair is treated independently as a binary classification (is this pair matching: yes/no?). No normalization across the batch. Works well at smaller batch sizes and scales better to large datasets.
Retrieval quality: SigLIP models consistently outperform CLIP on retrieval benchmarks (image-to-text retrieval, text-to-image retrieval) because the sigmoid loss more directly models the retrieval task.
```python
import torch
from transformers import AutoModel, AutoProcessor

# SigLIP via Hugging Face
siglip_model = AutoModel.from_pretrained("google/siglip-large-patch16-384")
siglip_processor = AutoProcessor.from_pretrained("google/siglip-large-patch16-384")
siglip_model.eval()


def compute_siglip_similarity(image, text: str) -> float:
    """SigLIP computes image-text similarity with a learned temperature and bias."""
    inputs = siglip_processor(
        text=[text],
        images=[image],
        padding="max_length",  # SigLIP was trained with max-length text padding
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = siglip_model(**inputs)
    # SigLIP logits are already temperature-scaled
    logits_per_image = outputs.logits_per_image
    # Apply sigmoid for probability (not softmax!)
    probs = torch.sigmoid(logits_per_image)
    return float(probs[0, 0])
```
ImageBind: Six Modalities in One Space
Meta's ImageBind (Girdhar et al. 2023) extends the shared embedding idea to six modalities:
- Images (RGB photos)
- Text (natural language)
- Audio (waveforms)
- Depth (depth sensor data)
- Thermal (infrared images)
- IMU (accelerometer/gyroscope data)
The key insight: all six modalities can be mapped to the same embedding space by training each modality against images as a "binding modality." Images are ubiquitous and co-occur with every other modality: web images come with alt text, video pairs frames with audio, and depth and thermal sensors are commonly paired with RGB cameras.
Because all modalities share the same embedding space, you can do cross-modal retrieval between any pair - not just image-text. "Find audio clips that sound like what this image looks like" (sound of jungle + photo of jungle). "Find images that match this accelerometer pattern" (running motion → photos of running).
CLAP: Contrastive Language-Audio Pretraining
CLAP (Elizalde et al. 2022; LAION's open variant by Wu et al. 2023) extends the CLIP framework to audio-text pairs:
- Audio encoder: Uses CNN14 or HTSAT (Hierarchical Token-Semantic Audio Transformer)
- Text encoder: RoBERTa or similar
- Training data: Pairs of audio clips with textual descriptions from AudioCaps, Clotho, FreeSound
Applications:
- Text-to-audio retrieval: "Find me ambient sounds of rain on a tin roof"
- Zero-shot audio classification: Classify environmental sounds into categories
- Audio-text alignment: Verify that a music track matches a mood description
```python
import librosa
import torch
import torch.nn.functional as F
from transformers import AutoProcessor, ClapModel

clap_model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
clap_processor = AutoProcessor.from_pretrained("laion/clap-htsat-unfused")
clap_model.eval()


def embed_audio(audio_file: str, sample_rate: int = 48000) -> torch.Tensor:
    """Embed audio using CLAP. Returns an L2-normalized (1, dim) embedding."""
    # librosa resamples to the requested rate and returns a mono waveform
    waveform, _ = librosa.load(audio_file, sr=sample_rate)
    inputs = clap_processor(
        audios=waveform,
        sampling_rate=sample_rate,
        return_tensors="pt",
    )
    with torch.no_grad():
        audio_features = clap_model.get_audio_features(**inputs)
    return F.normalize(audio_features, dim=-1)


def embed_text_clap(text: str) -> torch.Tensor:
    """Embed text using CLAP text encoder. Returns an L2-normalized (1, dim) embedding."""
    inputs = clap_processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_features = clap_model.get_text_features(**inputs)
    return F.normalize(text_features, dim=-1)


def audio_text_similarity(audio_file: str, description: str) -> float:
    """Cosine similarity between an audio clip and a text description."""
    audio_emb = embed_audio(audio_file)
    text_emb = embed_text_clap(description)
    return float((audio_emb @ text_emb.T)[0, 0])
```
ColPali: Document Retrieval with Visual Embeddings
ColPali (Faysse et al. 2024) is a breakthrough for document retrieval. Traditional document retrieval extracts text from PDFs (OCR) then embeds the text. This fails for:
- Documents with complex layouts (multi-column, tables)
- Documents where meaning is in charts and graphs
- Documents where layout and visual arrangement conveys information (infographics)
ColPali instead treats each document page as an image and uses a vision-language model (PaliGemma) to produce multi-vector embeddings - one embedding vector per patch of the image.
ColPali's MaxSim scoring
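Instead of a single similarity score between query and document, ColPali uses "late interaction" (inspired by ColBERT):
$$S(q, d) = \sum_{i=1}^{N_q} \max_{1 \le j \le N_d} \mathbf{q}_i \cdot \mathbf{d}_j$$
where $\mathbf{q}_i$ is the embedding of query token $i$ and $\mathbf{d}_j$ is the embedding of document patch $j$. For each query token's embedding, find the document patch that's most similar to it. Sum these "maximum similarities" across all query tokens. This allows precise localization of relevant content within a document page.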
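A tiny hand-checkable example helps show what MaxSim rewards. The vectors below are made up for illustration (two query tokens, 2-D unit embeddings), and `maxsim` is a hypothetical helper, not part of any ColPali library.

```python
import numpy as np


def maxsim(query_tokens: np.ndarray, doc_patches: np.ndarray) -> float:
    """Sum over query tokens of the best dot product with any document patch."""
    sims = query_tokens @ doc_patches.T   # (n_query_tokens, n_doc_patches)
    return float(sims.max(axis=1).sum())


# Two query tokens pointing in orthogonal directions.
query = np.array([[1.0, 0.0],
                  [0.0, 1.0]])

# Page A has one patch that matches each query token exactly, plus an unrelated patch.
page_a = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [-1.0, 0.0]])

# Page B has a single patch that is only partly aligned with both tokens.
page_b = np.array([[0.6, 0.8]])

score_a = maxsim(query, page_a)   # 1.0 + 1.0
score_b = maxsim(query, page_b)   # 0.6 + 0.8
```

Page A wins because each query token finds its own well-matching patch, even though the page also contains an irrelevant one. A single pooled vector per page would blur those patches together, which is the loss of detail that late interaction avoids.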
```python
import torch
import torch.nn.functional as F
from transformers import ColPaliForRetrieval, ColPaliProcessor


# ColPali implementation (simplified)
# Assumption: a transformers release that ships ColPaliForRetrieval and the
# converted "-hf" checkpoint. The colpali-engine package is the alternative route.
class ColPaliRetriever:
    """Document retrieval using visual embeddings with late interaction."""

    def __init__(self, model_name: str = "vidore/colpali-v1.2-hf"):
        self.processor = ColPaliProcessor.from_pretrained(model_name)
        self.model = ColPaliForRetrieval.from_pretrained(model_name)
        self.model.eval()

    def embed_document_page(self, page_image) -> torch.Tensor:
        """
        Embed a document page as multiple patch embeddings.
        Returns (n_patches, dim) tensor.
        """
        inputs = self.processor(images=[page_image], return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Projected multi-vector output: one embedding per input position
        return outputs.embeddings[0]  # (n_patches, dim)

    def embed_query(self, query: str) -> torch.Tensor:
        """
        Embed a text query as multiple token embeddings.
        Returns (n_tokens, dim) tensor.
        """
        inputs = self.processor(text=[query], return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
        return outputs.embeddings[0]  # (n_tokens, dim)

    def maxsim_score(
        self,
        query_embeddings: torch.Tensor,  # (n_query_tokens, dim)
        doc_embeddings: torch.Tensor,    # (n_doc_patches, dim)
    ) -> float:
        """
        Late interaction MaxSim scoring.
        For each query token, find the most similar document patch.
        Sum across query tokens.
        """
        # Normalize (a no-op if the embeddings are already unit length)
        q = F.normalize(query_embeddings, dim=-1)
        d = F.normalize(doc_embeddings, dim=-1)
        # All-pairs similarity: (n_query, n_doc)
        sims = q @ d.T
        # MaxSim: for each query token, take max similarity with any doc patch
        max_sims = sims.max(dim=-1).values  # (n_query,)
        return float(max_sims.sum())
```
Why ColPali matters for production RAG
Traditional text-extraction RAG has serious failures:
- Complex table layouts get garbled by OCR
- Charts and graphs lose their meaning when converted to text descriptions
- Document structure (headers, footnotes, sidebars) is often lost
ColPali's visual approach handles all of these naturally - a chart is just a set of image patches, and the model can learn which query terms correspond to which visual patterns.
Early benchmarks show ColPali outperforms text-extraction RAG by 20-30 points on DocVQA (Document Visual Q&A) and similar benchmarks.
Production Architecture: Multimodal Search
Here's a production multimodal search architecture combining text and image embeddings:
```python
from dataclasses import dataclass, field
from typing import Union

import faiss
import numpy as np
from PIL import Image


@dataclass
class MultimodalDocument:
    doc_id: str
    text_content: str | None = None
    image: Image.Image | None = None
    metadata: dict = field(default_factory=dict)  # avoid a shared mutable default


class MultimodalSearchIndex:
    """
    Production multimodal search combining text and image embeddings.
    Supports text queries and image queries.
    """

    def __init__(self, text_model, clip_model, text_weight: float = 0.5):
        self.text_model = text_model    # For text document embedding
        self.clip_model = clip_model    # For image and cross-modal embedding
        self.text_weight = text_weight  # Blend ratio for hybrid search
        self.image_weight = 1 - text_weight
        self.text_index = None   # FAISS index for text embeddings
        self.image_index = None  # FAISS index for image embeddings
        self.docs = []

    @staticmethod
    def _build_index(embs: list) -> faiss.IndexFlatIP:
        """Stack embeddings into a FAISS index; missing modalities become zero vectors."""
        # Take the dimension from a real embedding instead of hard-coding it
        # (assumes at least one document has this modality).
        dim = next(e.shape[0] for e in embs if e is not None)
        matrix = np.stack([e if e is not None else np.zeros(dim) for e in embs])
        index = faiss.IndexFlatIP(dim)
        index.add(matrix.astype(np.float32))
        return index

    def _all_scores(self, index, query_emb: np.ndarray) -> np.ndarray:
        """Score every document, returned in document order (not rank order)."""
        n = len(self.docs)
        scores, ids = index.search(query_emb.reshape(1, -1).astype(np.float32), n)
        ordered = np.zeros(n, dtype=np.float32)
        ordered[ids[0]] = scores[0]  # FAISS returns results sorted by score
        return ordered

    def index_documents(self, documents: list[MultimodalDocument]):
        """Index a mixed collection of text and image documents."""
        self.docs = list(documents)
        text_embs, image_embs = [], []
        for doc in self.docs:
            text_embs.append(
                self.text_model.encode([doc.text_content], normalize_embeddings=True)[0]
                if doc.text_content else None
            )
            # embed_images / embed_texts_clip are the CLIP helpers defined earlier
            image_embs.append(
                embed_images([doc.image])[0].numpy() if doc.image is not None else None
            )
        self.text_index = self._build_index(text_embs)
        self.image_index = self._build_index(image_embs)

    def search(
        self,
        query: Union[str, Image.Image],
        k: int = 10,
    ) -> list[dict]:
        """
        Search with text or image query.
        Uses hybrid scoring: text similarity + image-text similarity.
        """
        if isinstance(query, str):
            # Text query → search both text and image (via CLIP text encoder)
            text_query_emb = self.text_model.encode([query], normalize_embeddings=True)[0]
            clip_text_emb = embed_texts_clip([query])[0].numpy()
            text_sims = self._all_scores(self.text_index, text_query_emb)
            image_sims = self._all_scores(self.image_index, clip_text_emb)
        elif isinstance(query, Image.Image):
            # Image query → search via CLIP image encoder
            clip_image_emb = embed_images([query])[0].numpy()
            text_sims = np.zeros(len(self.docs), dtype=np.float32)
            image_sims = self._all_scores(self.image_index, clip_image_emb)
        else:
            raise TypeError("query must be a str or a PIL image")
        # Hybrid score
        final_sims = (
            self.text_weight * text_sims +
            self.image_weight * image_sims
        )
        top_k_indices = np.argsort(-final_sims)[:k]
        return [
            {"doc_id": self.docs[i].doc_id, "score": float(final_sims[i])}
            for i in top_k_indices
        ]
```
Common Mistakes
:::danger Using CLIP for dense text retrieval
CLIP's text encoder is optimized for image-text matching, not text-to-text retrieval. Its text representations are significantly weaker for semantic search between text documents than SBERT or E5. Use CLIP for image-text tasks; use dedicated text embedding models (BGE, E5, SBERT) for text-to-text retrieval.
:::
:::warning Not normalizing before cross-modal comparison
CLIP and SigLIP produce embeddings that should be L2-normalized before computing cosine similarity. Both models' losses operate on normalized embeddings, so unnormalized embeddings produce incorrect similarity scores. Always normalize image and text embeddings before comparison.
:::
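The effect of skipping normalization is easy to reproduce with made-up 2-D vectors. In this sketch the values and variable names are purely illustrative: one candidate points almost exactly in the query's direction but has a small norm, the other is 45 degrees off but has a large norm.

```python
import numpy as np

query = np.array([1.0, 0.0])
candidates = np.array([
    [0.9, 0.1],   # index 0: nearly the same direction as the query, small norm
    [3.0, 3.0],   # index 1: 45 degrees away from the query, large norm
])

# Raw dot product: vector length leaks into the score.
raw_scores = candidates @ query
raw_best = int(np.argmax(raw_scores))

# Cosine similarity: L2-normalize both sides first, so only direction counts.
unit_query = query / np.linalg.norm(query)
unit_candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
cosine_scores = unit_candidates @ unit_query
cosine_best = int(np.argmax(cosine_scores))
```

The raw dot product ranks the long, poorly aligned vector first, while cosine similarity ranks the well-aligned one first. The same reordering can happen silently in a vector index that stores unnormalized embeddings with an inner-product metric.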
:::warning Expecting zero-shot CLIP to work for all image types
CLIP was trained on natural photos from the internet. It performs much worse on medical images, satellite imagery, scientific charts, and other specialized image types. For specialized image domains, consider fine-tuning CLIP on domain-specific image-text pairs or using domain-specific VLMs.
:::
:::tip Use SigLIP instead of CLIP for new retrieval applications
SigLIP consistently outperforms CLIP on image retrieval benchmarks while being easier to train. For any new project requiring image-text embedding, prefer google/siglip-large-patch16-384 over OpenAI CLIP. OpenCLIP's ViT-G-14 also outperforms the original CLIP in most benchmarks.
:::
Interview Q&A
Q1: How does CLIP work and what makes it powerful?
CLIP trains two encoders - an image encoder and a text encoder - to produce embeddings in the same vector space. Training data is 400M image-text pairs from the web, with a contrastive objective: matching (image, text) pairs are pulled together; non-matching pairs are pushed apart. The power comes from the shared embedding space: after training, you can compare any image embedding to any text embedding and get a meaningful similarity score. This enables zero-shot image classification (compare image to text descriptions of each class), cross-modal retrieval (search images with text, or text with images), and transfer to new visual concepts without fine-tuning. CLIP achieves 76.2% accuracy on ImageNet zero-shot - equal to ResNet-50 trained with full supervision.
Q2: What is SigLIP and how does it differ from CLIP?
SigLIP replaces CLIP's softmax contrastive loss with sigmoid binary cross-entropy. In CLIP, the loss for each pair is computed relative to all other pairs in the batch (softmax normalization requires seeing all pairs simultaneously). In SigLIP, each pair is classified independently as matching or not-matching. This means SigLIP trains stably at smaller batch sizes (CLIP needed 32,768 samples), scales better to large datasets, and produces better retrieval performance because the sigmoid loss more directly models whether a query matches a document (yes/no) rather than whether it matches better than other documents in the current batch.
Q3: What is ImageBind and why is it significant?
ImageBind embeds six modalities - image, text, audio, depth, thermal, and IMU - in the same vector space. The key innovation: train each modality against images (the "binding modality") rather than against all other modalities. Since images co-occur naturally with text (captions), audio (video soundtracks), depth (RGB-D cameras), and other modalities, this training is possible with naturally collected data. The result: cross-modal retrieval between any pair of modalities without specific pairwise training data. You can find audio clips that match an image, find images that match an accelerometer pattern, etc.
Q4: What is ColPali and when would you use it for document retrieval?
ColPali treats document pages as images and uses a vision-language model to produce patch-level embeddings (one embedding per image patch). It uses late interaction scoring: for each query token, find the maximum similarity with any document patch, then sum across query tokens. This is much better than text-extraction RAG for documents where: (1) layout complexity makes OCR unreliable, (2) charts/graphs contain key information, (3) tables have complex cross-references, or (4) visual design conveys information that text can't capture. ColPali outperforms text-extraction RAG by 20-30 points on document Q&A benchmarks. Use it when your document corpus contains rich visual content that traditional text extraction misses.
Summary
Multimodal embeddings extend the shared-embedding-space concept beyond text to images, audio, and other modalities:
- CLIP (OpenAI, 2021): First large-scale image-text embedding. 400M training pairs. Enables zero-shot classification and cross-modal retrieval.
- OpenCLIP: Open-source CLIP reproduction. LAION-5B training. Often outperforms original CLIP.
- SigLIP (Google, 2023): Sigmoid loss instead of softmax. Better retrieval quality, smaller batch size requirement.
- ImageBind (Meta, 2023): Six modalities in one space. Images as binding modality.
- CLAP: Audio-text contrastive learning. Enables text-to-audio retrieval.
- ColPali (2024): Document retrieval as image retrieval. Late interaction scoring over image patches. Best for complex documents with charts and tables.
Production pattern: combine text embeddings (for pure text search) with CLIP/SigLIP embeddings (for visual content) using hybrid scoring. For document retrieval with complex layouts, ColPali's visual approach often outperforms text extraction.
