
Multimodal RAG

Reading time: ~30 min | Interview relevance: Very High | Target roles: AI Engineer, ML Engineer, Backend Engineer

The Technical Specification That Broke the Bot

Your team built a RAG system for a manufacturing company's internal knowledge base - 50,000 pages of technical documentation. The indexing pipeline chunked PDFs into 500-token text blocks, embedded them with OpenAI's text-embedding-3-small, stored them in Pinecone, and retrieved the top-5 chunks for each query. Engineers could ask "what is the maximum torque spec for the Model X-7 coupling?" and get an answer grounded in the actual documentation.

The product was a success until the chief engineer's team started using it. Within two weeks, she filed three critical failure reports. The first: a query about electrical panel wiring returned confident text-based instructions - but the actual specification was in a circuit diagram. The text around the diagram said "see Figure 4.3 for the complete wiring schematic" but the diagram itself contained the authoritative information. Your pipeline had indexed the text, not the figure. The answer it returned was a plausible reconstruction from surrounding text, not the actual specification. An electrical engineer used it. The resulting mistake was expensive.

The second failure: a query about "the approved material grades for load-bearing welds" returned a text passage that mentioned the table but not the table contents. The answer was a list of materials from a different section that happened to use similar language. The table on page 89 had 14 specific material codes. Your system had never seen them.

The third: a query about "installation sequence for module 7B" returned a correct procedure - except the procedure included a visual step ("ensure the alignment indicator shows green") that the text described as "confirm visual indicator status." The photo showing what the green indicator looked like was not indexed. Engineers were checking a different indicator.

Text-only RAG has a fundamental blind spot: documents are not just text. They are a combination of text, tables, figures, diagrams, photos, and charts - where the figures often contain the most precise and authoritative information. The system that indexes only text will confidently answer questions using the shadows of figures, not the figures themselves.

Multimodal RAG closes this gap. It enables retrieval systems to find relevant images, figures, tables, and diagrams - and reasoning systems to incorporate that visual evidence in their answers.

What Multimodal RAG Adds to Standard RAG

Standard text-only RAG pipeline:

  1. Chunk documents into text passages
  2. Embed text chunks with a text encoder
  3. Index embeddings in a vector database
  4. At query time: embed query, retrieve top-k chunks, pass to LLM

This fails for any information that lives in non-text form: tables, charts, diagrams, photos, mathematical notation rendered as images, handwritten annotations.

Multimodal RAG extends each step:

  1. Parse documents to extract both text and visual elements (images, figures, tables)
  2. Embed visual elements using a multimodal encoder (CLIP) or caption them with a VLM
  3. Index visual embeddings (or captions) alongside text
  4. At query time: retrieve both text and images, pass both to a multimodal LLM

The challenge is that "retrieve both text and images" sounds simple but involves significant engineering choices with different accuracy, cost, and latency trade-offs.

Three Architecture Patterns

Architecture 1: Extract-Then-Index (Caption-Based)

The simplest approach: extract all figures and images from documents, caption each one with a VLM, index the captions as text. Standard text RAG can then retrieve images based on caption similarity.

Strengths: Works with any text-based retrieval infrastructure. Caption quality can be high with a strong VLM. Easy to implement.

Weaknesses: VLM captioning is expensive at scale ($0.004-0.02 per image). Caption quality is bounded by the VLM - subtle visual details may not be captured in text. Captioning at indexing time adds significant preprocessing cost and latency.

Architecture 2: Embed-Then-Retrieve (CLIP-Based)

Instead of captioning images, embed them directly with CLIP and store the embeddings. Queries are embedded with the same CLIP text encoder. Images are retrieved by text-to-image embedding similarity.

Strengths: No VLM cost at indexing time (CLIP embedding is cheap). Captures visual information that text captions might miss. Enables true cross-modal retrieval.

Weaknesses: CLIP embeddings are general-purpose - they may not capture fine-grained technical details relevant to your domain. CLIP works best for natural images; it underperforms on technical diagrams, charts, and mathematical figures.
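
The embed-then-retrieve flow can be sketched as below. This is a minimal illustration, not a tuned implementation: it assumes the Hugging Face `transformers` CLIP classes and the `openai/clip-vit-base-patch32` checkpoint, and it assumes `get_image_features` / `get_text_features` return plain tensors, as in transformers 4.x. The helper names (`load_clip`, `embed_images`, `embed_query`, `rank_images`) are hypothetical.

```python
import math


def load_clip(model_name="openai/clip-vit-base-patch32"):
    """Load CLIP once; imported lazily so the ranking helpers work without it."""
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(model_name).eval()
    processor = CLIPProcessor.from_pretrained(model_name)
    return model, processor


def embed_images(model, processor, images):
    """Embed a list of PIL images; returns one list of floats per image."""
    import torch

    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats.tolist()


def embed_query(model, processor, query):
    """Embed a text query with the CLIP text encoder (77-token limit, so truncate)."""
    import torch

    inputs = processor(text=[query], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats[0].tolist()


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_images(query_vec, image_vecs, top_k=3):
    """Return (image_index, similarity) pairs, best match first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(image_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

In use, `embed_images` would run once at indexing time and `embed_query` plus `rank_images` at query time. A production system would replace the linear scan in `rank_images` with a vector database lookup.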

Architecture 3: Late Interaction - ColPali

ColPali (Faysse et al., 2024) takes a fundamentally different approach. Instead of parsing each page into text and figures that then have to be captioned or compressed into a single vector, ColPali treats each page as an image and creates multi-vector, patch-level embeddings.

The insight: a single embedding vector for a document page loses too much spatial information. A technical specification might have a critical table in the top-right corner - a global embedding averages over the whole page and dilutes the table's signal. ColPali generates one embedding per image patch. With its PaliGemma backbone, the page is resized to 448x448 pixels and split into a 32x32 grid of 14x14-pixel patches, producing roughly 1,000 vectors per page, each projected down to 128 dimensions. Retrieval uses MaxSim (maximum similarity) - the query-document score is the sum of each query token's maximum similarity over all document patch embeddings.

This is the same late interaction idea as ColBERT for text retrieval, applied to visual document pages.
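
To make MaxSim concrete, here is a small pure-Python sketch with toy 2-dimensional vectors. The score is score(q, d) = sum over query tokens i of max over patches j of (q_i · d_j). The vectors and the names `query`, `page_multi`, and `page_single` are made up for illustration; real ColPali vectors are normalized and much higher-dimensional.

```python
def maxsim(query_vecs, patch_vecs):
    """Sum, over query tokens, of the best dot product against any page patch."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(a * b for a, b in zip(q, p)) for p in patch_vecs)
    return total


# Two query tokens, each pointing along a different axis.
query = [[1.0, 0.0], [0.0, 1.0]]

# Multi-vector page: one patch matches each query token, one patch is blank.
page_multi = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

# The same page collapsed into one averaged vector.
page_single = [[0.5, 0.5]]

score_multi = maxsim(query, page_multi)
score_single = maxsim(query, page_single)
```

Each query token finds its own best patch on the multi-vector page, so `score_multi` is twice `score_single`. This is the dilution effect that a single global embedding suffers from.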

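Architecture: ColPali uses PaliGemma (a vision-language model) as its backbone, fine-tuned with a contrastive objective on query-page pairs drawn from existing document QA datasets and from synthetic queries generated for crawled PDF pages. The vision encoder splits the page image into patches, the patch tokens pass through the language model, and a projection layer maps each output token to the low-dimensional vector that gets indexed.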

Results: In the paper's evaluation on the ViDoRe benchmark (which includes DocVQA and InfoVQA subsets, among others), ColPali outperforms both single-vector CLIP-style retrievers and caption-based text pipelines. It is especially strong on documents with complex visual layouts - technical tables, charts, slides with diagrams.

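Tradeoff: Multi-vector storage is more expensive than single-vector. Roughly 1,000 vectors per page vs. 1 vector per page means about 1,000x more vectors in the index. The growth in bytes is smaller than that when the patch vectors are low-dimensional (128 dimensions in ColPali), but it is still substantial. Retrieval (MaxSim) is also more expensive than standard nearest-neighbor search.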
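
A back-of-the-envelope calculation shows the scale of the storage gap. The numbers below are assumptions for illustration: the 50,000 pages from the opening scenario, one 1,536-dimensional float32 vector per page for the single-vector index, and 1,030 vectors of 128 dimensions in 16-bit precision per page for the multi-vector index. Index overhead and compression are ignored.

```python
def index_bytes(num_pages, vectors_per_page, dim, bytes_per_value):
    """Raw embedding storage, ignoring index overhead and compression."""
    return num_pages * vectors_per_page * dim * bytes_per_value


NUM_PAGES = 50_000  # assumed corpus size

# Single-vector index: one 1,536-dim float32 vector per page (assumption).
single = index_bytes(NUM_PAGES, 1, 1536, 4)

# Multi-vector index: 1,030 vectors x 128 dims x 2 bytes per page (assumption).
multi = index_bytes(NUM_PAGES, 1030, 128, 2)

ratio = multi / single

print(f"single-vector: {single / 1e9:.2f} GB")
print(f"multi-vector:  {multi / 1e9:.2f} GB")
print(f"ratio:         {ratio:.0f}x")
```

Under these assumptions the multi-vector index holds about 1,000x more vectors but only tens of times more bytes, because each patch vector is much smaller than the single-page vector. Change the dimensions or precision and the ratio moves accordingly.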

PDF Parsing: Extracting Visual Content

The first challenge in multimodal RAG is getting the images out of the documents.

import fitz # PyMuPDF
from pathlib import Path
from PIL import Image
import io
import base64
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtractedPage:
    page_num: int
    text: str
    images: list[dict]  # list of image info dicts
    page_image: Optional[Image.Image] = None  # full page rendered as an image


@dataclass
class ExtractedDocument:
    path: str
    pages: list[ExtractedPage] = field(default_factory=list)
    total_images: int = 0


def extract_pdf_content(
    pdf_path: str,
    extract_page_images: bool = True,
    page_dpi: int = 150,
    min_image_size: int = 50,  # minimum width/height in pixels
) -> ExtractedDocument:
    """
    Extract text, embedded images, and full-page renders from a PDF.

    Args:
        pdf_path: Path to the PDF file
        extract_page_images: Whether to render each page as an image
        page_dpi: DPI for page rendering (150 good quality, 72 fast)
        min_image_size: Filter out images smaller than this (decorative icons)
    """
    doc = fitz.open(pdf_path)
    result = ExtractedDocument(path=pdf_path)

    for page_num in range(len(doc)):
        page = doc[page_num]

        # Extract text
        text = page.get_text("text")

        # Extract embedded raster images. Diagrams drawn as vector graphics are
        # not returned by get_images(); they only show up in the page render below.
        images = []
        for img_index, img in enumerate(page.get_images(full=True)):
            xref = img[0]

            try:
                base_image = doc.extract_image(xref)
                image_bytes = base_image["image"]
                img_pil = Image.open(io.BytesIO(image_bytes)).convert("RGB")

                # Filter out tiny decorative images
                if img_pil.width < min_image_size or img_pil.height < min_image_size:
                    continue

                images.append({
                    "index": img_index,
                    "page": page_num,
                    "width": img_pil.width,
                    "height": img_pil.height,
                    "image": img_pil,
                    "format": base_image.get("ext", "png"),
                })
            except Exception as e:
                print(f"Error extracting image {img_index} from page {page_num}: {e}")

        # Render full page as image (for ColPali or page-level analysis)
        page_img = None
        if extract_page_images:
            zoom = page_dpi / 72  # PDF user space is 72 units per inch
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)
            page_img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)

        result.pages.append(ExtractedPage(
            page_num=page_num,
            text=text,
            images=images,
            page_image=page_img,
        ))
        result.total_images += len(images)

    doc.close()
    print(f"Extracted {len(result.pages)} pages, {result.total_images} images from {pdf_path}")
    return result


def pil_to_base64(img: Image.Image, format: str = "JPEG", quality: int = 85) -> tuple[str, str]:
    """Convert a PIL image to a (base64 string, media type) pair for API calls."""
    buffer = io.BytesIO()
    if format == "JPEG" and img.mode != "RGB":
        # JPEG cannot store alpha or palette modes (RGBA, LA, P, ...),
        # so convert anything that is not already RGB.
        img = img.convert("RGB")
    img.save(buffer, format=format, quality=quality)
    b64 = base64.standard_b64encode(buffer.getvalue()).decode("utf-8")
    return b64, f"image/{format.lower()}"

Code: Figure Captioning with Claude

import anthropic
from typing import Optional


def caption_figure_with_claude(
    image: Image.Image,
    context_text: str = "",
    document_type: str = "technical documentation",
    model: str = "claude-3-5-sonnet-20241022",
) -> str:
    """
    Generate a detailed caption for a figure using Claude.

    The caption should capture information that would be retrievable
    when someone asks a question about the figure's content.
    """
    client = anthropic.Anthropic()

    image_b64, media_type = pil_to_base64(image)

    system_prompt = f"""You are analyzing figures from {document_type}.
Your task is to create a comprehensive, searchable description of each figure.
The description should:
1. Describe what type of figure this is (chart, diagram, table, photograph, schematic, etc.)
2. Capture all text visible in the figure (labels, values, titles, legend items)
3. Describe the key information conveyed (values, relationships, processes)
4. Include specific numbers, percentages, codes, or identifiers present
5. Note any warnings, notes, or callouts in the figure

Write as a factual description, not a caption. Be specific and complete."""

    user_content = []

    if context_text:
        user_content.append({
            "type": "text",
            "text": f"The surrounding text context is:\n{context_text[:500]}\n\nNow describe this figure:",
        })

    user_content.append({
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": image_b64,
        },
    })

    if not context_text:
        user_content.append({
            "type": "text",
            "text": "Describe this figure in detail:",
        })

    message = client.messages.create(
        model=model,
        max_tokens=512,
        system=system_prompt,
        messages=[{"role": "user", "content": user_content}],
    )

    return message.content[0].text


def batch_caption_figures(
    extracted_doc: ExtractedDocument,
    context_window: int = 200,
) -> list[dict]:
    """
    Caption all figures in a document.
    Returns list of captioning results with metadata.
    """
    results = []

    for page in extracted_doc.pages:
        for img_info in page.images:
            # Approximate the "surrounding text" with the first context_window
            # characters of the page; the figure's true position is not tracked here.
            context = page.text[:context_window] if page.text else ""

            print(f"Captioning image {img_info['index']} on page {img_info['page']}...")

            caption = caption_figure_with_claude(
                image=img_info["image"],
                context_text=context,
            )

            results.append({
                "doc_path": extracted_doc.path,
                "page_num": img_info["page"],
                "image_index": img_info["index"],
                "image_size": (img_info["width"], img_info["height"]),
                "caption": caption,
                "image": img_info["image"],
            })

    return results

Code: Full Multimodal RAG Pipeline

from pathlib import Path

import anthropic
import numpy as np
from openai import OpenAI


class MultimodalRAGPipeline:
    """
    Production multimodal RAG pipeline combining text and image retrieval.

    Architecture:
    - Index: extract text + figures from PDFs, caption figures, embed everything
    - Retrieval: embed query, find top-k text + image matches
    - Generation: pass text + retrieved images to multimodal LLM
    """

    def __init__(
        self,
        openai_api_key: str,
        anthropic_api_key: str,
        embedding_model: str = "text-embedding-3-small",
        generation_model: str = "claude-3-5-sonnet-20241022",
    ):
        self.openai_client = OpenAI(api_key=openai_api_key)
        self.anthropic_client = anthropic.Anthropic(api_key=anthropic_api_key)
        self.embedding_model = embedding_model
        self.generation_model = generation_model

        # In-memory index (use Pinecone/Weaviate in production)
        self.text_chunks: list[dict] = []
        self.figure_chunks: list[dict] = []
        self.text_embeddings: list[np.ndarray] = []
        self.figure_embeddings: list[np.ndarray] = []

    def embed_text(self, text: str) -> np.ndarray:
        """Embed text using OpenAI embeddings."""
        response = self.openai_client.embeddings.create(
            model=self.embedding_model,
            input=text,
        )
        return np.array(response.data[0].embedding)

    def cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        """Compute cosine similarity between two vectors."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def index_document(
        self,
        pdf_path: str,
        chunk_size: int = 500,
        chunk_overlap: int = 50,
    ) -> dict:
        """Index a PDF document: extract, caption, and embed."""
        print(f"Indexing {pdf_path}...")

        # 1. Extract content
        extracted = extract_pdf_content(pdf_path)

        # 2. Index text chunks
        # chunk_size and chunk_overlap count whitespace-separated words, not model tokens.
        step = max(1, chunk_size - chunk_overlap)  # guard against a non-positive step
        for page in extracted.pages:
            if not page.text.strip():
                continue

            # Simple fixed-size chunking (use semantic chunking in production)
            words = page.text.split()
            for i in range(0, len(words), step):
                chunk_words = words[i:i + chunk_size]
                if len(chunk_words) < 20:
                    continue
                chunk_text = " ".join(chunk_words)

                self.text_chunks.append({
                    "text": chunk_text,
                    "source": pdf_path,
                    "page": page.page_num,
                    "type": "text",
                })
                self.text_embeddings.append(self.embed_text(chunk_text))

        # 3. Caption and index figures (captions are kept whole, never chunked)
        figure_captions = batch_caption_figures(extracted)

        for fig_data in figure_captions:
            caption = fig_data["caption"]
            self.figure_chunks.append({
                "text": caption,
                "source": pdf_path,
                "page": fig_data["page_num"],
                "type": "figure",
                "image": fig_data["image"],
                "image_index": fig_data["image_index"],
            })
            self.figure_embeddings.append(self.embed_text(caption))

        print(f"Indexed {len(extracted.pages)} pages, "
              f"{len(figure_captions)} figures from {pdf_path}")

        # Totals across every document indexed so far
        return {
            "text_chunks": len(self.text_chunks),
            "figure_chunks": len(self.figure_chunks),
        }

    def retrieve(
        self,
        query: str,
        top_k_text: int = 3,
        top_k_figures: int = 2,
    ) -> dict:
        """Retrieve relevant text and figures for a query."""
        query_embedding = self.embed_text(query)

        # Retrieve text chunks
        text_scores = [
            (i, self.cosine_similarity(query_embedding, emb))
            for i, emb in enumerate(self.text_embeddings)
        ]
        text_scores.sort(key=lambda x: x[1], reverse=True)
        top_text = [self.text_chunks[i] for i, _ in text_scores[:top_k_text]]

        # Retrieve figures (by similarity between the query and each caption)
        figure_scores = [
            (i, self.cosine_similarity(query_embedding, emb))
            for i, emb in enumerate(self.figure_embeddings)
        ]
        figure_scores.sort(key=lambda x: x[1], reverse=True)
        top_figures = [self.figure_chunks[i] for i, _ in figure_scores[:top_k_figures]]

        return {
            "text_chunks": top_text,
            "figures": top_figures,
        }

    def answer(
        self,
        query: str,
        top_k_text: int = 3,
        top_k_figures: int = 2,
    ) -> dict:
        """Answer a query using retrieved text and figures."""
        retrieved = self.retrieve(query, top_k_text, top_k_figures)

        # Build context text
        context_parts = []
        for chunk in retrieved["text_chunks"]:
            context_parts.append(
                f"[Text from {Path(chunk['source']).name}, page {chunk['page'] + 1}]\n"
                f"{chunk['text']}"
            )
        context_text = "\n\n---\n\n".join(context_parts)

        # Build message content with images
        content = []

        # Add context text
        content.append({
            "type": "text",
            "text": f"Answer the question based on the provided documentation excerpts and figures.\n\n"
                    f"CONTEXT:\n{context_text}\n\n"
                    f"RETRIEVED FIGURES:\n",
        })

        # Add retrieved figures, each preceded by a text label naming its source
        for fig in retrieved["figures"]:
            img_b64, media_type = pil_to_base64(fig["image"])
            content.append({
                "type": "text",
                "text": f"[Figure from {Path(fig['source']).name}, page {fig['page'] + 1}]:",
            })
            content.append({
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": img_b64,
                },
            })

        # Add the question
        content.append({
            "type": "text",
            "text": f"\nQUESTION: {query}\n\n"
                    "Provide a precise answer. If the information is in a figure, reference the figure. "
                    "If you cannot answer from the provided materials, say so.",
        })

        message = self.anthropic_client.messages.create(
            model=self.generation_model,
            max_tokens=1024,
            messages=[{"role": "user", "content": content}],
        )

        return {
            "answer": message.content[0].text,
            "retrieved_text": retrieved["text_chunks"],
            "retrieved_figures": [
                {
                    "source": f["source"],
                    "page": f["page"],
                    "caption": f["text"],
                }
                for f in retrieved["figures"]
            ],
        }


# Example usage
if __name__ == "__main__":
    import os

    pipeline = MultimodalRAGPipeline(
        openai_api_key=os.environ["OPENAI_API_KEY"],
        anthropic_api_key=os.environ["ANTHROPIC_API_KEY"],
    )

    # Index documents
    pipeline.index_document("technical_manual.pdf")
    pipeline.index_document("product_specification.pdf")

    # Query
    result = pipeline.answer("What is the maximum load rating for the Type B bracket?")
    print("Answer:", result["answer"])
    print(f"\nRetrieved {len(result['retrieved_text'])} text chunks "
          f"and {len(result['retrieved_figures'])} figures")

Code: ColPali-Style Page-Level Retrieval

import torch
from PIL import Image

# Assumes the colpali-engine package (pip install colpali-engine), which ships the
# ColPali model class with its projection head and a matching processor. Loading the
# checkpoint as a plain PaliGemma model would not produce ColPali's retrieval vectors.
from colpali_engine.models import ColPali, ColPaliProcessor


class ColPaliRetriever:
    """
    ColPali-style visual document retrieval.
    Treats document pages as images and creates patch-level embeddings.
    """

    def __init__(self, model_name: str = "vidore/colpali-v1.2"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Loading ColPali model on {self.device}...")
        self.processor = ColPaliProcessor.from_pretrained(model_name)
        self.model = ColPali.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
        ).to(self.device)
        self.model.eval()
        print("ColPali loaded.")

        self.page_embeddings: list[torch.Tensor] = []  # list of (num_vectors, D)
        self.page_metadata: list[dict] = []

    def embed_page(self, page_image: Image.Image) -> torch.Tensor:
        """
        Embed a document page image.
        Returns: (num_vectors, D) tensor, one vector per patch or prompt token
        """
        batch = self.processor.process_images([page_image]).to(self.device)

        with torch.no_grad():
            embeddings = self.model(**batch)  # (1, num_vectors, D)

        return embeddings[0].float().cpu()

    def embed_query(self, query: str) -> torch.Tensor:
        """
        Embed a text query.
        Returns: (num_query_tokens, D) tensor
        """
        batch = self.processor.process_queries([query]).to(self.device)

        with torch.no_grad():
            embeddings = self.model(**batch)  # (1, num_tokens, D)

        return embeddings[0].float().cpu()

    def maxsim_score(
        self,
        query_embeddings: torch.Tensor,  # (Q, D)
        page_embeddings: torch.Tensor,  # (P, D)
    ) -> float:
        """
        Compute MaxSim score between query and document page.
        For each query token, find the maximum similarity with any page patch.
        Sum these max similarities.
        """
        # Normalize so the dot product is a cosine similarity
        q_norm = torch.nn.functional.normalize(query_embeddings, dim=-1)
        p_norm = torch.nn.functional.normalize(page_embeddings, dim=-1)

        # (Q, P) similarity matrix
        sim_matrix = q_norm @ p_norm.T

        # For each query token, take max similarity over all patches
        max_sims = sim_matrix.max(dim=1).values  # (Q,)

        return max_sims.sum().item()

    def index_document(self, extracted_doc: ExtractedDocument):
        """Index all pages of a document."""
        for page in extracted_doc.pages:
            if page.page_image is None:
                print(f"Page {page.page_num} has no rendered image, skipping.")
                continue

            print(f"Indexing page {page.page_num + 1}...")
            embeddings = self.embed_page(page.page_image)

            self.page_embeddings.append(embeddings)
            self.page_metadata.append({
                "doc_path": extracted_doc.path,
                "page_num": page.page_num,
                "page_image": page.page_image,
                "text_preview": page.text[:200],
            })

        print(f"Indexed {len(self.page_metadata)} pages.")

    def retrieve(
        self,
        query: str,
        top_k: int = 5,
    ) -> list[dict]:
        """Retrieve top-k most relevant pages for a query."""
        query_embeddings = self.embed_query(query)

        # Brute-force scan over every indexed page; fine for a demo, slow at scale
        scores = []
        for i, page_emb in enumerate(self.page_embeddings):
            score = self.maxsim_score(query_embeddings, page_emb)
            scores.append((i, score))

        scores.sort(key=lambda x: x[1], reverse=True)

        results = []
        for i, score in scores[:top_k]:
            meta = self.page_metadata[i].copy()
            meta["score"] = score
            results.append(meta)

        return results

Video RAG: Frame Sampling and Keyframe Extraction

RAG over video follows the same principles but requires temporal sampling. You cannot index every frame - at 30fps, an hour of video is 108,000 frames.

import cv2
import numpy as np
from PIL import Image


def extract_keyframes(
    video_path: str,
    method: str = "uniform",
    num_frames: int = 30,
    scene_threshold: float = 0.4,
) -> list[dict]:
    """
    Extract keyframes from a video for indexing.

    Methods:
    - "uniform": extract frames at uniform intervals
    - "scene": extract frames at scene boundaries (content-aware)
    """
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS)
    if fps <= 0 or total_frames <= 0:
        cap.release()
        raise ValueError(f"Could not read frame count or FPS from {video_path}")
    duration = total_frames / fps

    keyframes = []

    if method == "uniform":
        frame_indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

        for frame_idx in frame_indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(frame_idx))
            ret, frame = cap.read()
            if ret:
                timestamp = frame_idx / fps
                rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                pil_image = Image.fromarray(rgb_frame)

                keyframes.append({
                    "frame_idx": int(frame_idx),
                    "timestamp": float(timestamp),
                    "timestamp_str": f"{int(timestamp // 60):02d}:{int(timestamp % 60):02d}",
                    "image": pil_image,
                })

    elif method == "scene":
        # Scene detection using the mean absolute difference between consecutive frames.
        # scene_threshold is on a 0-1 scale; a high value like 0.4 only triggers on
        # abrupt cuts, so tune it on your own footage.
        prev_frame_gray = None
        frame_idx = 0

        # Read sequentially; seeking before every read is unnecessary and slow.
        while True:
            ret, frame = cap.read()
            if not ret:
                break

            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

            if prev_frame_gray is not None:
                # Compute frame difference
                diff = cv2.absdiff(prev_frame_gray, gray)
                mean_diff = float(diff.mean()) / 255.0

                if mean_diff > scene_threshold:
                    timestamp = frame_idx / fps
                    rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                    pil_image = Image.fromarray(rgb_frame)

                    keyframes.append({
                        "frame_idx": frame_idx,
                        "timestamp": timestamp,
                        "timestamp_str": f"{int(timestamp // 60):02d}:{int(timestamp % 60):02d}",
                        "image": pil_image,
                        "scene_diff": mean_diff,
                    })

            prev_frame_gray = gray
            frame_idx += 1

    else:
        cap.release()
        raise ValueError(f"Unknown method: {method!r} (expected 'uniform' or 'scene')")

    cap.release()
    print(f"Extracted {len(keyframes)} keyframes from {duration:.1f}s video")
    return keyframes

Production Engineering Notes

Cost Estimation for Multimodal Indexing

The major cost driver in multimodal RAG is VLM captioning at index time.

| Component | Cost | Notes |
| --- | --- | --- |
| PDF text extraction | Negligible | PyMuPDF is free |
| CLIP embedding (image) | ~$0.001/1K images | Fast, cheap |
| Claude 3 Haiku captioning | ~$0.004/image | 1,600 tokens per image |
| Claude 3 Sonnet captioning | ~$0.012/image | Higher quality |
| GPT-4V captioning | ~$0.010/image | |
| Text embedding | ~$0.0001/1K tokens | OpenAI text-embedding-3-small |

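For a document corpus of 10,000 images:

  • CLIP embedding: ~$0.01 (10 x the ~$0.001 per 1K images rate above)
  • Claude Haiku captioning: ~$40
  • Claude Sonnet captioning: ~$120

Budget captioning accordingly, and treat the per-image figures as rough estimates that shift with provider pricing and image resolution. For large corpora, start with Haiku or GPT-4o mini, then re-caption high-traffic figures, or figures whose first-pass captions look unreliable, with Sonnet.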
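
The arithmetic behind these estimates, including the two-tier strategy, can be sketched as follows. The per-image rates are the approximate figures quoted in this section, and `promote_fraction` (the share of figures re-captioned with the stronger model) is a hypothetical parameter chosen for illustration.

```python
HAIKU_PER_IMAGE = 0.004   # approximate rate quoted in this section
SONNET_PER_IMAGE = 0.012  # approximate rate quoted in this section


def flat_cost(num_images, cost_per_image):
    """Cost of captioning every image once with a single model."""
    return num_images * cost_per_image


def tiered_cost(num_images, promote_fraction, cheap_rate, strong_rate):
    """Cheap pass over everything, plus a strong-model pass over a promoted subset."""
    cheap_pass = num_images * cheap_rate
    strong_pass = num_images * promote_fraction * strong_rate
    return cheap_pass + strong_pass


all_haiku = flat_cost(10_000, HAIKU_PER_IMAGE)
all_sonnet = flat_cost(10_000, SONNET_PER_IMAGE)
haiku_then_10pct_sonnet = tiered_cost(10_000, 0.10, HAIKU_PER_IMAGE, SONNET_PER_IMAGE)

print(f"All Haiku:           ${all_haiku:,.2f}")
print(f"All Sonnet:          ${all_sonnet:,.2f}")
print(f"Haiku + 10% Sonnet:  ${haiku_then_10pct_sonnet:,.2f}")
```

With these rates, promoting a tenth of the figures costs well under half of captioning everything with the stronger model.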

Chunking Strategy for Mixed Documents

When a document has both text and figures, the retrieval should handle cross-modal references:

def create_contextual_chunks(
    extracted_doc: ExtractedDocument,
    chunk_size: int = 500,
) -> list[dict]:
    """
    Create chunks that carry figure metadata alongside the surrounding text.
    A chunk that says "see Figure 4.3" is flagged, and pages with images get an
    extra summary chunk that holds the images, so retrieval can return the figure.
    """
    chunks = []

    for page in extracted_doc.pages:
        # Check if page mentions figures (page-level flag, shared by all its chunks)
        text = page.text
        has_figure_ref = any(
            marker in text.lower()
            for marker in ["figure", "fig.", "diagram", "chart", "table"]
        )

        # Text chunks with figure context (chunk_size counts words, not tokens)
        words = text.split()
        for i in range(0, len(words), chunk_size):
            chunk_text = " ".join(words[i:i + chunk_size])
            chunks.append({
                "text": chunk_text,
                "page": page.page_num,
                "has_figure_ref": has_figure_ref,
                "has_images": len(page.images) > 0,
                "type": "text",
            })

        # For pages with figures, also add a combined text+figure chunk
        if page.images and page.text.strip():
            summary = f"Page {page.page_num + 1} contains {len(page.images)} figure(s). "
            summary += text[:300]
            chunks.append({
                "text": summary,
                "page": page.page_num,
                "type": "page_summary",
                "images": [img["image"] for img in page.images],
            })

    return chunks
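
The function above only flags that a page mentions a figure somewhere. To resolve a specific reference such as "see Figure 4.3", one option is to extract the reference labels with a regular expression and store them on each chunk, so they can later be joined against figure captions carrying the same label. The sketch below is an illustrative assumption: the pattern only covers English "Figure", "Fig." and "Table" labels, and `find_figure_refs` is a hypothetical helper name.

```python
import re

# Matches "Figure 4.3", "Fig. 2", "Table 12" (case-insensitive).
FIGURE_REF = re.compile(r"\b(Figure|Fig\.|Table)\s+(\d+(?:\.\d+)*)", re.IGNORECASE)


def find_figure_refs(text):
    """Return normalized references such as 'figure 4.3', in order of appearance."""
    refs = []
    for kind, number in FIGURE_REF.findall(text):
        label = "figure" if kind.lower().startswith("fig") else "table"
        ref = f"{label} {number}"
        if ref not in refs:  # keep first occurrence only
            refs.append(ref)
    return refs


def annotate_chunks(chunks):
    """Add a 'figure_refs' list to each chunk dict that has a 'text' field."""
    for chunk in chunks:
        chunk["figure_refs"] = find_figure_refs(chunk.get("text", ""))
    return chunks


example = annotate_chunks([
    {"text": "See Figure 4.3 for the complete wiring schematic.", "page": 41},
    {"text": "Approved grades are listed in Table 12 and Fig. 2.", "page": 88},
])
```

At query time, a retrieved text chunk whose `figure_refs` is non-empty can trigger a second lookup for the referenced figure, instead of hoping the caption is similar enough to the query to surface on its own.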

Multimodal Reranking

A cheap embedding retrieval followed by expensive VLM reranking is a powerful pattern:

import re


def rerank_with_vlm(
    query: str,
    retrieved_figures: list[dict],
    top_k: int = 2,
    model: str = "claude-3-haiku-20240307",
) -> list[dict]:
    """
    Use a VLM to rerank retrieved figures by relevance to the query.
    Run a cheap first-pass CLIP retrieval, then expensive VLM reranking on top-k.
    """
    client = anthropic.Anthropic()
    scored_figures = []

    for fig in retrieved_figures:
        img_b64, media_type = pil_to_base64(fig["image"])

        message = client.messages.create(
            model=model,
            max_tokens=10,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": media_type,
                                "data": img_b64,
                            },
                        },
                        {
                            "type": "text",
                            "text": f"Does this figure directly answer or provide information relevant to: '{query}'?\n"
                                    "Reply with a single number 1-5 where 5 = highly relevant, 1 = not relevant.",
                        },
                    ],
                }
            ],
        )

        # Take the first digit 1-5 anywhere in the reply, so "Score: 4" still parses.
        # Fall back to the lowest score if the reply is empty or has no such digit.
        try:
            match = re.search(r"[1-5]", message.content[0].text)
            score = int(match.group()) if match else 1
        except (IndexError, AttributeError):
            score = 1

        scored_figures.append({**fig, "vlm_relevance_score": score})

    # sorted() is stable, so ties keep their first-pass retrieval order
    scored_figures.sort(key=lambda x: x["vlm_relevance_score"], reverse=True)
    return scored_figures[:top_k]

Common Mistakes

:::danger Captioning Figures Without Surrounding Context

Generating captions for figures in isolation misses critical context. A bar chart without a title and axis labels is uninterpretable. Always pass the surrounding text (page title, section header, caption if present, adjacent paragraph) to the VLM when generating figure descriptions. Without context, a chart showing quarterly revenue could be captioned as "a bar chart" - useless for retrieval.

:::

:::danger Using the Same Chunk Size for Text and Captions

Figure captions are typically 100-300 tokens - much shorter than text chunks. Using aggressive chunking on captions can split them mid-sentence. Keep figure captions as atomic units. Do not chunk them further.

:::

:::warning Ignoring Retrieval Quality Metrics

Multimodal RAG systems need retrieval evaluation separate from generation evaluation. Track: Recall@K (what fraction of relevant figures appear in top-K?), MRR (mean reciprocal rank of the first relevant result). Without these metrics, you cannot tell whether poor answers are from bad retrieval (did not find the right figure) or bad generation (found it but answered wrong).

:::
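
Both retrieval metrics are a few lines of code. The sketch below assumes each retrieved item is identified by a hashable ID (for example a hypothetical `"manual.pdf:p89:fig0"` string) and that the set of relevant IDs per query comes from a hand-labeled evaluation set.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant items that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


# Two labeled queries: the first finds its figure at rank 2, the second at rank 1.
runs = [
    (["fig_a", "fig_b", "fig_c"], {"fig_b", "fig_d"}),
    (["fig_x", "fig_y"], {"fig_x"}),
]
recall_q1 = recall_at_k(runs[0][0], runs[0][1], k=2)
mrr = mean_reciprocal_rank(runs)
```

Computing these separately for text chunks and for figures shows which modality is dragging retrieval down.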

:::warning Not Caching Figure Captions

VLM captioning is the most expensive step in multimodal indexing. Figures do not change between indexing runs for the same document. Always cache captions by (document_hash, figure_index). Recompute only when the document changes. A cache miss on re-indexing a 1,000-figure corpus costs $4-12 in API calls.

:::
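
A caption cache keyed by (document_hash, figure_index) can be as simple as a JSON file. The sketch below is one possible shape under stated assumptions: `CaptionCache` and `document_hash` are hypothetical names, the hash is SHA-256 over the raw PDF bytes, and `caption_fn` stands in for whatever function actually calls the VLM.

```python
import hashlib
import json
from pathlib import Path


def document_hash(pdf_bytes):
    """Content hash of the PDF; changes whenever the file changes."""
    return hashlib.sha256(pdf_bytes).hexdigest()


class CaptionCache:
    """Caption store keyed by (document hash, figure index), optionally on disk."""

    def __init__(self, path=None):
        self.path = Path(path) if path else None  # None means in-memory only
        self.data = {}
        if self.path and self.path.exists():
            self.data = json.loads(self.path.read_text(encoding="utf-8"))

    def get_or_create(self, doc_hash, figure_index, caption_fn):
        """Return the cached caption, calling caption_fn() only on a miss."""
        key = f"{doc_hash}:{figure_index}"
        if key not in self.data:
            self.data[key] = caption_fn()
            if self.path:
                self.path.write_text(json.dumps(self.data), encoding="utf-8")
        return self.data[key]


# Usage with a stand-in captioner that counts how often it is called.
calls = []
cache = CaptionCache()
doc_id = document_hash(b"%PDF-1.7 example bytes")
first = cache.get_or_create(doc_id, 0, lambda: calls.append(1) or "wiring schematic")
second = cache.get_or_create(doc_id, 0, lambda: calls.append(1) or "wiring schematic")
```

The second lookup is served from the cache, so the captioner runs once. Because the key includes the document hash, an edited PDF produces new keys and gets re-captioned automatically.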

Interview Questions and Answers

Q1: What are the three main architectures for multimodal RAG, and when would you choose each?

The three architectures are:

  1. Extract-then-index (caption-based): extract figures, generate text captions with a VLM, index captions as text. Choose when you already have a text-based RAG infrastructure you want to extend, when query types are predictable enough that captions will cover them, and when indexing latency is acceptable (captioning is slow).
  2. Embed-then-retrieve (CLIP-based): embed images directly with CLIP, retrieve by text-to-image similarity. Choose when you need cheap, fast indexing, when natural images are the primary content (product photos, scanned documents with photos), and when CLIP's general-purpose embeddings cover your visual domain.
  3. Late interaction (ColPali-style): multi-vector patch-level embeddings, MaxSim retrieval. Choose when document pages have complex visual layouts (technical specs, slides with diagrams), when retrieval precision is critical and you can afford higher index storage and retrieval cost, and when working with visually-rich document types that fail with caption-based approaches.

Q2: How would you handle a PDF where the most important information is in tables?

Tables present a unique challenge because they are neither pure images nor well-represented by their surrounding text. The approaches, from least to most sophisticated:

  1. Table detection + structured extraction: use PyMuPDF to extract table structures, or use a specialized library like pdfplumber or Camelot to parse table cells into structured data. Convert to markdown or CSV and index as text.
  2. Visual table embedding: render tables as images and embed them with CLIP or a table-aware vision model. Retrieve by visual similarity.
  3. VLM captioning with structured output: send table images to Claude or GPT-4V with a prompt asking for a structured description including all values, row/column headers, and key statistics. This produces rich, retrievable text that preserves table semantics.

For production with many tables, (1) is cheapest and works well for cleanly structured tables; (3) is best for complex tables or tables embedded as images.
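
For approach (1), the conversion step is straightforward once the cells are extracted. The sketch below assumes the table arrives as a list of rows with the header first and `None` for empty cells, which is the shape pdfplumber's `extract_tables()` returns. The function name `rows_to_markdown` and the sample material codes are made up for illustration.

```python
def rows_to_markdown(rows):
    """Render a list of rows (first row is the header) as a Markdown table."""
    if not rows:
        return ""

    def clean(cell):
        # Empty cells become "", line breaks collapse, pipes are escaped.
        if cell is None:
            return ""
        return str(cell).replace("\n", " ").replace("|", "\\|").strip()

    header, *body = [[clean(cell) for cell in row] for row in rows]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        row = row + [""] * (len(header) - len(row))  # pad short rows
        lines.append("| " + " | ".join(row[:len(header)]) + " |")
    return "\n".join(lines)


# Hypothetical extracted table; real rows would come from the PDF parser.
table = [
    ["Material code", "Max thickness (mm)"],
    ["A36", "40"],
    ["S355J2", None],
]
markdown = rows_to_markdown(table)
```

The resulting text can be indexed as a single atomic chunk, with metadata marking it as a table, so that every code in the table is searchable.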

Q3: What is ColPali and why does it outperform CLIP-based retrieval for document retrieval?

ColPali (Faysse et al., 2024) improves on CLIP-based retrieval by using multi-vector patch-level embeddings instead of a single global embedding per page. CLIP compresses an entire document page into one vector (512 dimensions for the ViT-B/32 variant) - this averages over all spatial regions and loses the signal from any specific part of the page. ColPali generates one embedding vector per image patch (a 32x32 grid of patches over the resized page), producing roughly 1,000 vectors per page. Retrieval uses MaxSim: for each query token, find the maximum similarity with any document patch, then sum these per-token maxima. This means a query about "Q3 revenue" can match specifically the Q3 column of a table in the bottom-right corner of the page, even though that table represents a small fraction of the page area. On the ViDoRe benchmark, which includes DocVQA and InfoVQA subsets, ColPali outperforms single-vector CLIP-style retrievers. The tradeoff: about 1,000x more vectors per page (each one is low-dimensional, so the byte overhead is smaller than 1,000x but still large) and more expensive retrieval computation. For document corpora where retrieval precision is critical, the accuracy improvement justifies the cost.

Q4: How would you evaluate the quality of a multimodal RAG system?

Evaluation has two components: retrieval quality and generation quality. Retrieval: build an evaluation set of (query, expected_relevant_pages) pairs. Measure Recall@5 (does the expected relevant page appear in top-5?), Precision@5, and MRR. Separately evaluate text retrieval and image/figure retrieval to identify which modality is the bottleneck. Generation: build an evaluation set of (query, expected_answer, source_documents) triples. Measure: answer correctness (does the answer match the expected answer, verified by a judge LLM or human?), grounding fidelity (does the answer accurately reflect the retrieved documents?), and hallucination rate (does the model claim information not present in retrieved content?). For multimodal-specific evaluation: visual answer grounding (when the answer should reference a figure, does it?), image citation accuracy (when a figure is cited, is it the right figure?).

Q5: Design a multimodal RAG system for a large financial institution's research report corpus. The reports contain text, charts, tables, and footnotes with critical numerical data.

Architecture:

  • Parsing - use PyMuPDF for text extraction, plus a document layout model (LayoutLMv3, or a detector trained on the DocLayNet dataset) to produce bounding boxes and classify each region as text, figure, table, or footnote. Table regions go through pdfplumber for structured extraction; figure regions are extracted as images.
  • Indexing - text chunks at 300-500 tokens with 20% overlap. Tables converted to markdown and indexed as text with table-specific metadata. Figures captioned with Claude Sonnet (detail matters in financial charts) and indexed as text. Figure images cached to S3 by document hash + figure position.
  • Retrieval - hybrid retrieval: dense text embeddings + BM25 sparse retrieval for numerical values (BM25 handles "12.3% growth Q2 2024" better than dense embeddings). Figure retrieval by caption embeddings. Reranking with a cross-encoder.
  • Generation - Claude Sonnet as the generator. Retrieved figures passed as images. System prompt instructs the model to cite specific pages/figures and to flag when it is uncertain about numerical values.
  • Post-processing - extract all numbers from the response and verify against source. Flag discrepancies for human review.

The financial context requires conservative hallucination handling - any numerical claim that cannot be traced to a specific source should be flagged rather than passed through.
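
The numeric post-processing check can be sketched with a regular expression. This is a deliberately naive illustration, not a complete verifier: it compares number strings exactly after stripping thousands separators, ignores signs and units, and does not check that a number is attached to the right quantity. The names `extract_numbers` and `unverified_numbers` are hypothetical.

```python
import re

# Integers, decimals, thousands separators, optional trailing percent sign.
# Signs are ignored, so "-5%" and "5%" are treated as the same token.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?%?")


def extract_numbers(text):
    """Set of number strings in the text, with thousands separators removed."""
    return {match.replace(",", "") for match in NUMBER.findall(text)}


def unverified_numbers(answer, sources):
    """Numbers that appear in the answer but in none of the source passages."""
    source_numbers = set()
    for passage in sources:
        source_numbers |= extract_numbers(passage)
    return sorted(extract_numbers(answer) - source_numbers)


# Hypothetical generated answer and retrieved passages.
answer = "Growth was 12.3% in Q2 2024, with an operating margin of 41%."
sources = [
    "Q2 2024 growth: 12.3% year over year.",
    "Operating margin was 14% for the quarter.",
]
flagged = unverified_numbers(answer, sources)
```

Here the transposed margin figure is the only number not found in the sources, so it is the one routed to human review. A production check would also normalize units and scale (for example "1.2bn" vs. "1,200 million") before comparing.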

