What are vision-language models?

How modern AI systems combine vision encoders with language models to understand and reason about images.

Vision-Language Models

Reading time: ~30 min | Interview relevance: Very High | Target roles: ML Engineer, AI Engineer, Research Engineer

The System That Changed How We Thought About Documents

It is 2 AM and your on-call pager fires. A Fortune 500 insurance company has filed a critical support ticket: their document processing pipeline - the one your team deployed three months ago - is silently dropping entire claim files. The logs show successful ingestion but zero extracted fields. You pull up a sample file and immediately see the problem. The claims are scanned PDFs with handwritten annotations, embedded charts showing injury severity, and photos of the damaged property. Your pipeline, built entirely on text-based LLMs, sees every one of these documents as nearly empty. The OCR extracts some boilerplate text from the margins. The rest is invisible.

You spend four hours writing a post-mortem. The root cause is embarrassingly simple: you built a language pipeline for a world that communicates in images. The insurance adjuster filling out that form did not think about which parts were "text" and which were "visual" - she just communicated what she saw. Your model, however, draws a hard boundary at the pixel level. Anything that is not ASCII does not exist.

This is not a contrived scenario. It is the default failure mode of language-only AI applied to real documents. Invoices have tables. Research papers have figures. Medical records have radiology images. Contracts have handwritten signatures and stamps. Customer support tickets have screenshots. The moment you leave the clean world of pure-text corpora and enter the world of how humans actually communicate, you discover that text carries only a fraction of the information.

Vision-Language Models (VLMs) exist to close that gap. They give a language model the ability to perceive images - not as a second-class feature bolted on afterward, but as a first-class input modality processed through an encoder that speaks the same representation language as the text tokens the LLM already understands. By the time an image reaches the LLM's attention mechanism, it looks like a sequence of embeddings, just longer and richer than a text token sequence.

The difference between a VLM and a text-only LLM is essentially the answer to one question: what do you do with pixels before they get to the transformer? The answer to that question is the entire story of this lesson.

Why Language-Only Models Cannot See

A language model is, at its core, a function that maps a sequence of token IDs to a probability distribution over the next token ID. Tokens are integers. Images are tensors with three dimensions: height, width, and color channel. There is no natural embedding table for a raw image the way there is for a word.

Early attempts at "multimodal" models tried to bridge this by converting images to text first - using separate OCR pipelines, captioning models, or structured data extractors that ran before the LLM ever saw the input. This works for simple cases but fails the moment the image contains information that does not reduce cleanly to text: spatial relationships, visual patterns, color gradients, the angle of a handwritten annotation, the structure of a complex diagram.

What we actually want is for the model to have learned, during training, a rich internal representation of visual content that it can reason about using the same mechanisms it uses to reason about text. That requires a vision encoder - a neural network that maps an image to a sequence of dense vectors that live in or near the same embedding space as language tokens.

Historical Context: From ConvNets to VLMs

The path to modern VLMs runs through several distinct eras.

2012-2020: CNN dominance. Convolutional neural networks - AlexNet, VGG, ResNet, EfficientNet - dominated image understanding. They produced global image embeddings: a single vector representing the whole image. These were useful for classification but difficult to align with language, which is inherently sequential and local.

2020: ViT - the breakthrough. Dosovitskiy et al. published "An Image is Worth 16x16 Words" (2020). Their key insight: if you divide an image into a grid of fixed-size patches and flatten each patch into a vector, you get a sequence of patch embeddings. You can then run a standard transformer over that sequence - attention over patches - and achieve state-of-the-art image recognition. The image representation is now a sequence of vectors, not a single global embedding. That is exactly the shape that language models expect.

2021: CLIP. Radford et al. (OpenAI) trained a dual-encoder model - image encoder plus text encoder - using 400 million image-text pairs from the internet. The training objective was contrastive: matched image-text pairs should have similar embeddings; unmatched pairs should have dissimilar embeddings. The result was an image encoder that produced embeddings in a shared semantic space with text. CLIP embeddings became the standard foundation for multimodal models.

2022: Flamingo. Alayrac et al. (DeepMind) published Flamingo, the first model to convincingly demonstrate few-shot visual question answering. Flamingo inserted cross-attention layers into a frozen LLM, allowing text tokens to attend to image features throughout the transformer stack. The visual features came from a vision encoder pretrained with a CLIP-style contrastive objective. The LLM backbone was kept frozen; only the cross-attention parameters and a perceiver resampler were trained.

2023: LLaVA, BLIP-2, InstructBLIP. The open-source explosion. LLaVA (Liu et al.) showed that a simple linear projection layer mapping visual tokens to language embedding space, followed by instruction fine-tuning on a synthetic visual instruction dataset, produced surprisingly capable VLMs - trainable in about a day on a single 8-GPU node rather than weeks on a large cluster. BLIP-2 (Li et al.) introduced the Q-Former: a bottleneck transformer with a fixed number of learned query vectors that attend over image patch embeddings.

2024-2025: Maturation. GPT-4V, Claude 3 Vision, Gemini, InternVL, Qwen-VL, Phi-3-Vision - VLMs became production-grade, multi-image capable, and available via API. Video understanding extended the same paradigm to temporal sequences of frames.

The Vision Encoder: ViT Deep Dive

The Vision Transformer is the standard vision encoder in modern VLMs. Understanding it in detail is essential.

Patch Embedding

Given an image of size $H \times W \times C$ (height, width, channels), ViT divides it into a grid of non-overlapping patches of size $P \times P$. Assuming $H$ and $W$ are both divisible by $P$, this produces $N = \frac{H \times W}{P^2}$ patches.

Each patch is a tensor of shape $P \times P \times C$. It gets flattened to a vector of dimension $P^2 \cdot C$ and then projected through a learned linear layer to the model dimension $D$:

$\mathbf{z}_i = \mathbf{W}_{proj} \cdot \text{flatten}(\text{patch}_i) + \mathbf{b}_{proj}$

The result is a sequence of $N$ patch embeddings, each of dimension $D$. These look exactly like token embeddings to the transformer that follows.
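
The patch embedding step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference ViT implementation: the helper names `patchify` and `embed_patches` are hypothetical, the weights are random rather than trained, and the image is random noise standing in for a real RGB input.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into N flattened patches of length P*P*C."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("H and W must be divisible by the patch size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (N, P*P*C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def embed_patches(patches: np.ndarray, w_proj: np.ndarray, b_proj: np.ndarray) -> np.ndarray:
    """Apply z_i = W_proj . flatten(patch_i) + b_proj to every patch at once."""
    return patches @ w_proj.T + b_proj


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                      # stand-in for a real RGB image
P, D = 16, 768                                         # ViT-B/16 patch size and hidden dim
w_proj = rng.normal(scale=0.02, size=(D, P * P * 3))   # random, untrained weights
b_proj = np.zeros(D)

patches = patchify(image, P)
z = embed_patches(patches, w_proj, b_proj)
print(patches.shape, z.shape)
```

For a 224x224 input with 16x16 patches, both arrays have 196 rows: one flattened patch and one $D$-dimensional embedding per grid cell. Production implementations usually express the same operation as a strided convolution, which is mathematically equivalent.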

CLS Token and Positional Encoding

ViT prepends a special learnable [CLS] token to the patch sequence, following the BERT convention. After passing through the transformer, the CLS token's output embedding represents the entire image - it is used as the global image embedding for classification tasks.

Each patch also receives a learnable 1D positional embedding (or 2D in some variants) so the transformer knows where each patch came from spatially. Without this, the model treats the image as a bag of patches with no spatial structure.
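
As a rough sketch of how the encoder input is assembled, the snippet below prepends a [CLS] vector and adds one positional embedding per position. The function name `build_encoder_input` is hypothetical, and all parameters are random placeholders for values that a real model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 196, 768                                   # ViT-B/16 on a 224x224 image
patch_embeddings = rng.normal(size=(N, D))        # stand-in for projected patches

# Both of these are learned parameters in a real model; random here.
cls_token = rng.normal(scale=0.02, size=(1, D))
pos_embedding = rng.normal(scale=0.02, size=(N + 1, D))  # one row per position, incl. [CLS]


def build_encoder_input(patch_emb: np.ndarray, cls: np.ndarray, pos: np.ndarray) -> np.ndarray:
    """Prepend [CLS], then add a positional embedding to every position."""
    tokens = np.concatenate([cls, patch_emb], axis=0)  # (N + 1, D)
    return tokens + pos


encoder_input = build_encoder_input(patch_embeddings, cls_token, pos_embedding)
print(encoder_input.shape)
```

The sequence grows by one position, so a 196-patch image enters the transformer as 197 vectors. Because the positional rows differ, two identical patches at different grid locations no longer produce identical inputs.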

Standard ViT Configurations

| Model | Patch Size | Layers | Hidden Dim | Heads | Params |
|---|---|---|---|---|---|
| ViT-B/16 | 16x16 | 12 | 768 | 12 | 86M |
| ViT-L/16 | 16x16 | 24 | 1024 | 16 | 307M |
| ViT-H/14 | 14x14 | 32 | 1280 | 16 | 632M |
| ViT-G/14 | 14x14 | 48 | 1664 | 16 | 1.8B |

For a 224x224 image with ViT-B/16: $N = (224/16)^2 = 196$ patches. For a 336x336 image with patch size 14: $N = (336/14)^2 = 576$ patches. Image resolution and patch size directly determine how many visual tokens enter the LLM.
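
The arithmetic above is easy to wrap in a helper for quick estimates. The function name `num_visual_tokens` is an assumption for illustration, and it counts raw patches only: it ignores the [CLS] token and any compression a particular VLM applies afterwards.

```python
def num_visual_tokens(height: int, width: int, patch_size: int) -> int:
    """Number of non-overlapping patches a ViT produces for an image."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


for h, w, p in [(224, 224, 16), (336, 336, 14)]:
    print(f"{h}x{w} with {p}x{p} patches -> {num_visual_tokens(h, w, p)} tokens")
```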

The Alignment Challenge

A ViT trained on ImageNet classification learns to produce image embeddings that are useful for classifying objects. A language model learns to produce token embeddings that are useful for predicting the next token in text. These two embedding spaces are not the same. The core technical challenge of building a VLM is aligning the visual representation space with the language representation space so the LLM can meaningfully process visual tokens.

Three main architectures address this challenge differently.

Three Architecture Patterns

Pattern 1: Cross-Attention Fusion (Flamingo)

Flamingo keeps the LLM frozen and interleaves new gated cross-attention layers between the existing LLM layers. The image is encoded by a contrastively pretrained vision encoder (an NFNet in the original paper, trained with a CLIP-style objective), and a perceiver resampler compresses the resulting features to a fixed set of 64 visual tokens. In each new cross-attention layer, the text hidden states attend to those visual tokens.

Strengths: Preserves the LLM's language capabilities exactly. Can inject visual information at every layer. Scales naturally to multiple images.

Weaknesses: The cross-attention layers add parameters and inference cost. The perceiver resampler loses spatial detail by compressing to 64 vectors.
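
A simplified, single-head sketch of the gated cross-attention idea follows. It is an illustration under assumptions rather than Flamingo's actual layer: the function name `gated_cross_attention` is hypothetical, weights are random, and the feed-forward sublayer, multi-head split, and layer norms are omitted. The tanh gate on the residual reflects the paper's design of starting the gate at zero so the frozen LLM's behavior is unchanged at initialization.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_cross_attention(text_h, visual, w_q, w_k, w_v, gate: float) -> np.ndarray:
    """Text hidden states (T, D) attend over visual tokens (M, D)."""
    q = text_h @ w_q                              # queries come from the text side
    k = visual @ w_k                              # keys and values come from the image side
    v = visual @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (T, M)
    attended = softmax(scores) @ v                # (T, D)
    return text_h + np.tanh(gate) * attended      # gated residual connection


rng = np.random.default_rng(0)
D = 64                                            # toy hidden size
text_h = rng.normal(size=(10, D))                 # hidden states for 10 text tokens
visual = rng.normal(size=(64, D))                 # 64 visual tokens from the resampler
w_q, w_k, w_v = (rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3))

out_init = gated_cross_attention(text_h, visual, w_q, w_k, w_v, gate=0.0)
out_open = gated_cross_attention(text_h, visual, w_q, w_k, w_v, gate=1.0)
print(np.allclose(out_init, text_h), out_open.shape)
```

With the gate at zero the layer is an identity on the text stream, which is why the language capabilities of the frozen LLM are preserved at the start of training.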

Pattern 2: Projection Layer (LLaVA)

LLaVA takes a much simpler approach. Visual tokens from a CLIP ViT are projected into the LLM's embedding space using a learned MLP (or even just a single linear layer in v1). The projected visual tokens are concatenated with text token embeddings to form a single sequence that the unmodified LLM processes.

This approach is surprisingly effective. The LLM's attention mechanism freely attends over both visual and text tokens without any architectural modification. The entire training cost is the projection layer parameters plus fine-tuning the LLM.

LLaVA training recipe:

Stage 1 - feature alignment: freeze the vision encoder and the LLM; train only the projection layer on 595K image-text pairs (CC3M filtered). The goal is to learn the projection.
Stage 2 - visual instruction tuning: keep training the projection layer and unfreeze the LLM (the vision encoder stays frozen). Train on roughly 158K GPT-4-generated visual instruction-following examples.

Training time for LLaVA-1.5 (Vicuna 13B): about 1 day on 8xA100s. This was a watershed moment - competitive VLM training became accessible.
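
The projection pattern can be sketched as follows. This is a toy illustration, not the LLaVA codebase: the dimensions are deliberately small, the weights are random, and the names `project_visual_tokens` and `build_llm_input` are hypothetical. The two-layer MLP with GELU mirrors the LLaVA-1.5 style of projector; the original v1 used a single linear layer.

```python
import numpy as np


def gelu(x: np.ndarray) -> np.ndarray:
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def project_visual_tokens(visual, w1, b1, w2, b2) -> np.ndarray:
    """Two-layer MLP mapping vision features (N, d_vision) to (N, d_llm)."""
    return gelu(visual @ w1 + b1) @ w2 + b2


def build_llm_input(visual_proj: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Concatenate projected visual tokens and text embeddings into one sequence."""
    return np.concatenate([visual_proj, text_emb], axis=0)


rng = np.random.default_rng(0)
n_visual, d_vision, d_llm, n_text = 576, 64, 128, 5   # toy sizes, not real model dims
visual = rng.normal(size=(n_visual, d_vision))        # stand-in for ViT patch outputs
text_emb = rng.normal(size=(n_text, d_llm))           # stand-in for text token embeddings

w1 = rng.normal(scale=0.02, size=(d_vision, d_llm))
b1 = np.zeros(d_llm)
w2 = rng.normal(scale=0.02, size=(d_llm, d_llm))
b2 = np.zeros(d_llm)

visual_proj = project_visual_tokens(visual, w1, b1, w2, b2)
llm_input = build_llm_input(visual_proj, text_emb)
print(llm_input.shape)
```

The LLM receives one sequence of 581 vectors here and needs no architectural change: its self-attention treats the 576 projected visual positions like any other tokens.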

Pattern 3: Q-Former (BLIP-2)

BLIP-2 introduces a lightweight query transformer - Q-Former - as a bottleneck between the vision encoder and the LLM. The Q-Former has a fixed set of $N_q$ learnable query vectors (typically 32). These queries attend over the full set of image patch tokens via cross-attention. The output of the Q-Former is $N_q$ vectors - a fixed-length, compressed visual representation regardless of the image resolution.

$\text{Visual tokens for LLM} = \text{Q-Former}(\mathbf{Q}_{learned}, \text{ViT-patches})$

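This design is efficient: the LLM only ever sees 32 tokens per image, not 256 or 576. The Q-Former learns to distill the most relevant visual information into those 32 positions. In the original BLIP-2 the queries passed to the LLM are not conditioned on the user's prompt; InstructBLIP later made the Q-Former instruction-aware. The tradeoff is that some spatial detail is lost.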
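
The fixed-length property is the key point, and a single cross-attention step is enough to show it. The sketch below is a heavy simplification under assumptions: a real Q-Former stacks several transformer layers with self-attention among the queries, whereas this uses one single-head cross-attention with random weights, and the name `query_cross_attention` is hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def query_cross_attention(queries, patches, w_q, w_k, w_v) -> np.ndarray:
    """Learned queries (N_q, D) attend over image patches (N, D) and return (N_q, D)."""
    q = queries @ w_q
    k = patches @ w_k
    v = patches @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N_q, N), each row sums to 1
    return weights @ v


rng = np.random.default_rng(0)
D, N_Q = 64, 32                                    # toy width; 32 queries as in the text
learned_queries = rng.normal(size=(N_Q, D))        # trained parameters in a real model
w_q, w_k, w_v = (rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3))

output_shapes = {}
for n_patches in (256, 576):                       # two different input resolutions
    patch_tokens = rng.normal(size=(n_patches, D)) # stand-in for ViT outputs
    out = query_cross_attention(learned_queries, patch_tokens, w_q, w_k, w_v)
    output_shapes[n_patches] = out.shape
print(output_shapes)
```

Whether the encoder emits 256 or 576 patch tokens, the output always has 32 rows, because the output length is set by the number of queries rather than by the number of patches.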

VLM Architecture Comparison

Image Tokenization: How Many Tokens?

This question matters a great deal in production because image tokens consume context window space and increase cost.

| Model / Architecture | Image Tokens |
|---|---|
| LLaVA-1.5 (336x336 input) | 576 tokens |
| BLIP-2 (Q-Former) | 32 tokens |
| Flamingo (perceiver resampler) | 64 tokens |
| Claude 3 Haiku (standard image) | ~1,600 tokens |
| Claude 3 Sonnet (standard image) | ~1,600 tokens |
| GPT-4V (1024x1024, high detail) | ~765 tokens |
| GPT-4V (high-res tile) | up to 2,048 tokens |
| InternVL2 (high resolution) | 1,024-4,096 tokens |

High-resolution support is a critical frontier. Many real-world images - technical diagrams, documents, screenshots - lose crucial detail when downsampled to 336x336. Modern VLMs use dynamic resolution strategies: tile the image into sub-images, process each independently, then concatenate the visual tokens. This is how GPT-4V achieves "high detail" mode and how InternVL2 handles 4K inputs.
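
To get a feel for how tiling drives token counts, here is a back-of-the-envelope estimator. The scheme is hypothetical: it assumes square 336x336 tiles at 576 tokens each plus one optional downsampled overview image, which resembles the general idea but does not reproduce the exact grid-selection rules of GPT-4V, LLaVA-NeXT, or InternVL2. The function name `tiled_token_count` is an assumption.

```python
import math


def tiled_token_count(
    width: int,
    height: int,
    tile_size: int = 336,
    tokens_per_tile: int = 576,
    include_thumbnail: bool = True,
) -> int:
    """Estimate visual tokens when an image is cut into fixed-size tiles."""
    cols = math.ceil(width / tile_size)
    rows = math.ceil(height / tile_size)
    tiles = cols * rows
    if include_thumbnail:
        tiles += 1  # one extra low-resolution view of the whole image
    return tiles * tokens_per_tile


for size in [(336, 336), (672, 672), (1000, 700)]:
    print(size, tiled_token_count(*size))
```

Under these assumptions, doubling each side of the image quadruples the tile count, which is why high-detail modes are priced so differently from low-detail ones.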

Modern VLMs: The Current Landscape

GPT-4V / GPT-4o (OpenAI): Closed model, available via API. GPT-4o adds native audio support. Excellent at reading text in images, spatial reasoning, diagram interpretation.

Claude 3 Vision (Anthropic): Available in Haiku, Sonnet, and Opus tiers. Strong at document understanding, reading dense text, multi-image analysis. Up to 20 images per API call.

Gemini 1.5 Pro (Google): Native multimodal from the ground up. 1M token context window supports video as sequences of frames. Strong at long-video understanding.

LLaVA-1.5 / LLaVA-NeXT (open source): The go-to open-source VLM family. LLaVA-NeXT adds dynamic resolution. Runs on a single A100 at 7B or 13B scale.

InternVL2 (Shanghai AI Lab): Strong open-source alternative, competitive with proprietary models on benchmarks. Available at 2B through 40B.

Qwen-VL (Alibaba): Strong at Chinese and multilingual image understanding. Available open-source.

Phi-3-Vision (Microsoft): Compact 4.2B model with strong vision capabilities. Deployable on edge devices.

Code: Vision Q&A with Claude API

import anthropic
import base64
from pathlib import Path


def encode_image_base64(image_path: str) -> tuple[str, str]:
    """Encode a local image file to base64."""
    path = Path(image_path)
    suffix = path.suffix.lower()
    media_type_map = {
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".png": "image/png",
        ".gif": "image/gif",
        ".webp": "image/webp",
    }
    media_type = media_type_map.get(suffix, "image/jpeg")
    with open(path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return data, media_type


def analyze_image(image_path: str, question: str) -> str:
    """Ask Claude a question about a local image."""
    client = anthropic.Anthropic()

    image_data, media_type = encode_image_base64(image_path)

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": image_data,
                        },
                    },
                    {
                        "type": "text",
                        "text": question,
                    },
                ],
            }
        ],
    )
    return message.content[0].text


def analyze_image_url(image_url: str, question: str) -> str:
    """Ask Claude a question about an image at a URL."""
    client = anthropic.Anthropic()

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "url",
                            "url": image_url,
                        },
                    },
                    {
                        "type": "text",
                        "text": question,
                    },
                ],
            }
        ],
    )
    return message.content[0].text


def multi_image_analysis(images: list[str], question: str) -> str:
    """Analyze multiple images in a single call."""
    client = anthropic.Anthropic()

    content = []
    for image_path in images:
        image_data, media_type = encode_image_base64(image_path)
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": image_data,
            },
        })

    content.append({
        "type": "text",
        "text": question,
    })

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=[{"role": "user", "content": content}],
    )
    return message.content[0].text


# Example usage
if __name__ == "__main__":
    # Single image from URL
    result = analyze_image_url(
        "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png",
        "Describe what you see in this image in detail.",
    )
    print("URL Image Analysis:")
    print(result)

    # Token counting for cost estimation
    client = anthropic.Anthropic()
    response = client.messages.count_tokens(
        model="claude-3-5-sonnet-20241022",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "url",
                            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png",
                        },
                    },
                    {"type": "text", "text": "Describe this image."},
                ],
            }
        ],
    )
    print(f"Token count for this request: {response.input_tokens}")

Code: Open-Source VLM with HuggingFace (LLaVA)

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests
from io import BytesIO


def load_llava_model(model_id: str = "llava-hf/llava-v1.6-mistral-7b-hf"):
    """Load LLaVA-NeXT model and processor."""
    print(f"Loading {model_id}...")
    processor = LlavaNextProcessor.from_pretrained(model_id)
    model = LlavaNextForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        device_map="auto",
    )
    print("Model loaded.")
    return processor, model


def load_image(source: str) -> Image.Image:
    """Load image from URL or local path."""
    if source.startswith("http"):
        response = requests.get(source, timeout=10)
        response.raise_for_status()  # fail loudly on 4xx/5xx instead of decoding an error page
        return Image.open(BytesIO(response.content)).convert("RGB")
    else:
        return Image.open(source).convert("RGB")


def run_vqa(
    processor,
    model,
    image_source: str,
    question: str,
    max_new_tokens: int = 512,
) -> str:
    """Run visual question answering with LLaVA."""
    image = load_image(image_source)

    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question},
            ],
        }
    ]

    prompt = processor.apply_chat_template(
        conversation, add_generation_prompt=True
    )

    inputs = processor(
        images=image,
        text=prompt,
        return_tensors="pt",
    ).to(model.device, torch.float16)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
        )

    generated_ids = output_ids[:, inputs.input_ids.shape[1]:]
    output = processor.batch_decode(
        generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    return output[0].strip()


def inspect_visual_tokens(processor, image_source: str) -> dict:
    """Inspect how many visual tokens an image produces."""
    image = load_image(image_source)

    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]
    prompt = processor.apply_chat_template(
        conversation, add_generation_prompt=True
    )
    inputs = processor(images=image, text=prompt, return_tensors="pt")

    total_tokens = inputs.input_ids.shape[1]
    text_tokens = len(processor.tokenizer.encode(prompt.replace("<image>", "")))
    visual_tokens = total_tokens - text_tokens

    return {
        "image_size": image.size,
        "total_input_tokens": total_tokens,
        "estimated_visual_tokens": visual_tokens,
    }


if __name__ == "__main__":
    processor, model = load_llava_model()

    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"

    answer = run_vqa(
        processor,
        model,
        image_url,
        "What animal is in the image? What color is it? Is it indoors or outdoors?",
    )
    print("LLaVA Answer:", answer)

    token_info = inspect_visual_tokens(processor, image_url)
    print("Token info:", token_info)

Production Engineering Notes

Image Preprocessing Pipeline

Before an image reaches the VLM, it goes through preprocessing that significantly affects both quality and cost:

from PIL import Image
import io
import base64
from typing import Optional


def preprocess_image_for_vlm(
    image_path: str,
    max_size: int = 1568,
    target_format: str = "JPEG",
    quality: int = 85,
    max_bytes: Optional[int] = 5 * 1024 * 1024,
) -> tuple[str, str, dict]:
    """
    Preprocess an image for VLM consumption.
    Returns: (base64_data, media_type, metadata)
    """
    img = Image.open(image_path).convert("RGB")
    original_size = img.size

    # Resize if larger than max dimension
    max_dim = max(img.size)
    if max_dim > max_size:
        scale = max_size / max_dim
        new_size = (int(img.size[0] * scale), int(img.size[1] * scale))
        img = img.resize(new_size, Image.LANCZOS)

    # Compress to stay under byte limit
    buffer = io.BytesIO()
    img.save(buffer, format=target_format, quality=quality, optimize=True)

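    if max_bytes and buffer.tell() > max_bytes:
        # Retry at progressively lower quality until the payload fits.
        for q in [75, 65, 55, 45]:
            buffer = io.BytesIO()
            img.save(buffer, format=target_format, quality=q, optimize=True)
            if buffer.tell() <= max_bytes:
                break
        else:
            # No quality level fit; do not silently return an oversized payload.
            raise ValueError(
                f"Could not compress {image_path} below {max_bytes} bytes"
            )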

    buffer.seek(0)
    image_bytes = buffer.read()
    b64_data = base64.standard_b64encode(image_bytes).decode("utf-8")
    media_type = f"image/{target_format.lower()}"

    metadata = {
        "original_size": original_size,
        "processed_size": img.size,
        "bytes": len(image_bytes),
        "format": target_format,
    }

    return b64_data, media_type, metadata

Resolution Trade-offs

| Resolution | Visual Tokens (est., 14x14 patches, no compression) | Context Cost | Detail Level |
|---|---|---|---|
| 336x336 | 576 | Low | Suitable for general scenes |
| 672x672 | ~2,304 | Medium | Documents with medium font |
| 1024x768 | ~4,000 | High | Technical diagrams |
| 1568x1568 | ~12,500 | Very High | Dense text, small figures |

The sweet spot for most document understanding tasks is 672-1024px. Going higher is expensive: token count grows with the square of the side length, while legibility improves only roughly linearly.

Cost Estimation

At Claude 3.5 Sonnet pricing (as of early 2025): input tokens cost roughly $3 per million tokens. A standard image at 1,600 tokens costs about $0.0048 in vision tokens alone. At 1,000 images/day that is $4.80/day - manageable. At 1 million images/day it becomes $4,800/day in vision input alone. Plan for this.
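
The same arithmetic as a small helper, useful for plugging in your own volumes. The function name `daily_image_cost` is hypothetical, and the defaults are simply the figures quoted above; actual prices and per-image token counts should be checked against current provider documentation.

```python
def daily_image_cost(
    images_per_day: int,
    tokens_per_image: int = 1600,
    usd_per_million_tokens: float = 3.0,
) -> float:
    """Estimated daily spend on vision input tokens, in USD."""
    return images_per_day * tokens_per_image * usd_per_million_tokens / 1_000_000


for volume in (1, 1_000, 1_000_000):
    print(f"{volume:>9,} images/day -> ${daily_image_cost(volume):,.4f}")
```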

Multi-Image Workflows

Many real workflows require multiple images: before/after comparison, product variants, document pages, video frame sequences. Key considerations:

Context window limits: 20 images per Claude call, each consuming 1,600+ tokens. A 20-image call uses ~32,000+ tokens just in images.
Order matters: VLMs process images in order. Put the reference image first, comparison images second.
Explicit references: Do not assume the model tracks "Image 1" vs "Image 2" without labels. Include [Image 1] markers in your prompt.
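
One way to apply the ordering and labeling advice is to interleave a short text label before each image block. The sketch below only builds the `content` list in the same shape used by the Claude examples earlier in this lesson; it makes no API call. The function name `build_labeled_content` and the placeholder base64 strings are assumptions for illustration.

```python
def build_labeled_content(images: list[tuple[str, str]], question: str) -> list[dict]:
    """Build a content list with an explicit [Image N] label before each image.

    `images` holds (base64_data, media_type) pairs, reference image first.
    """
    content: list[dict] = []
    for index, (data, media_type) in enumerate(images, start=1):
        content.append({"type": "text", "text": f"[Image {index}]"})
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": data},
        })
    content.append({"type": "text", "text": question})
    return content


content = build_labeled_content(
    [("AAAA", "image/jpeg"), ("BBBB", "image/png")],  # placeholder payloads
    "What changed between [Image 1] and [Image 2]?",
)
print([block["type"] for block in content])
```

The resulting list can be passed as the `content` of a single user message, so the question can refer to each image by its label.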

:::tip Resolution Strategy

For documents with mixed content - some pages text-heavy, some diagram-heavy - use dynamic resolution: detect which pages contain figures and send those at higher resolution, text-only pages at lower resolution. This can cut token cost by roughly 30-50% on typical reports.

:::

Common Mistakes

:::danger Sending Oversized Images Without Preprocessing

Sending a 4K photograph (3840x2160) directly to a VLM API can consume an enormous number of tokens and may hit API limits. Always resize to the model's effective maximum resolution before sending. Most of the extra pixels in a 4K photo contribute little to VLM understanding, because the image is downsampled or tiled to the model's working resolution anyway, but they cost real money.

:::

:::danger Trusting VLM Output for Critical Text Extraction

VLMs can hallucinate text they "read" from images. For critical use cases - extracting invoice amounts, contract clauses, medical dosages - always combine VLM understanding with deterministic OCR. Use the VLM for comprehension and structure; use OCR for verbatim extraction. Cross-validate the two outputs.

:::
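
A minimal sketch of the cross-validation step, assuming both the VLM answer and the OCR output are already available as plain strings. The helper names `extract_amounts` and `unverified_amounts` are hypothetical, and the regular expression only handles simple numbers with comma thousands separators; real invoices need locale-aware parsing.

```python
import re


def extract_amounts(text: str) -> set[str]:
    """Pull out number-like strings, normalised by stripping thousands separators."""
    return {m.replace(",", "") for m in re.findall(r"\d[\d,]*(?:\.\d+)?", text)}


def unverified_amounts(vlm_text: str, ocr_text: str) -> set[str]:
    """Numbers the VLM reported that do not appear anywhere in the OCR output."""
    return extract_amounts(vlm_text) - extract_amounts(ocr_text)


ocr_text = "TOTAL DUE 1250.00 ITEMS 3"
good = unverified_amounts("Total due: $1,250.00 for 3 items", ocr_text)
bad = unverified_amounts("Total due: $1,520.00 for 3 items", ocr_text)
print(good, bad)
```

Any value left in the result set is a candidate hallucination or OCR miss and should be routed to a stricter check or a human reviewer rather than accepted automatically.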

:::warning Forgetting Image Token Cost in Context Window Planning

It is easy to design a system that works fine in testing (1-2 images) and runs out of context window in production (10-15 images from a multi-page document). Budget image tokens explicitly. A simple rule: treat each image as consuming 2,000 tokens when estimating whether a request fits in context.

:::
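
The rule of thumb translates directly into a pre-flight check. This is a sketch under assumptions: the function name `fits_in_context`, the 1,024-token output reservation, and the example context sizes are illustrative choices, not provider limits.

```python
def fits_in_context(
    num_images: int,
    text_tokens: int,
    context_window: int,
    tokens_per_image: int = 2000,     # conservative per-image budget from the rule above
    reserved_for_output: int = 1024,  # assumed headroom for the model's reply
) -> bool:
    """Return True if the request is expected to fit in the context window."""
    needed = num_images * tokens_per_image + text_tokens + reserved_for_output
    return needed <= context_window


print(fits_in_context(2, 500, 32_000))    # small test request
print(fits_in_context(16, 500, 32_000))   # multi-page production request
```

Running a check like this before every call turns a silent production failure into an explicit decision point: split the document, lower the resolution, or choose a model with a larger window.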

:::warning Sending PNG When JPEG Would Do

PNG is lossless but, for photographs, typically 5-10x larger than JPEG at equivalent visual quality. Most VLM use cases do not require lossless compression. Convert to JPEG at quality 80-90 before encoding. Exception: images with text on solid backgrounds (screenshots, diagrams) can deteriorate visibly at high JPEG compression - use PNG for those or quality 95+.

:::

Interview Questions and Answers

Q1: Walk me through how a VLM processes an image. What happens at each stage?

An image is first resized to the encoder's input resolution and divided into fixed-size patches (e.g., 14x14 or 16x16 pixels). Each patch is flattened and projected through a linear layer to the model's hidden dimension, producing one patch embedding per patch. A learnable [CLS] token is prepended and learned 1D positional embeddings are added to all positions. This sequence of patch embeddings goes through a standard transformer - the Vision Transformer (ViT) - which uses self-attention to build contextually-aware representations of each patch. The output patch embeddings are then passed through an alignment module - either a linear projection (LLaVA), a Q-Former (BLIP-2), or cross-attention layers (Flamingo) - that makes them usable by the language model. In projection and Q-Former designs, the resulting visual token embeddings are concatenated with the text token embeddings and the full sequence goes through the LLM; in Flamingo-style designs, the text tokens instead attend to the visual features through the inserted cross-attention layers.

Q2: What is the alignment problem in VLMs and how do different architectures solve it?

The alignment problem is that a vision encoder trained on image classification tasks produces embeddings in a visual semantic space, while an LLM operates in a linguistic semantic space. These two spaces are not naturally compatible. Three approaches solve this: (1) Linear projection (LLaVA): a learned MLP maps visual embeddings to the LLM's embedding dimension. Simple, effective, trained with instruction fine-tuning. (2) Q-Former (BLIP-2): learned query vectors attend over image patches and produce a fixed-length representation. Efficient but loses spatial detail. (3) Cross-attention fusion (Flamingo): new cross-attention layers inserted into a frozen LLM let text tokens attend over image features at every layer. Preserves LLM capability but adds architectural complexity.

Q3: How many tokens does an image consume in a VLM, and why does it matter?

Token count depends on image resolution and the VLM architecture. For a 336x336 image with 14x14 patches: $(336/14)^2 = 576$ patches before any compression. With a Q-Former it might be compressed to 32 tokens; with a direct projection (LLaVA) it remains 576 tokens; Claude processes standard images at roughly 1,600 tokens. This matters because cost scales linearly with tokens, context window consumption limits how many images you can send per call, and inference latency scales with total sequence length.

Q4: Compare the Flamingo, LLaVA, and BLIP-2 architectures. When would you choose each?

Flamingo: Cross-attention fusion at every LLM layer. Best for few-shot visual tasks and when preserving LLM language quality is paramount. Production cost is higher due to extra cross-attention parameters.

LLaVA: Simple projection layer, minimal architecture change. Best when you want an open-source model you can fine-tune for a specific visual domain on limited compute. Fast to train, easy to serve.

BLIP-2: Q-Former bottleneck gives a compact fixed-length representation (32 tokens). Best when you need to minimize context window consumption or are working with a very large LLM. The Q-Former is lightweight but loses spatial detail.

For production API usage (Claude, GPT-4V) you do not control the architecture - focus on resolution, prompting, and preprocessing.

Q5: What is the difference between how a VLM "sees" an image and how a human sees an image?

A VLM processes images through a fixed resolution, patch-based tokenizer that converts pixels to embeddings without any concept of focus, saccades, or attention in the biological sense. It sees all parts of the image simultaneously at uniform resolution - it does not "look" at interesting areas first the way humans do. Humans have a fovea providing high-resolution central vision and lower-resolution peripheral vision; we make rapid eye movements (saccades) to bring areas of interest into focus. VLMs have no equivalent of foveal processing at inference time - everything is treated uniformly. VLMs also lack the embodied priors humans use for 3D depth perception, physical intuition, and causal reasoning from visual cues. On the other hand, VLMs have seen vastly more images during training than any human and can pattern-match across a much larger visual vocabulary.

Q6: How would you fine-tune a VLM for a domain-specific task like medical imaging?

The standard approach follows the LLaVA two-stage recipe, adapted for your domain. Stage 1: keep the ViT and LLM frozen and train only the projection layer on a dataset of medical image-caption pairs (radiology reports, pathology descriptions). This aligns the medical visual domain to the language space. Stage 2: keep training the projection layer and also adapt the LLM, fine-tuning on a curated dataset of medical visual QA pairs. Use LoRA on the LLM to reduce trainable parameters and limit catastrophic forgetting. Key data considerations: medical images (X-rays, MRIs, histology slides) have very different statistics from natural images - the ViT may need unfreezing too if it was only trained on natural images. Evaluate on held-out cases with clinician review, not just automated metrics. Always validate hallucination rates carefully - in medical contexts hallucinated findings are dangerous.
