
Multimodal Open Source Models

The Document Processing Team That Changed Its Entire Stack

In early 2024, the data engineering team at a mid-size insurance company was spending roughly 40 engineer-hours per week on a single pipeline: extracting structured information from scanned claim forms, handwritten medical records, and PDF attachments. Their stack was a brittle combination of Tesseract OCR, a rules-based field extractor, and a human review queue that handled everything the rules missed. On a good week, the automation handled 72% of documents end-to-end. On a bad week, after a new form type appeared or a batch of low-quality scans arrived, it dropped to 45%.

The lead engineer, Tariq, had been watching the LLaVA paper with interest since it appeared on arXiv in 2023. He was skeptical. His OCR pipeline was fast, deterministic, and auditable - the kind of system that insurance regulators could understand and approve. A neural network that looked at images and returned JSON seemed like the wrong tool: opaque, unpredictable, hard to explain to a compliance team.

Tariq ran a two-week experiment anyway. He took 500 claim forms that had required human review in the previous month - the hardest 500, the ones the rules-based system failed on. He ran each through LLaVA-1.5 with a structured extraction prompt. The model correctly extracted all required fields on 71% of documents on the first try, without any fine-tuning, without any form-specific rules, without months of engineering work. It was not better than the fully-tuned Tesseract pipeline on easy documents, but it was dramatically better on hard ones: rotated scans, handwritten annotations, mixed-language forms, and novel form layouts the rules had never seen.

By the end of the quarter, Tariq's team had replaced the human review queue entirely for one document category, reducing processing time from 4 hours to 8 minutes. The compliance team required three months of output auditing before approving the model for production - a legitimate requirement that Tariq planned for. But the fundamental insight was locked in: vision-language models had crossed a threshold where they could handle the hard tail of real-world document processing that rules-based OCR could not.

This lesson explains how vision-language models work from the architecture up, why the open-source ecosystem has advanced so rapidly since 2023, and what you need to know to deploy these models for practical document understanding, image analysis, and visual reasoning tasks in production.


Why This Exists - The Gap Between Vision and Language

The Pre-Multimodal Problem

For most of the 2010s, computer vision and natural language processing were entirely separate fields. Vision models (ResNet, EfficientNet, DINO) produced fixed-dimensional embeddings from images. Language models produced text. Connecting them required custom architectures for each specific task: image captioning, visual question answering, and image-text retrieval all used different model families with different training procedures.

This specialization created a fragmented ecosystem. Building a system that needed to understand an image AND answer questions about it AND generate structured output from it required combining three separate models, each with its own fine-tuning regime, serving infrastructure, and failure mode. The engineering overhead was high. The brittleness was higher.

More fundamentally: the vision models and language models did not share a representation space. A vision model producing a 2048-dimensional embedding had no natural way to communicate with a language model that operated on token sequences. Every attempt to connect them required a custom "glue" layer, and that glue was usually the most fragile part of the system.

What Changed: Large-Scale Contrastive Pre-training

The first major breakthrough was CLIP (Contrastive Language-Image Pre-training, Radford et al., OpenAI, 2021). CLIP trained a vision encoder and a text encoder jointly on 400 million image-text pairs scraped from the web. The training objective was contrastive: given a batch of image-text pairs, the model learns to maximize the similarity between matched pairs and minimize similarity between mismatched pairs.

$$
\mathcal{L}_{\text{CLIP}} = -\frac{1}{N}\sum_{i=1}^{N} \left[\log \frac{\exp(\text{sim}(v_i, t_i) / \tau)}{\sum_{j=1}^{N}\exp(\text{sim}(v_i, t_j) / \tau)} + \log \frac{\exp(\text{sim}(t_i, v_i) / \tau)}{\sum_{j=1}^{N}\exp(\text{sim}(t_j, v_i) / \tau)}\right]
$$

where $v_i$ and $t_i$ are the $i$-th image and text embeddings, $\text{sim}$ is cosine similarity, and $\tau$ is a learned temperature parameter.
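
A minimal PyTorch sketch of this symmetric objective may help make it concrete. Variable names are illustrative, and this simplification fixes the temperature rather than learning it as CLIP does:

```python
import torch
import torch.nn.functional as F


def clip_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of N matched image-text pairs."""
    # L2-normalize so dot products are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # N x N similarity matrix: entry (i, j) = sim(v_i, t_j) / tau
    logits = image_emb @ text_emb.T / temperature

    # Matched pairs sit on the diagonal
    targets = torch.arange(len(image_emb), device=image_emb.device)

    # Image-to-text and text-to-image cross-entropy, averaged
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```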

The result was a vision encoder that produced embeddings in the same semantic space as text. An image of a dog produced embeddings close to the text "a photo of a dog" in the shared representation space. This was the key that unlocked multimodal reasoning: once vision and language share a representation space, connecting them becomes a much simpler problem.

The Instruction-Following Gap

CLIP produced aligned representations, but it was not a generative model. You could use it to retrieve images matching a text query, but you could not ask it to describe an image, answer questions about it, or extract structured information from it. That required combining the CLIP vision encoder with a large generative language model.

The problem: language models expect token sequences as input. CLIP produces continuous embedding vectors. How do you bridge this interface? This is the core architectural question that vision-language models (VLMs) answer, and the different approaches taken by different model families represent genuine technical tradeoffs with production implications.


Historical Context - The LLaVA Moment

Visual Instruction Tuning (Liu et al., 2023)

The paper that kicked off the open-source VLM explosion was LLaVA: Visual Instruction Tuning (Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee, 2023). The "aha moment" was the realization that you did not need a sophisticated new architecture to create a capable vision-language model. You needed three things:

  1. A frozen CLIP vision encoder (ViT-L/14 from OpenAI)
  2. A simple linear projection layer to map vision embeddings into the language model's token space
  3. A language model (LLaMA) fine-tuned on synthetic visual instruction following data

The synthetic data generation was the clever part. Liu et al. used GPT-4 to generate instruction-following conversations about images. They gave GPT-4 the image captions and bounding box annotations from COCO (not the images themselves, because GPT-4 Vision was not publicly available at the time), and asked GPT-4 to generate questions and answers that a visual assistant might produce. The resulting 158K instruction-following samples, combined with 595K image-caption pairs, were enough to train a capable VLM from LLaMA-7B.

LLaVA achieved surprising performance on visual reasoning benchmarks despite the minimal architecture. The key insight: the CLIP vision encoder already understood images. The language model already understood instructions. The only thing needed was a small learned interface between them - the linear projection layer had roughly 4 million parameters, negligible compared to the models on either side of it.

The Rapid Iteration: LLaVA-1.5 and Beyond

LLaVA-1.5 (Liu et al., late 2023) replaced the linear projection with a 2-layer MLP, upgraded to CLIP ViT-L/14@336px for higher resolution, and used Vicuna-13B as the base language model. These changes plus improved training data produced a model that outperformed many proprietary models on standard VQA benchmarks while training in under a week on 8 A100s.

The broader lesson: the barrier to building capable VLMs had dropped dramatically. The expensive parts - the CLIP encoder and the LLM - were both available as open-source pretrained models. The novel training required only a projection layer and a relatively small instruction-tuning dataset.

This triggered a wave of open-source VLMs in 2024: LLaVA-NeXT (higher resolution, better instruction following), InternVL (stronger vision backbone), Idefics2 (better document understanding), Qwen2-VL (dynamic resolution and video), and ultimately LLaMA 3.2 Vision (Meta integrating vision natively into the LLaMA 3.2 family at 11B and 90B scales).


Core Architecture: How Vision-Language Models Work

The Three-Component Architecture

Every modern open-source VLM follows the same basic structure:

  1. Vision Encoder: A transformer-based vision model (usually CLIP ViT or SigLIP) that processes the input image and produces a sequence of visual feature vectors, one per image patch.

  2. Projection Layer: A learned network (linear layer, MLP, or cross-attention module) that maps visual feature vectors into the same dimensionality as the language model's token embeddings.

  3. Language Model Decoder: A standard autoregressive LLM (LLaMA, Mistral, Qwen, InternLM) that receives a combined sequence of visual tokens (from the projection layer) and text tokens (from the tokenizer), and generates text output autoregressively.

How Image Tokens Are Created

The vision encoder processes images as sequences of patches. For ViT-L/14 at 336px input resolution: the image is divided into $336 / 14 = 24$ patches per dimension, giving $24 \times 24 = 576$ patches total. Each patch is encoded as a 1024-dimensional vector, producing a $576 \times 1024$ feature matrix.

The projection layer maps each of these 576 vectors to the LLM's embedding dimension (e.g., 4096 for a 7B model). The result is 576 visual tokens that the language model treats identically to text tokens in its attention mechanism.
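
Here is a shape-level sketch of that step, using a LLaVA-1.5-style 2-layer MLP projector. The dimensions are illustrative for ViT-L/14 at 336px feeding a 7B model:

```python
import torch
import torch.nn as nn

# Illustrative shapes: CLIP ViT-L/14 at 336px feeding a 7B LLM
num_patches = (336 // 14) ** 2  # 24 x 24 = 576 patches
vision_dim = 1024               # ViT-L feature dimension
llm_dim = 4096                  # 7B-model embedding dimension

# LLaVA-1.5-style projector: 2-layer MLP with GELU
projector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

patch_features = torch.randn(1, num_patches, vision_dim)  # vision encoder output
visual_tokens = projector(patch_features)                 # shape (1, 576, 4096)

# These 576 vectors are concatenated with the text token embeddings and
# passed to the LLM, which attends over them like any other tokens.
```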

This is the key insight: once the projection layer is trained, the language model does not need to know that some tokens came from an image. They are just tokens in the sequence. The attention mechanism naturally learns to attend to visual tokens when they are relevant to the current generation step.

The number 576 is significant. It means every image adds 576 tokens to the context before the text prompt even begins. For a model with a 4096-token context window, this leaves only 3520 tokens for text. For a model with a 128K context window, it is negligible. Context window length matters more for VLMs than for text-only models.

Dynamic Resolution: Handling Images of Varying Sizes

The fixed 336px input of early LLaVA models was a significant limitation. Real-world documents come in various sizes and aspect ratios. A 2400x800 bank statement cannot be meaningfully compressed to a 336x336 square without losing critical fine-grained text.

LLaVA-NeXT (also called LLaVA-1.6) introduced dynamic high resolution via a tiling approach. The input image is:

  1. Resized and tiled into multiple 336px panels: up to 4 panels for a 2x2 grid, or 6 panels in other configurations.
  2. Each panel is encoded independently by the CLIP encoder, producing 576 tokens per panel.
  3. A downsampled "thumbnail" view is also encoded to preserve global context.
  4. All panel tokens plus thumbnail tokens are concatenated before the text prompt.

A 4-panel split produces $4 \times 576 + 576 = 2880$ visual tokens. This is expensive in context, but dramatically improves fine-grained text reading, small object detection, and high-resolution document understanding.
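
A small helper makes the context arithmetic explicit (a sketch following the tiling scheme above; the function names are ours):

```python
def llava_next_visual_tokens(num_panels: int, tokens_per_panel: int = 576) -> int:
    """Visual tokens for LLaVA-NeXT-style tiling: panels plus one thumbnail."""
    return (num_panels + 1) * tokens_per_panel


def remaining_text_budget(context_window: int, num_panels: int) -> int:
    """Tokens left for prompt and output once the image is encoded."""
    return context_window - llava_next_visual_tokens(num_panels)


# A 4-panel split on a 4K-context model leaves little room for text:
# llava_next_visual_tokens(4)     -> 2880
# remaining_text_budget(4096, 4)  -> 1216
```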

Qwen2-VL takes dynamic resolution further with a native resolution approach: the model accepts images at their natural resolution and processes them with a variable number of tokens. Rather than tiling, it uses a 2D RoPE position encoding that provides actual spatial position information to each visual token. This is the state-of-the-art approach as of 2024-2025 and produces significantly better spatial reasoning.

The key formula for Qwen2-VL visual token count is:

$$
N_{\text{tokens}} = \left\lfloor \frac{H \times W}{14^2 \times 4} \right\rfloor
$$

where $H$ and $W$ are the image dimensions and the factor of 4 is a compression ratio from merging neighboring patch embeddings. A 1024x1024 image produces roughly 1300 visual tokens. The model handles any aspect ratio without quality loss from forced resizing.
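
As a rough estimate in code (this ignores Qwen2-VL's internal rounding of image dimensions to patch multiples, so treat it as an approximation):

```python
import math


def qwen2_vl_token_estimate(height: int, width: int,
                            patch_size: int = 14, merge_factor: int = 4) -> int:
    """Approximate visual token count at native resolution."""
    return math.floor((height * width) / (patch_size ** 2 * merge_factor))


# qwen2_vl_token_estimate(1024, 1024) -> 1337, i.e. roughly 1300 tokens
```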


The Model Landscape

[Figure: VLM architecture comparison]

Model Sizes and Capability Tiers

| Model | Parameters | Context | Vision Tokens | Key Strength |
|---|---|---|---|---|
| LLaVA-1.5-7B | 7B | 4K | 576 | Fast, low memory, good baseline |
| LLaVA-NeXT-13B | 13B | 4K | 2880 | Better resolution via tiling |
| InternVL2-8B | 8B | 8K | Dynamic | Best-in-class OCR at 8B |
| InternVL2-26B | 26B | 8K | Dynamic | Strong document understanding |
| Qwen2-VL-7B | 7B | 128K | Dynamic | Video, native resolution, strong multilingual |
| Qwen2-VL-72B | 72B | 128K | Dynamic | GPT-4V competitive, open weights |
| LLaMA 3.2 Vision 11B | 11B | 128K | Cross-attention | Meta ecosystem, Ollama support |
| LLaMA 3.2 Vision 90B | 90B | 128K | Cross-attention | Strongest open VLM for general tasks |
| Idefics2-8B | 8B | 8K | 64 (resampled) | Efficient, fewer visual tokens |

The LLaMA 3.2 Vision Architecture Difference

LLaMA 3.2 Vision takes a different architectural approach from the LLaVA-style models. Rather than concatenating visual tokens into the main sequence, it uses a cross-attention mechanism at specific layers of the language model.

The visual encoder output is kept in a separate buffer. Cross-attention layers interleaved throughout the LLM attend to this visual buffer when processing text tokens. The advantage: visual information is available throughout the generation process without occupying the primary sequence context. The visual tokens do not count against the 128K context window.
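
The mechanism can be sketched in a few lines. This is a toy, Flamingo-style gated cross-attention block, not Meta's actual implementation; dimensions and gating details are illustrative:

```python
import torch
import torch.nn as nn


class VisualCrossAttentionBlock(nn.Module):
    """Toy sketch: text hidden states attend to a separate visual buffer."""

    def __init__(self, d_model: int = 4096, n_heads: int = 32):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Parameter(torch.zeros(1))  # starts closed: pure identity

    def forward(self, text_hidden: torch.Tensor,
                visual_buffer: torch.Tensor) -> torch.Tensor:
        # Queries come from text; keys and values come from the visual buffer,
        # so visual tokens never occupy the text sequence itself.
        attended, _ = self.attn(self.norm(text_hidden), visual_buffer, visual_buffer)
        return text_hidden + torch.tanh(self.gate) * attended
```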

The tradeoff: cross-attention requires training the LLM from scratch with cross-attention layers included in the architecture. You cannot simply take an existing LLaMA 3.1 base and add vision via a projection layer. This is why Meta released LLaMA 3.2 Vision as a distinct model family rather than an adapter on top of LLaMA 3.1.


Video Understanding

From Images to Video

Extending VLMs to video is conceptually straightforward but computationally challenging. A video is a sequence of frames. Process each frame as an image, concatenate the resulting visual tokens, prepend the text prompt. The model can then reason about temporal relationships between frames.

The challenge: a 30-second video at 1 FPS produces 30 images. At 576 tokens per image (LLaVA), that is 17,280 visual tokens before the text prompt. Most LLMs cannot handle this without running out of context or becoming extremely slow.

Current open-source solutions:

Frame sampling: Sample 8-32 frames from the video, spaced uniformly or by visual difference. Use these as the visual context. Works well for slow-moving content; misses fast action sequences.

Temporal compression: Qwen2-VL's 3D temporal fusion module represents multiple frames as a single set of visual tokens by merging temporal neighbors. A video with 16 sampled frames might produce only 4x the tokens of a single image rather than 16x.

Memory-augmented approaches: For very long videos (>5 minutes), maintain a rolling memory of compressed visual summaries. This is an active research area without a widely deployed open-source solution yet.
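
A minimal sketch of the uniform frame-sampling approach above; decoding the frames themselves is left to whatever video library you use:

```python
import numpy as np


def sample_frame_indices(total_frames: int, num_samples: int = 16) -> list[int]:
    """Pick evenly spaced frame indices across a video for VLM input."""
    if total_frames <= num_samples:
        return list(range(total_frames))
    # Evenly spaced positions from the first frame to the last
    return [int(i) for i in np.linspace(0, total_frames - 1, num_samples)]


# sample_frame_indices(900, 16) -> 16 indices spanning a 30s clip at 30 FPS
```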

InternVL2-Video and Qwen2-VL are the strongest open-source options for video understanding as of early 2025. Both support multi-frame input with temporal reasoning.


Audio Models: Whisper and SeamlessM4T

Whisper (OpenAI, 2022)

While not a vision-language model, Whisper belongs in the open-source multimodal ecosystem as the dominant openly available speech recognition model. Whisper is an encoder-decoder transformer trained on 680,000 hours of multilingual audio data.

Architecture: the audio input is converted to an 80-channel log-Mel spectrogram with 25ms windows at 10ms stride. This is processed by a CNN feature extractor followed by a transformer encoder. The encoder output is fed to a transformer decoder that generates text autoregressively.

Whisper supports 99 languages with varying quality and unifies voice activity detection, language identification, and timestamp generation in a single model. The decoder can be prompted with language tokens to improve performance on specific languages.

Available sizes: tiny (39M), base (74M), small (244M), medium (769M), large-v3 (1.5B). For production transcription at quality comparable to commercial APIs, large-v3 with faster-whisper (CTranslate2-optimized) is the standard choice.

SeamlessM4T (Meta, 2023)

SeamlessM4T (Massively Multilingual Multimodal Machine Translation) is Meta's open-source model for speech-to-text, text-to-speech, and speech-to-speech translation across 100+ languages. Where Whisper focuses on ASR, SeamlessM4T addresses the full translation stack.

For practical applications: SeamlessM4T is the best open-source option when you need multilingual speech processing beyond just transcription - particularly for building audio pipelines that handle non-English input without routing through a cloud translation API.


Code Examples

Running LLaMA 3.2 Vision with Transformers

```python
from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image
import requests
import torch


def load_llama_vision_model(model_id: str = "meta-llama/Llama-3.2-11B-Vision-Instruct"):
    """
    Load LLaMA 3.2 Vision model and processor.
    Requires ~22GB VRAM for 11B in float16, or ~11GB with 8-bit.
    """
    processor = AutoProcessor.from_pretrained(model_id)
    model = MllamaForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    return model, processor


def analyze_image(
    model,
    processor,
    image_source: str,  # URL or local file path
    question: str,
    max_new_tokens: int = 1024,
) -> str:
    """
    Analyze an image with a natural language question.

    Args:
        model: Loaded MllamaForConditionalGeneration model
        processor: Loaded AutoProcessor
        image_source: URL string or local file path string
        question: Question to ask about the image
        max_new_tokens: Maximum tokens to generate

    Returns:
        Generated text response
    """
    # Load image from URL or local path
    if image_source.startswith("http"):
        image = Image.open(requests.get(image_source, stream=True).raw)
    else:
        image = Image.open(image_source)

    # LLaMA 3.2 Vision uses a specific message format with image placeholders
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question},
            ],
        }
    ]

    # Apply chat template, then process image and text together
    input_text = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
    )

    inputs = processor(
        image,
        input_text,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(model.device)

    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.1,
        do_sample=True,
    )

    # Decode only the new tokens (skip the input)
    generated_tokens = output[0][inputs["input_ids"].shape[1]:]
    return processor.decode(generated_tokens, skip_special_tokens=True)


# Example: document understanding
model, processor = load_llama_vision_model()

response = analyze_image(
    model,
    processor,
    image_source="/path/to/invoice.jpg",
    question=(
        "Extract the following fields from this invoice as JSON: "
        "invoice_number, date, vendor_name, total_amount, line_items (list of "
        "{description, quantity, unit_price, total}). "
        "If a field is not visible, use null."
    ),
)
print(response)
```

Document OCR and Structured Extraction with InternVL2

```python
from transformers import AutoTokenizer, AutoModel
from PIL import Image
import torch
import json
import re


def load_internvl2(model_name: str = "OpenGVLab/InternVL2-8B"):
    """
    Load InternVL2 model. Strong for OCR and document understanding.
    The 8B model needs ~16GB VRAM in float16.
    """
    model = AutoModel.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    return model, tokenizer


def preprocess_image_for_internvl2(image_path: str, max_tiles: int = 6) -> torch.Tensor:
    """
    Preprocess image with dynamic tiling for high-resolution documents.
    More tiles = better quality, more context tokens, slower inference.
    """
    from transformers import AutoImageProcessor

    processor = AutoImageProcessor.from_pretrained(
        "OpenGVLab/InternVL2-8B",
        trust_remote_code=True,
    )

    image = Image.open(image_path).convert("RGB")

    # Dynamic resize - InternVL2 tiles images larger than 448x448
    pixel_values = processor(
        images=image,
        return_tensors="pt",
        max_num=max_tiles,  # Maximum number of tiles to use
    ).pixel_values

    return pixel_values


def extract_structured_data_from_document(
    model,
    tokenizer,
    image_path: str,
    extraction_schema: dict,
    max_new_tokens: int = 2048,
) -> dict:
    """
    Extract structured data from a document image using InternVL2.

    Args:
        model: Loaded InternVL2 model
        tokenizer: Loaded tokenizer
        image_path: Path to document image
        extraction_schema: Dict describing fields to extract
        max_new_tokens: Max tokens for response

    Returns:
        Parsed JSON dict with extracted fields
    """
    pixel_values = preprocess_image_for_internvl2(image_path).to(
        model.device, dtype=torch.float16
    )

    # Build extraction prompt from schema
    schema_description = "\n".join(
        f"- {field}: {description}"
        for field, description in extraction_schema.items()
    )

    prompt = f"""<image>
Analyze this document and extract the following information.
Return a valid JSON object with these exact field names.
Use null for any field not found in the document.

Fields to extract:
{schema_description}

Return only the JSON object, no explanation."""

    generation_config = {
        "max_new_tokens": max_new_tokens,
        "do_sample": False,  # Greedy decoding for structured extraction
        "pad_token_id": tokenizer.eos_token_id,
    }

    response = model.chat(
        tokenizer,
        pixel_values,
        prompt,
        generation_config,
    )

    # Parse JSON from the response.
    # Models sometimes wrap JSON in fenced code blocks.
    json_match = re.search(r"`{3}(?:json)?\n?(.*?)\n?`{3}", response, re.DOTALL)
    if json_match:
        json_str = json_match.group(1)
    else:
        json_str = response.strip()

    try:
        return json.loads(json_str)
    except json.JSONDecodeError:
        # Return raw response if JSON parsing fails
        return {"raw_response": response, "parse_error": True}


# Example: extract data from a business registration document
schema = {
    "company_name": "Legal name of the company",
    "registration_number": "Business registration or incorporation number",
    "registration_date": "Date of registration (ISO 8601 format if possible)",
    "registered_address": "Full registered address",
    "directors": "List of director names",
    "share_capital": "Total authorized share capital with currency",
}

# model, tokenizer = load_internvl2()
# result = extract_structured_data_from_document(
#     model, tokenizer,
#     image_path="/path/to/business_reg.jpg",
#     extraction_schema=schema,
# )
# print(json.dumps(result, indent=2))
```

Evaluating VLM Quality on Custom Tasks

```python
from dataclasses import dataclass
from typing import Callable, Optional
import json
import re


@dataclass
class VLMTestCase:
    """A single evaluation case for a VLM."""
    image_path: str
    question: str
    expected_answer: str
    # Optional: function to check if a generated answer is correct.
    # If None, uses a normalized substring match (not recommended).
    answer_checker: Optional[Callable[[str, str], bool]] = None
    task_category: str = "general"


def default_answer_checker(expected: str, generated: str) -> bool:
    """
    Default answer checker: normalized substring match.
    For production evaluation, replace with task-specific logic.
    """
    expected_normalized = expected.lower().strip().rstrip(".")
    generated_normalized = generated.lower().strip().rstrip(".")
    return expected_normalized in generated_normalized


def numeric_answer_checker(tolerance: float = 0.01) -> Callable[[str, str], bool]:
    """Factory for numeric answer checkers with relative tolerance."""
    def checker(expected: str, generated: str) -> bool:
        # Extract the first number from each string (commas stripped first)
        expected_num = re.search(r"\d+\.?\d*", expected.replace(",", ""))
        generated_num = re.search(r"\d+\.?\d*", generated.replace(",", ""))

        if not expected_num or not generated_num:
            return False

        try:
            e = float(expected_num.group())
            g = float(generated_num.group())
            return abs(e - g) / max(abs(e), 1e-8) <= tolerance
        except ValueError:
            return False

    return checker


def json_field_checker(required_fields: list[str]) -> Callable[[str, str], bool]:
    """Factory for JSON extraction checkers - verifies required fields are present."""
    def checker(expected: str, generated: str) -> bool:
        try:
            # Try to parse a JSON object from the generated text
            json_match = re.search(r"\{.*\}", generated, re.DOTALL)
            if not json_match:
                return False
            parsed = json.loads(json_match.group())
            # Check all required fields are present and non-null
            return all(
                field in parsed and parsed[field] is not None
                for field in required_fields
            )
        except (json.JSONDecodeError, KeyError):
            return False

    return checker


def run_vlm_evaluation(
    model,
    processor_or_tokenizer,
    test_cases: list[VLMTestCase],
    model_family: str = "llava",  # "llava", "internvl2", "llama_vision"
) -> dict:
    """
    Run evaluation across a list of VLM test cases.

    Returns accuracy by task category and overall metrics.
    """
    results = []
    category_results = {}

    for case in test_cases:
        # Generate an answer using the appropriate model interface
        if model_family == "llama_vision":
            # analyze_image() is defined in the LLaMA 3.2 Vision example above
            generated = analyze_image(
                model, processor_or_tokenizer,
                case.image_path, case.question,
            )
        elif model_family == "internvl2":
            # Simplified - in practice use the full InternVL2 inference pipeline
            generated = "[internvl2 inference]"
        else:
            generated = "[other model inference]"

        # Check correctness with the case's checker, falling back to the default
        checker = case.answer_checker or default_answer_checker
        correct = checker(case.expected_answer, generated)

        results.append({
            "task_category": case.task_category,
            "correct": correct,
            "generated": generated,
            "expected": case.expected_answer,
        })

        # Track per-category counts
        if case.task_category not in category_results:
            category_results[case.task_category] = {"correct": 0, "total": 0}
        category_results[case.task_category]["total"] += 1
        if correct:
            category_results[case.task_category]["correct"] += 1

    overall_accuracy = sum(r["correct"] for r in results) / len(results)
    category_accuracy = {
        cat: data["correct"] / data["total"]
        for cat, data in category_results.items()
    }

    return {
        "overall_accuracy": overall_accuracy,
        "category_accuracy": category_accuracy,
        "n_test_cases": len(test_cases),
        "results": results,
    }


# Example test cases for an invoice processing system
test_cases = [
    VLMTestCase(
        image_path="/data/test/invoice_001.jpg",
        question="What is the total amount due on this invoice?",
        expected_answer="1247.50",
        answer_checker=numeric_answer_checker(tolerance=0.005),
        task_category="numeric_extraction",
    ),
    VLMTestCase(
        image_path="/data/test/invoice_002.jpg",
        question="Extract invoice_number, vendor_name, and total_amount as JSON.",
        expected_answer='{"invoice_number": "INV-2024-0892", "vendor_name": "Acme Corp", "total_amount": 3400.00}',
        answer_checker=json_field_checker(["invoice_number", "vendor_name", "total_amount"]),
        task_category="structured_extraction",
    ),
]
```

Running Whisper for Transcription

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from typing import Optional


def build_whisper_pipeline(
    model_id: str = "openai/whisper-large-v3",
    device: str = "auto",
    use_flash_attention: bool = False,
):
    """
    Build a Whisper transcription pipeline.

    For production throughput, prefer faster-whisper (CTranslate2) over
    this HuggingFace pipeline. Use this for prototyping.
    """
    if device == "auto":
        device = "cuda" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float16 if device == "cuda" else torch.float32

    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id,
        torch_dtype=torch_dtype,
        low_cpu_mem_usage=True,
        use_safetensors=True,
        attn_implementation="flash_attention_2" if use_flash_attention else "eager",
    )
    model.to(device)

    processor = AutoProcessor.from_pretrained(model_id)

    whisper_pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        torch_dtype=torch_dtype,
        device=device,
    )

    return whisper_pipe


def transcribe_audio(
    pipe,
    audio_path: str,
    language: Optional[str] = None,  # None = auto-detect
    return_timestamps: bool = True,
) -> dict:
    """
    Transcribe an audio file.

    Returns dict with:
    - text: full transcript
    - chunks: list of {text, timestamp} if return_timestamps=True
    """
    generate_kwargs = {}
    if language:
        generate_kwargs["language"] = language

    result = pipe(
        audio_path,
        generate_kwargs=generate_kwargs,
        return_timestamps=return_timestamps,
        chunk_length_s=30,  # Process long audio in 30s chunks
        stride_length_s=5,  # 5s overlap between chunks
    )

    return result


# Usage
# pipe = build_whisper_pipeline()
# result = transcribe_audio(pipe, "/path/to/call_recording.mp3")
# print(result["text"])
```

Production Engineering Notes

Memory Management for VLMs

VLMs are memory-intensive because they load both a vision encoder and a language model. A typical 7B VLM with CLIP ViT-L uses:

  • Vision encoder: ~1.2GB (CLIP ViT-L at float16)
  • Projection layer: negligible
  • Language model: ~14GB (7B at float16)
  • Total: ~15.2GB

For an 11B VLM like LLaMA 3.2 Vision:

  • Vision encoder: ~1.5GB
  • Cross-attention adapters: ~1GB
  • Language model: ~22GB
  • Total: ~24.5GB

4-bit quantization (AWQ or GPTQ) reduces LLM memory roughly 4x. Vision encoder quantization is less common and more quality-sensitive. In practice: quantize the LLM, keep the vision encoder at float16.
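
One possible sketch using bitsandbytes NF4 quantization (the AWQ/GPTQ route loads pre-quantized checkpoints instead; the module name to skip is an assumption and should be verified against the checkpoint):

```python
import torch
from transformers import BitsAndBytesConfig, MllamaForConditionalGeneration

# 4-bit NF4 quantization for the language model weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    # Keep the vision tower unquantized (module name assumed; check the model)
    llm_int8_skip_modules=["vision_model"],
)

model = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```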

Batching Images: Why It Is Harder Than Batching Text

Text batching is straightforward: pad all sequences to the same length, process as a matrix. Image batching is more complex because:

  1. Images have different resolutions. Fixed-resolution models can batch easily, but dynamic-resolution models (Qwen2-VL, InternVL2) produce variable numbers of visual tokens per image.

  2. Dynamic resolution models require either padding visual tokens to the same length (wastes compute) or using sequence packing (complex implementation).

For high-throughput production serving, the standard approach is:

  • Group images into resolution buckets (e.g., group by closest 336px multiple)
  • Process each bucket as a batch
  • For dynamic resolution models, use sequence packing with attention masks
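
A minimal sketch of the bucketing step (bucket keys here are nearest-336px dimensions; a production system would also cap batch size per bucket):

```python
from collections import defaultdict
from PIL import Image


def bucket_by_resolution(image_paths: list[str],
                         base: int = 336) -> dict[tuple[int, int], list[str]]:
    """Group images by nearest base-multiple resolution for clean batching."""
    buckets: dict[tuple[int, int], list[str]] = defaultdict(list)
    for path in image_paths:
        with Image.open(path) as img:
            w, h = img.size
        # Round each dimension to the nearest multiple of `base`, minimum one tile
        key = (max(base, round(w / base) * base), max(base, round(h / base) * base))
        buckets[key].append(path)
    return dict(buckets)
```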

vLLM and SGLang both have native multimodal batching support as of 2024. For new production deployments, these serving frameworks handle the complexity automatically.

Quality vs Latency Tradeoffs

For document OCR tasks, a rough hierarchy of latency vs quality:

| Approach | Latency (per page) | Quality on hard docs |
|---|---|---|
| Tesseract OCR (traditional) | ~200ms | Low on degraded/handwritten |
| LLaVA-1.5-7B (low res) | ~1.5s | Good for clean docs, poor for small text |
| InternVL2-8B (4 tiles) | ~4s | Excellent for mixed/degraded docs |
| LLaMA 3.2 Vision 11B | ~6s | Very good general understanding |
| Qwen2-VL-72B | ~15-25s | State-of-the-art open-source quality |

These numbers assume a single A100 80GB. Latency scales roughly linearly with visual token count, so high-resolution processing (more tiles) directly increases latency. Production systems frequently offer multiple quality tiers and route documents based on initial confidence estimates.
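
As an illustration of tier routing, a hypothetical rule over these tiers might look like this (the thresholds are illustrative, not recommendations):

```python
def route_document(confidence: float) -> str:
    """Pick a processing tier from an upstream confidence estimate."""
    if confidence >= 0.9:
        return "llava-1.5-7b"   # fast tier for clean, easy documents
    if confidence >= 0.6:
        return "internvl2-8b"   # high-resolution tier for harder scans
    return "human_review"       # below threshold: do not auto-process
```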

Prompt Engineering for Document Extraction

VLMs respond differently to different prompt formats for structured extraction tasks. Several patterns that consistently improve extraction quality:

  1. Explicit null handling: "If a field is not visible in the document, return null rather than guessing." Models have a strong tendency to hallucinate plausible-looking values for missing fields without this instruction.

  2. Format specification: "Return exactly one JSON object with no markdown formatting, no explanation, no code blocks." Reduces post-processing complexity.

  3. Confidence signals: "If you are uncertain about a value, add a _confidence field with values 'high', 'medium', or 'low'." Surfaces uncertainty for human review routing.

  4. Anchor to visible elements: "Only extract information that is directly visible in this document. Do not infer values from context." Reduces hallucination on documents where some expected fields are genuinely absent.
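
Putting the four patterns together, a prompt builder might look like this (a sketch; the exact wording should be tuned per model family):

```python
def build_extraction_prompt(fields: dict[str, str]) -> str:
    """Compose a document-extraction prompt applying the patterns above."""
    field_lines = "\n".join(f"- {name}: {desc}" for name, desc in fields.items())
    return (
        "Extract the following fields from this document.\n"
        f"{field_lines}\n\n"
        "Rules:\n"
        "- Only extract values directly visible in the document; do not infer.\n"
        "- If a field is not visible, return null rather than guessing.\n"
        "- If you are uncertain about a value, add a _confidence field "
        "('high', 'medium', or 'low') next to it.\n"
        "- Return exactly one JSON object: no markdown, no explanation, "
        "no code blocks."
    )
```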


Common Mistakes

:::danger Trusting VLM Output Without Validation for High-Stakes Extraction

VLMs hallucinate. They will confidently generate a plausible-looking invoice number, account number, or address when the actual value is partially obscured or ambiguous. For any application where incorrect extraction has material consequences (financial processing, medical records, legal documents), implement confidence scoring and human review routing. Never deploy a VLM extraction pipeline without a validation step.

:::

:::danger Ignoring Context Token Budget for High-Resolution Images

A 4-tile LLaVA-NeXT image produces 2880 visual tokens before the text prompt begins. On a 4096-token context model, that leaves only 1216 tokens for prompt and output. Complex extraction prompts easily exceed this. Always compute the visual token count for your expected image resolutions and verify that prompt + expected output fits in the remaining context. The model will not warn you when it truncates.

:::

:::warning Using LLaVA-Style Models for Fine-Grained Text Recognition

LLaVA-1.5 and LLaVA-NeXT use 336px base resolution, which is insufficient for reading small text in documents. License plates, serial numbers, fine-print terms, and small-font tables in scanned PDFs will be misread or hallucinated. For OCR-intensive tasks, use InternVL2 (specifically designed for high-resolution document reading) or Qwen2-VL with high resolution settings, not the LLaVA family.

:::

:::warning Assuming Chat Template Compatibility Between VLM Families

Each VLM family uses a different conversation format, image placeholder syntax, and special tokens. LLaMA 3.2 Vision uses {"type": "image"} dict syntax in messages. LLaVA uses <image> text tokens. InternVL2 uses <image> with trust_remote_code=True. Mixing up these formats produces either errors or silently degraded output where the model ignores the image. Always use apply_chat_template() on the correct processor or tokenizer for each model.

:::

:::warning Running Whisper on Long Audio Without Chunking

Whisper's encoder has a fixed 30-second receptive window. Audio longer than 30 seconds must be processed in chunks with overlap. Without chunking, the model will either truncate input, generate hallucinated text, or produce degraded output on long recordings. The HuggingFace pipeline handles this automatically with chunk_length_s=30, stride_length_s=5. Standalone implementations must implement chunking explicitly.

:::


Interview Q&A

Q1: Explain the three-component architecture of a vision-language model. What is each component's role?

A vision-language model has three main components: a vision encoder, a projection layer, and a language model decoder.

The vision encoder processes the raw image into a sequence of feature vectors. In most open-source VLMs (LLaVA, InternVL2), this is a Vision Transformer (ViT) pre-trained with CLIP contrastive objectives. The image is divided into fixed-size patches (typically 14x14 or 16x16 pixels), each patch is encoded as a vector, and the resulting sequence of vectors captures visual features at different spatial locations in the image.

The projection layer is a learned network (ranging from a single linear layer in early LLaVA to a 2-layer MLP to cross-attention modules in later models) that maps vision encoder output vectors into the same dimensionality as the language model's token embeddings. After this projection, each visual feature vector becomes a "visual token" that the language model processes identically to text tokens.

The language model decoder receives the concatenated sequence of visual tokens and text tokens and generates output autoregressively. Because the language model's attention mechanism treats visual tokens like text tokens, it can naturally learn to attend to relevant image regions when answering questions or generating descriptions.

The important insight is that almost all of the capability comes from the pre-trained components. The vision encoder already understands images (from CLIP training). The language model already understands instructions (from LLM pre-training and instruction tuning). The projection layer is small - LLaVA's original linear projection was 4 million parameters compared to billions in the base models. The training required to connect them is relatively lightweight because you are teaching two already-capable systems to communicate, not training from scratch.

Q2: What is dynamic resolution in VLMs and why does it matter for document understanding?

Dynamic resolution refers to VLM architectures that can process images at varying sizes and aspect ratios rather than being limited to a fixed input size.

Early VLMs like LLaVA-1.5 resized all inputs to 336x336 pixels. This works for natural scene understanding where content is distributed across the image, but fails catastrophically for document reading. A standard letter-size document has text at 12pt font, which when compressed to 336px becomes roughly 4-6 pixels tall - below the resolution required for legible character recognition.

LLaVA-NeXT addressed this with a tiling approach: the input image is divided into multiple 336px tiles (up to a 2x2 or 3x2 grid), each encoded separately, with a downsampled thumbnail also included for global context. This produces up to 2880 visual tokens but enables the model to read fine-print text.

Qwen2-VL takes a more principled approach: the model processes images at their native resolution, encoding variable numbers of tokens based on actual image size, and uses 2D rotary position encoding to provide actual spatial position information to each visual token. A 1920x1080 image produces roughly 2700 tokens with true spatial positions preserved, rather than being artificially tiled into a grid.

For document understanding specifically, dynamic resolution is the difference between a model that can read "Total: $1,247.50" from a scanned invoice and one that returns "Total: [unclear]". InternVL2 and Qwen2-VL both use forms of dynamic resolution and are the recommended choices for OCR-intensive production applications.

Q3: How would you set up an evaluation framework for a VLM deployed for invoice processing?

The evaluation framework needs to cover three dimensions: field-level extraction accuracy, failure mode analysis, and operational metrics.

For field-level accuracy: create a test set of at least 200 invoices with ground truth labels for each field you extract (vendor name, invoice number, date, line items, totals). Ground truth should be verified by a human reviewer, not generated programmatically. For each field, define an appropriate correctness criterion - exact match for structured IDs, numeric match within tolerance for amounts, fuzzy match for vendor names (to handle minor OCR variations). Measure per-field accuracy separately, not just overall.

For failure mode analysis: categorize failures by cause. Common failure modes are hallucination (model invents a plausible value for a missing field), OCR errors (model misreads text), schema errors (model returns the right information but in the wrong JSON structure), and refusals (model says it cannot read the document). Each failure mode requires a different mitigation strategy.

For operational metrics: measure what percentage of documents require human review (confidence below threshold), average processing time, and failure rate under production load. A model that is 95% accurate on test data but requires human review on 40% of production documents may not meet business requirements even if benchmark accuracy looks good.

The test set composition matters as much as its size. Include hard cases deliberately: rotated scans, low-quality photocopies, unusual form layouts, handwritten annotations alongside printed text, multi-page documents, and documents from vendors you have not seen before. If your test set is only clean, well-formatted invoices, your accuracy metrics will significantly overestimate production performance.

Q4: What are the tradeoffs between the LLaVA-style projection approach and LLaMA 3.2 Vision's cross-attention approach?

The projection approach (LLaVA) and the cross-attention approach (LLaMA 3.2 Vision) solve the same problem differently with meaningful tradeoffs.

In the projection approach, visual tokens are prepended to the text token sequence. The language model attends to visual tokens in exactly the same way it attends to text tokens - they are part of the primary sequence. This is architecturally simple: take any existing LLM, add a projection layer, fine-tune. The downside is that visual tokens consume context window space and add to the sequence length that attention must process (attention cost is quadratic in sequence length). Adding more tiles for better resolution has a direct cost in both context budget and latency.

In the cross-attention approach (LLaMA 3.2 Vision), the visual encoder output is stored in a separate memory buffer. Cross-attention layers interleaved throughout the LLM attend to this buffer when processing text tokens. The visual information is available to all layers throughout generation without occupying the primary sequence. The context window is available entirely for text. Processing 10 tiles adds no tokens to the text sequence.

The cost of cross-attention: you cannot build it by adapting an existing text-only LLM. Cross-attention layers must be part of the architecture from pre-training. This means LLaMA 3.2 Vision cannot be used as a drop-in replacement for LLaMA 3.1 in tooling that expects a specific architecture - it is a distinct model family.

For production: the projection approach is generally more flexible (easier to swap LLM backbone, easier to quantize, more compatible with serving frameworks). The cross-attention approach provides better context efficiency at higher resolutions. The practical choice depends on whether you are doing high-resolution work (cross-attention advantage) or building on existing LLaMA 3.1 tooling (projection advantage).

Q5: How does CLIP contrastive training create a shared vision-language representation space, and why is this the foundation of VLMs?

CLIP is trained on 400 million image-text pairs with a contrastive objective. For each batch of N pairs, the model computes a similarity score between every possible image-text combination in the batch, producing an N x N similarity matrix. The objective is to make the diagonal (matched pairs) high-similarity and everything off-diagonal (mismatched pairs) low-similarity.

This forces the vision encoder and text encoder to produce embeddings in the same semantic space. The vision encoder must produce a representation of "a cat sitting on a sofa" that is close to the text encoder's representation of that phrase. To satisfy this for 400 million diverse examples, the vision encoder must learn to extract the semantically important aspects of images in terms that the language model can interpret.

This shared representation space is foundational for VLMs because it solves the bridging problem. Before CLIP, vision encoders were trained on classification tasks - they produced features optimized for predicting "cat", "dog", "car" from a fixed label set. These features had no natural alignment with language. Connecting them to an LLM required extensive task-specific training.

After CLIP, vision encoder features already "speak the language of text" in some approximate sense. The projection layer in LLaVA is small (4M parameters) because it only needs to do a mild geometric transformation between two already-aligned spaces, not a full semantic translation between unrelated spaces.

The practical consequence: LLaVA can be trained cheaply with relatively few instruction-following examples because the heavy lifting of visual understanding is already done by the frozen CLIP encoder. The training teaches the LLM to use CLIP features for instruction following - a much simpler task than teaching it visual understanding from scratch.

Q6: What does it mean for a VLM to "hallucinate" visually, and how is this different from LLM text hallucination?

Text hallucination in LLMs is generating false factual claims - saying "Einstein was born in 1879 in Paris" (wrong city) with high confidence. Visual hallucination in VLMs is generating false claims about image content - describing objects, text, or attributes that are not present in the image.

Visual hallucination has several distinct subtypes. Object hallucination means claiming an object is in the image when it is not - saying "there is a dog in the background" when there is no dog. Attribute hallucination means correctly identifying an object but assigning false attributes - "the red car" when the car is blue. OCR hallucination means generating plausible-looking text that is not what the image actually contains - generating a realistic but incorrect invoice number.

The mechanisms differ from text hallucination. Visual hallucination often occurs when the model relies on statistical associations learned from training data rather than actual image content. If a model has seen thousands of invoices, it knows what invoices typically look like and can generate plausible invoice content even from a blurry or partially visible input. The model is essentially "completing" the image from its prior, similar to how LLMs complete sentences based on training distribution.

Mitigation strategies include: low temperature decoding (reduces hallucinated creative completions), explicit null-return instructions ("say null if you cannot read this clearly"), multi-sample consistency checking (if different samples produce different values, flag for human review), and retrieval augmentation (anchor the model to verified reference data where possible).

For production document processing, hallucination is the primary quality risk. The defense in depth strategy is: multiple confidence signals, human review routing for low-confidence outputs, and periodic evaluation audits on held-out documents with verified ground truth.
