Module 08: Multimodal Models

Language models started with text. But human communication is inherently multimodal - we use images to explain what words can't capture, audio to convey emotion and nuance, and diagrams to compress complex ideas into a single glance. This module covers how modern AI systems bridge those modalities.

By the end of this module you will understand how vision encoders turn pixels into token sequences that transformers can reason over, why contrastive learning on internet-scale image-text pairs unlocks zero-shot classification, how diffusion models learn to reverse noise into coherent images, and how to build production pipelines that handle the cost, latency, and failure modes unique to multimodal AI.
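As a preview of the first objective above (turning pixels into a token sequence), here is a minimal sketch of the patch-splitting step that precedes a vision encoder. It assumes NumPy, an image stored as a height × width × channels array, and a helper name `patchify` chosen for illustration; real encoders also apply a learned linear projection and positional embeddings, which are omitted here.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened square patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row across the image.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size
    # Split each spatial axis into (grid index, offset inside the patch).
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    # Bring the two grid axes to the front so each patch is contiguous.
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten the grid into a sequence and each patch into one vector.
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)


# Illustrative input: a blank 224x224 RGB image with 16x16 patches.
image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(image, patch_size=16)
print(tokens.shape)  # (196, 768): a 14x14 grid of patches, each 16*16*3 values
```

Each row of `tokens` plays the role of one "token" for the transformer, so the sequence length grows quadratically as the image side grows or the patch size shrinks.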

The Multimodal Landscape

Lessons in This Module

#	Lesson	Core Concept	Key Paper
01	Vision-Language Models	ViT + projection layer + LLM	LLaVA (Liu et al., 2023)
02	CLIP and Contrastive Learning	Dual encoder + InfoNCE loss	CLIP (Radford et al., 2021)
03	Diffusion Models	Reverse noise process	DDPM (Ho et al., 2020), Stable Diffusion (Rombach et al., 2022)
04	Audio-Language Models	Spectrogram + encoder-decoder	Whisper (Radford et al., 2022)
05	Multimodal RAG	Image retrieval + VLM captioning	ColPali (Faysse et al., 2024)
06	Production Multimodal Systems	Cost, latency, failure modes	Engineering practice

Prerequisites

Before starting this module you should be comfortable with:

Transformer architecture - self-attention, positional encodings, encoder-decoder structure (Module 01)
Pretraining and fine-tuning - how models are trained on large corpora, then adapted (Module 02)
Embeddings - what a dense vector representation means, cosine similarity (Module 17)
Basic RAG - retrieval-augmented generation pipeline (covered in the RAG track)

Key Concepts Glossary

Term	Definition
Vision Transformer (ViT)	Transformer applied to a sequence of image patches instead of text tokens
Image patch	Fixed-size tile of pixels (typically 14×14 or 16×16) that becomes one "token"
Contrastive learning	Training by pulling matching pairs together and pushing non-matching pairs apart in embedding space
InfoNCE loss	Contrastive loss derived from Noise Contrastive Estimation (NCE); CLIP trains with a symmetric version of it
Latent diffusion	Running diffusion in compressed latent space rather than pixel space (Stable Diffusion)
Classifier-free guidance (CFG)	Technique to trade diversity for prompt adherence in diffusion models
Cross-attention fusion	Fusing modalities by letting one attend over another (used in Flamingo)
Q-Former	Bottleneck query transformer that distills image features for LLM input (used in BLIP-2)
VLM	Vision-Language Model - any model that processes both images and text
Log-Mel spectrogram	Standard audio representation: energy per Mel-scaled frequency band over time, with log-compressed magnitudes
ColPali	Document retrieval method treating pages as images with patch-level multi-vector embeddings
Image tokens	How images are represented in VLM context windows - typically 256–1024 tokens per image
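
To make the contrastive learning and InfoNCE entries above concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired image and text embeddings. It assumes NumPy, a fixed temperature of 0.07 (in CLIP the temperature is a learned parameter), and function names chosen for illustration; it is not the reference CLIP implementation.

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def symmetric_info_nce(image_emb: np.ndarray, text_emb: np.ndarray,
                       temperature: float = 0.07) -> float:
    """Average of image-to-text and text-to-image contrastive losses.

    Row i of image_emb and row i of text_emb are assumed to be a matching
    pair; every other row in the batch serves as a negative.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Pairwise similarity matrix, shape (N, N); matches sit on the diagonal.
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])
    # Cross-entropy with the diagonal as the target, in both directions.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_i2t + loss_t2i))


# Illustrative data: random embeddings, first paired correctly, then mispaired.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
print("aligned pairs:  ", symmetric_info_nce(emb, emb))
print("mispaired batch:", symmetric_info_nce(emb, emb[::-1]))
```

The loss falls as matching pairs become more similar than the non-matching pairs in the same batch, which is the "pull together, push apart" behavior described in the glossary.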
