Module 08: Multimodal Models
Language models started with text. But human communication is inherently multimodal - we use images to explain what words can't capture, audio to convey emotion and nuance, and diagrams to compress complex ideas into a single glance. This module covers how modern AI systems bridge those modalities.
By the end of this module you will understand how vision encoders turn pixels into token sequences that transformers can reason over, why contrastive learning on internet-scale image-text pairs unlocks zero-shot classification, how diffusion models learn to reverse noise into coherent images, and how to build production pipelines that handle the cost, latency, and failure modes unique to multimodal AI.
The Multimodal Landscape
Lessons in This Module
| # | Lesson | Core Concept | Key Paper |
|---|---|---|---|
| 01 | Vision-Language Models | ViT + projection layer + LLM | LLaVA (Liu et al., 2023) |
| 02 | CLIP and Contrastive Learning | Dual encoder + InfoNCE loss | CLIP (Radford et al., 2021) |
| 03 | Diffusion Models | Learned reverse (denoising) process | DDPM (Ho et al., 2020), Stable Diffusion (Rombach et al., 2022) |
| 04 | Audio-Language Models | Spectrogram + encoder-decoder | Whisper (Radford et al., 2022) |
| 05 | Multimodal RAG | Image retrieval + VLM captioning | ColPali (Faysse et al., 2024) |
| 06 | Production Multimodal Systems | Cost, latency, failure modes | Engineering practice |
Prerequisites
Before starting this module you should be comfortable with:
- Transformer architecture - self-attention, positional encodings, encoder-decoder structure (Module 01)
- Pretraining and fine-tuning - how models are trained on large corpora, then adapted (Module 02)
- Embeddings - what a dense vector representation means, cosine similarity (Module 17)
- Basic RAG - retrieval-augmented generation pipeline (covered in the RAG track)
Key Concepts Glossary
| Term | Definition |
|---|---|
| Vision Transformer (ViT) | Transformer applied to a sequence of image patches instead of text tokens |
| Image patch | Fixed-size square tile of pixels (typically 14×14 or 16×16) that is embedded as one "token" |
| Contrastive learning | Training by pulling matching pairs together and pushing non-matching pairs apart in embedding space |
| InfoNCE loss | The contrastive loss function used by CLIP; the name combines "Info" (information) with NCE, Noise-Contrastive Estimation |
| Latent diffusion | Running diffusion in compressed latent space rather than pixel space (Stable Diffusion) |
| Classifier-free guidance (CFG) | Technique to trade diversity for prompt adherence in diffusion models |
| Cross-attention fusion | Fusing modalities by letting one attend over another (used in Flamingo) |
| Q-Former | Bottleneck query transformer that distills image features for LLM input (used in BLIP-2) |
| VLM | Vision-Language Model - any model that processes both images and text |
| Log-Mel spectrogram | Standard audio representation: a time-frequency grid with Mel-spaced frequency bins and log-scaled magnitudes |
| ColPali | Document retrieval method treating pages as images with patch-level multi-vector embeddings |
| Image tokens | The sequence of embeddings that represents an image in a VLM context window - typically 256–1024 tokens per image, depending on resolution and architecture |
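
Two glossary entries are easier to pin down with numbers: how patch size determines the image token count, and what the InfoNCE loss computes. The block below is a minimal pure-Python sketch, not code from any of the papers above. The function names (`patch_token_count`, `info_nce`), the toy 2-dimensional embeddings, and the temperature of 0.07 are assumptions chosen for illustration. It also assumes square images, non-overlapping patches, and no extra tokens such as a class token.

```python
import math


def patch_token_count(image_size: int, patch_size: int) -> int:
    """Patch tokens for a square image cut into non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the true match."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch in which image i is paired with text i.

    temperature=0.07 is an assumed illustrative value.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text: each row should put its mass on the diagonal entry.
    image_to_text = sum(_cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: the same check applied to each column.
    text_to_image = sum(
        _cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


# 224 / 14 = 16 patches per side -> 256 tokens
tokens_224 = patch_token_count(224, 14)
# 336 / 14 = 24 patches per side -> 576 tokens
tokens_336 = patch_token_count(336, 14)

# Toy batch: correctly paired embeddings versus the same texts in swapped order.
images = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = info_nce(images, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = info_nce(images, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch `loss_matched` comes out far smaller than `loss_swapped`, which is the "pull matching pairs together, push non-matching pairs apart" behavior the glossary describes. The token counts show why higher input resolution is expensive: at a fixed patch size, doubling the image side length quadruples the number of image tokens.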
