
Module 08: Multimodal Models

Language models started with text. But human communication is inherently multimodal - we use images to explain what words can't capture, audio to convey emotion and nuance, and diagrams to compress complex ideas into a single glance. This module covers how modern AI systems bridge those modalities.

By the end of this module you will understand how vision encoders turn pixels into token sequences that transformers can reason over, why contrastive learning on internet-scale image-text pairs unlocks zero-shot classification, how diffusion models learn to reverse noise into coherent images, and how to build production pipelines that handle the cost, latency, and failure modes unique to multimodal AI.
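As a rough illustration of the first objective, turning pixels into a token sequence, the sketch below splits a toy single-channel image into non-overlapping square patches and counts how many patch tokens a given resolution would produce. The function names `patchify` and `num_patch_tokens` are hypothetical helpers written for this example, not part of any library. A real vision encoder also applies a learned linear projection and positional embeddings, which are omitted here.

```python
def patchify(image, patch):
    """Split an H x W grid (a list of rows) into non-overlapping patch x patch tiles.

    Tiles are returned in row-major order. Each tile is flattened into one list,
    which stands in for the raw vector a ViT would project into a token embedding.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            tiles.append(tile)
    return tiles


def num_patch_tokens(height, width, patch):
    """Number of patch tokens for an image, ignoring class or other special tokens."""
    return (height // patch) * (width // patch)


# Toy 4x4 "image" with 2x2 patches: 4 tokens of 4 pixel values each.
toy = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(toy, patch=2)
print(len(tokens), tokens[0])            # 4 [0, 1, 4, 5]
print(num_patch_tokens(224, 224, 16))    # 196
print(num_patch_tokens(336, 336, 14))    # 576
```

The token count grows quadratically with image side length, which is one reason image resolution has a direct effect on context-window usage and inference cost.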

The Multimodal Landscape

Lessons in This Module

| # | Lesson | Core Concept | Key Paper |
|---|---|---|---|
| 01 | Vision-Language Models | ViT + projection layer + LLM | LLaVA (Liu et al., 2023) |
| 02 | CLIP and Contrastive Learning | Dual encoder + InfoNCE loss | CLIP (Radford et al., 2021) |
| 03 | Diffusion Models | Reverse noise process | DDPM, Stable Diffusion |
| 04 | Audio-Language Models | Spectrogram + encoder-decoder | Whisper (Radford et al., 2022) |
| 05 | Multimodal RAG | Image retrieval + VLM captioning | ColPali (Faysse et al., 2024) |
| 06 | Production Multimodal Systems | Cost, latency, failure modes | Engineering practice |

Prerequisites

Before starting this module you should be comfortable with:

  • Transformer architecture - self-attention, positional encodings, encoder-decoder structure (Module 01)
  • Pretraining and fine-tuning - how models are trained on large corpora, then adapted (Module 02)
  • Embeddings - what a dense vector representation means, cosine similarity (Module 17)
  • Basic RAG - retrieval-augmented generation pipeline (covered in the RAG track)

Key Concepts Glossary

| Term | Definition |
|---|---|
| Vision Transformer (ViT) | Transformer applied to a sequence of image patches instead of text tokens |
| Image patch | Fixed-size tile of pixels (typically 14×14 or 16×16) that becomes one "token" |
| Contrastive learning | Training by pulling matching pairs together and pushing non-matching pairs apart in embedding space |
| InfoNCE loss | The contrastive loss function used by CLIP; the name combines "Info" (mutual information) with NCE (Noise-Contrastive Estimation) |
| Latent diffusion | Running diffusion in a compressed latent space rather than pixel space (Stable Diffusion) |
| Classifier-free guidance (CFG) | Sampling technique that trades diversity for prompt adherence in diffusion models |
| Cross-attention fusion | Fusing modalities by letting tokens of one modality attend over features of another (used in Flamingo) |
| Q-Former | Bottleneck transformer with learned queries that distills image features for LLM input (used in BLIP-2) |
| VLM | Vision-Language Model: any model that processes both images and text |
| Log-Mel spectrogram | Standard audio representation: energy over time in Mel-scaled frequency bands, with log-compressed magnitudes |
| ColPali | Document retrieval method treating pages as images with patch-level multi-vector embeddings |
| Image tokens | How images are represented in VLM context windows, typically 256–1024 tokens per image |
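
To make the contrastive learning and InfoNCE entries more concrete, here is a minimal standard-library sketch of a symmetric contrastive loss over a batch of matched image and text embeddings. It is an illustration under stated assumptions, not CLIP's actual training code: the function names are hypothetical, the embeddings are hand-written toy vectors, and the temperature is held at a fixed illustrative value (CLIP treats it as a learned parameter).

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: pair i is the positive, every other pairing is a negative.

    temperature=0.07 is an assumed fixed value chosen for illustration.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
              for img in imgs]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(clip_style_loss(aligned, aligned) < clip_style_loss(aligned, swapped))  # True
```

The loss is near zero when each image embedding is closest to its own caption and large when the pairings are swapped, which is the "pull together, push apart" behavior the glossary describes.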
© 2026 EngineersOfAI. All rights reserved.