Module 08: Multimodal Models
Understanding how modern AI systems process images, audio, and text together - from CLIP to diffusion to production pipelines.

02. Vision-Language Models - How modern AI systems combine vision encoders with language models to understand and reason about images.
03. CLIP and Contrastive Learning - How CLIP uses contrastive learning on 400M image-text pairs to build a shared semantic embedding space, enabling zero-shot classification without labeled data.
04. Diffusion Models - How denoising diffusion models learn to reverse a Gaussian noise process to generate high-quality images from text prompts.
05. Audio-Language Models - How modern AI systems process speech and audio - from Whisper's spectrogram-based ASR to end-to-end audio understanding in GPT-4o and Gemini.
06. Multimodal RAG - How to build retrieval-augmented generation systems that retrieve and reason over images, PDFs with figures, slides, and mixed-media documents.
07. Production Multimodal Systems - How to build and operate multimodal AI pipelines at production scale: image preprocessing, cost control, VLM hallucination mitigation, caching, security, and observability for vision-language workloads.
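The CLIP lesson above rests on one idea: once images and captions live in a shared embedding space, zero-shot classification reduces to a nearest-caption lookup by cosine similarity. A minimal sketch of that lookup, using random placeholder vectors rather than real CLIP encoder outputs (the embedding size and labels here are illustrative assumptions):

```python
import numpy as np

def normalize(x):
    # Project embeddings onto the unit sphere so a dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Stand-ins for encoder outputs; a real pipeline would produce these with
# CLIP's image encoder and text encoder over prompts like "a photo of a cat".
image_embedding = normalize(rng.normal(size=(1, 512)))   # one image
text_embeddings = normalize(rng.normal(size=(3, 512)))   # one per candidate label
labels = ["cat", "dog", "car"]

# Zero-shot classification: pick the caption whose embedding is closest
# to the image embedding in the shared space.
similarities = image_embedding @ text_embeddings.T       # shape (1, 3)
predicted = labels[int(similarities.argmax())]
print(predicted)
```

With random vectors the "prediction" is meaningless; the point is only the mechanics - no classifier head is trained, and adding a new class is just adding one more caption embedding.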