198 docs tagged with "multimodal"

A Dataset is Worth 1 MB

A dataset server must often distribute the same large payload to many clients, incurring massive communication costs. Since clients frequently operate o...

Audio-Language Models

How modern AI systems process speech and audio - from Whisper's spectrogram-based ASR to end-to-end audio understanding in GPT-4o and Gemini.

CLIP and Contrastive Learning

How CLIP uses contrastive learning on 400M image-text pairs to build a shared semantic embedding space - enabling zero-shot classification without task-specific labeled data.

Diffusion Models

How denoising diffusion models learn to reverse a Gaussian noise process to generate high-quality images from text prompts.

Module 08: Multimodal Models

Understanding how modern AI systems process images, audio, and text together - from CLIP to diffusion to production pipelines.

Multimodal RAG

How to build retrieval-augmented generation systems that can retrieve and reason over images, PDFs with figures, slides, and mixed-media documents.

Posterior Augmented Flow Matching

Flow matching (FM) trains a time-dependent vector field that transports samples from a simple prior to a complex data distribution. However, for high-di...

Production Multimodal Systems

Build and operate multimodal AI pipelines at production scale - image preprocessing, cost control, VLM hallucination mitigation, caching, security, and observability for vision-language workloads.

Spatial Calibration of Diffuse LiDARs

Diffuse direct time-of-flight LiDARs report per-pixel depth histograms formed by aggregating photon returns over a wide instantaneous field of view, vio...

Vision-Language Models

How modern AI systems combine vision encoders with language models to understand and reason about images.