216 docs tagged with "multimodal"

3D-ReGen: A Unified 3D Geometry Regeneration Framework

We consider the problem of regenerating 3D objects from 2D images and initial 3D shapes. Most 3D generators operate in a one-shot fashion, converting te...

A Dataset is Worth 1 MB

A dataset server must often distribute the same large payload to many clients, incurring massive communication costs. Since clients frequently operate o...

A Mixed Diet Makes DINO An Omnivorous Vision Encoder

Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representati...

A multimodal slice discovery framework for systematic failure detection and explanation in medical image classification

Despite advances in machine learning-based medical image classifiers, the safety and reliability of these systems remain major concerns in practical set...

A Two-Stage, Object-Centric Deep Learning Framework for Robust Exam Cheating Detection

Academic integrity continues to face the persistent challenge of examination cheating. Traditional invigilation relies on human observation, which is in...

Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements

Effective human behavior modeling requires a representation of the human body movement that capitalizes on its compositionality. We propose a hierarchic...

Adaptive Greedy Frame Selection for Long Video Understanding

Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of inp...

AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images

We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS featur...

AgenticOCR: Parsing Only What You Need for Efficient Retrieval-Augmented Generation

The expansion of retrieval-augmented generation (RAG) into multimodal domains has intensified the challenge for processing complex visual documents, suc...

AIFIND: Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection

As forgery types continue to emerge consistently, Incremental Face Forgery Detection (IFFD) has become a crucial paradigm. However, existing methods typ...

An automatic counting algorithm for the quantification and uncertainty analysis of the number of microglial cells trainable in small and heterogeneous datasets

Counting immunopositive cells on biological tissues generally requires either manual annotation or (when available) automatic rough systems, for scannin...

ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors

Synthesizing physically plausible articulated human-object interactions (HOI) without 3D/4D supervision remains a fundamental challenge. While recent ze...

Artificial Intelligence for Detecting Fetal Orofacial Clefts and Advancing Medical Education

Orofacial clefts are among the most common congenital craniofacial abnormalities, yet accurate prenatal detection remains challenging due to the scarcit...

AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization

Precise and real-time visual localization is critical for applications like AR/VR and robotics, especially on resource-constrained edge devices such as...

Audio-Language Models

How modern AI systems process speech and audio - from Whisper's spectrogram-based ASR to end-to-end audio understanding in GPT-4o and Gemini.

Automated Prediction of Postoperative Pancreatic Fistula Using Preoperative Computed Tomography

Postoperative pancreatic fistula (POPF) is a serious complication after pancreatic resection, increasing morbidity, hospital stay, and healthcare costs....

AV-Unified: A Unified Framework for Audio-visual Scene Understanding

When humans perceive the world, they naturally integrate multiple audio-visual tasks within dynamic, real-world scenes. However, current works such as e...

Balancing Fidelity, Utility, and Privacy in Synthetic Cardiac MRI Generation: A Comparative Study

Deep learning in cardiac MRI (CMR) is fundamentally constrained by both data scarcity and privacy regulations. This study systematically benchmarks thre...

BenDFM: A taxonomy and synthetic CAD dataset for manufacturability assessment in sheet metal bending

Predicting the manufacturability of CAD designs early, in terms of both feasibility and required effort, is a key goal of Design for Manufacturing (DFM)...

BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations

The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic understan...

Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation

We introduce CRYSTAL (Clear Reasoning via Yielded Steps, Traceability and Logic), a diagnostic benchmark with 6,372 instan...

Beyond Gaussian Bottlenecks: Topologically Aligned Encoding of Vision-Transformer Feature Spaces

Modern visual world modeling systems increasingly rely on high-capacity architectures and large-scale data to produce plausible motion, yet they often f...

Beyond Mixtures and Products for Ensemble Aggregation: A Likelihood Perspective on Generalized Means

Density aggregation is a central problem in machine learning, for instance when combining predictions from a Deep Ensemble. The choice of aggregation re...

Beyond Pixel Fidelity: Minimizing Perceptual Distortion and Color Bias in Night Photography Rendering

Night Photography Rendering (NPR) poses a significant challenge due to the extreme contrast between dark and illuminated areas in scenes, stemming from...

Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD

It is currently difficult to distill discrete diffusion models. In contrast, the continuous diffusion literature has many distillation methods th...

CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator

Conditional image editing aims to modify a source image according to textual prompts and optional reference guidance. Such editing is crucial in scenari...

Can Large Multimodal Models Inspect Buildings? A Hierarchical Benchmark for Structural Pathology Reasoning

Automated building facade inspection is a critical component of urban resilience and smart city maintenance. Traditionally, this field has relied on spe...

Chain-of-Adaptation: Surgical Vision-Language Adaptation with Reinforcement Learning

Conventional fine-tuning on domain-specific datasets can inadvertently alter a model's pretrained multimodal priors, leading to reduced generalization....

CLIP and Contrastive Learning

How CLIP uses contrastive learning on 400M image-text pairs to build a shared semantic embedding space - enabling zero-shot classification without labeled data.

CLoPA: Continual Low Parameter Adaptation of Interactive Segmentation for Medical Image Annotation

Interactive segmentation enables clinicians to guide annotation, but existing zero-shot models like nnInteractive fail to consistently reach expert-leve...

CoFiDA-M: Concept-Aware Feature Modulation for Cross-Domain Adaptation with Image-Only Inference

Models for AI-based skin cancer screening suffer a severe performance drop when shifting from expert dermoscopic (source) images to consumer-grade clini...

CollideNet: Hierarchical Multi-scale Video Representation Learning with Disentanglement for Time-To-Collision Forecasting

Time-to-Collision (TTC) forecasting is a critical task in collision prevention, requiring precise temporal prediction and comprehending both local and g...

CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretra...

Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models

Compositional generalization, the ability to recognize familiar parts in novel contexts, is a defining property of intelligent systems. Although modern...

Continuous-tone Simple Points: An $\ell_0$-Norm of Cyclic Gradient for Topology-Preserving Data-Driven Image Segmentation

Topological features play an essential role in ensuring geometric plausibility and structural consistency in image analysis tasks such as segmentation a...

CoVR-R: Reason-Aware Composed Video Retrieval

Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text...

Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling

Generative models trained on sensitive image datasets risk memorizing and reproducing individual training examples, making strong privacy guarantees ess...

DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs

Consumer LiDARs in mobile devices and robots typically output a single depth value per pixel. Yet internally, they record full time-resolved histograms...

Dental Panoramic Radiograph Analysis Using YOLO26 From Tooth Detection to Disease Diagnosis

Panoramic radiography is a fundamental diagnostic tool in dentistry, offering a comprehensive view of the entire dentition with minimal radiation exposu...

Deterministic Mode Proposals: An Efficient Alternative to Generative Sampling for Ambiguous Segmentation

Many segmentation tasks, such as medical image segmentation or future state prediction, are inherently ambiguous, meaning that multiple predictions are...

Diffusion Models

How denoising diffusion models learn to reverse a Gaussian noise process to generate high-quality images from text prompts.

Diffusion-Based Feature Denoising and Using NNMF for Robust Brain Tumor Classification

Brain tumor classification from magnetic resonance imaging (MRI) plays a sensitive role in computer-assisted diagnosis systems. I...

DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression

Diffusion-based image compression has recently shown outstanding perceptual fidelity, yet its practicality is hindered by prohibitive sampling overhead...

Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement

Vision-language models encode continuous geometry that their text pathway fails to express: a 6,000-parameter linear probe extracts hand joint angles at...

Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

Reasoning in vision-language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks....

DSVTLA: Deep Swin Vision Transformer-Based Transfer Learning Architecture for Multi-Type Cancer Histopathological Cancer Image Classification

In this study, we propose a deep Swin-Vision Transformer-based transfer learning architecture for robust multi-cancer histopathological image classific...

EffiMiniVLM: A Compact Dual-Encoder Regression Framework

Predicting product quality from multimodal item information is critical in cold-start scenarios, where user interaction history is unavailable and predi...

EgoForge: Goal-Directed Egocentric World Simulator

Generative world models have shown promise for simulating dynamic environments, yet egocentric video remains challenging due to rapid viewpoint changes,...

EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking

Egocentric video understanding is inherently complex due to the dynamic 4D nature of the environment, where camera motion and object displacements neces...

EGOSTREAM: A Diagnostic Benchmark for Streaming Episodic Memory in Egocentric Vision

Continuous episodic memory is a core capability for autonomous agents operating in dynamic, real-world environments, yet current streaming video benchma...

EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household ta...

Enhancing Authorship Attribution with Synthetic Paintings

Attributing authorship to paintings is a historically complex task, and one of its main challenges is the limited availability of real artworks for trai...

Enhancing Hazy Wildlife Imagery: AnimalHaze3k and IncepDehazeGan

Atmospheric haze significantly degrades wildlife imagery, impeding computer vision applications critical for conservation, such as animal detection, tra...

Enhancing Spatial Understanding in Image Generation via Reward Modeling

Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt comple...

Envisioning the Future, One Step at a Time

Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains,...

Exploring the Limits of End-to-End Feature-Affinity Propagation for Single-Point Supervised Infrared Small Target Detection

Single-point supervised infrared small target detection (IRSTD) drastically reduces dense annotation costs. Current state-of-the-art (SOTA) methods achi...

FDeID-Toolbox: Face De-Identification Toolbox

Face de-identification (FDeID) aims to remove personally identifiable information from facial images while preserving task-relevant utility attributes s...

Feature-Optimized Vision for Adaptive 3D Scene Reconstruction

Three-dimensional scene reconstruction depends on local image evidence that is both visually discriminative and geometrically useful. Fixed feature thre...

Find, Fix, Reason: Context Repair for Video Reasoning

Reinforcement learning has advanced video reasoning in large multi-modal models, yet dominant pipelines either rely on on-policy self-exploration, which...

FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation

UAV vision-language navigation (VLN) requires an agent to navigate complex 3D environments from an egocentric perspective while following ambiguous mult...

Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models

Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward...

Fixed Anchors Are Not Enough: Dynamic Retrieval and Persistent Homology for Dataset Distillation

Decoupled dataset distillation (DD) compresses large corpora into a few synthetic images by matching a frozen teacher's statistics. However, current res...

FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering

The ability to understand long videos is vital for embodied intelligent agents, because their effectiveness depends on how well they can accumulate, org...

Foundation AI Models for Aerosol Optical Depth Estimation from PACE Satellite Data

Aerosol Optical Depth (AOD) retrieval is essential for Earth observation, supporting applications from air quality monitoring to climate studies. Conven...

Fractals made Practical: Denoising Diffusion as Partitioned Iterated Function Systems

What is a diffusion model actually doing when it turns noise into a photograph? We show that the deterministic DDIM reverse chain operates as a Partitio...

From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are u...

GAViD: A Large-Scale Multimodal Dataset for Context-Aware Group Affect Recognition from Videos

Understanding affective dynamics in real-world social systems is fundamental to modeling and analyzing human-human interactions in complex environments....

Generalizable Sparse-View 3D Reconstruction from Unconstrained Images

Reconstructing 3D scenes from sparse, unposed images remains challenging under real-world conditions with varying illumination and transient occlusions....

GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction

Reconstructing photorealistic and animatable 4D head avatars from a single portrait image remains a fundamental challenge in computer vision. While diff...

Geometry-Guided Camera Motion Understanding in VideoLLMs

Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (Vid...

GMGaze: MoE-Based Context-Aware Gaze Estimation with CLIP and Multiscale Transformer

Gaze estimation methods commonly use facial appearances to predict the direction of a person's gaze. However, previous studies show three major challenges...

Helios: Real Real-Time Long Video Generation Model

We introduce Helios, the first 14B video generation model that runs at 19.5 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while m...

HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation

Driving world models serve as a pivotal technology for autonomous driving by simulating environmental dynamics. However, existing approaches predominant...

Hero-Mamba: Mamba-based Dual Domain Learning for Underwater Image Enhancement

Underwater images often suffer from severe degradation, such as color distortion, low contrast, and blurred details, due to light absorption and scatter...

Hierarchical Action Learning for Weakly-Supervised Action Segmentation

Humans perceive actions through key transitions that structure actions across multiple abstraction levels, whereas machines, relying on visual features,...

Histopathology Image Normalization via Latent Manifold Compaction

Batch effects arising from technical variations in histopathology staining protocols, scanners, and acquisition pipelines pose a persistent challenge fo...

Hold-One-Shot-Out (HOSO) for Validation-Free Few-Shot CLIP Adapters

In many CLIP adaptation methods, a blending ratio hyperparameter controls the trade-off between general pretrained CLIP knowledge and the limited, datas...

HumanOrbit: 3D Human Reconstruction as 360° Orbit Generation

We present a method for generating a full 360° orbit video around a person from a single input image. Existing methods typically adapt image-based diffu...

HyperCT: Low-Rank Hypernet for Unified Chest CT Analysis

Non-contrast chest CTs offer a rich opportunity for both conventional pulmonary and opportunistic extra-pulmonary screening. While Multi-Task Learning (...

Improving Image-to-Image Translation via a Rectified Flow Reformulation

In this work, we propose Image-to-Image Rectified Flow Reformulation (I2I-RFR), a practical plug-in reformulation that recasts standard I2I regression n...

Incremental Semantics-Aided Meshing from LiDAR-Inertial Odometry and RGB Direct Label Transfer

Geometric high-fidelity mesh reconstruction from LiDAR-inertial scans remains challenging in large, complex indoor environments -- such as cultural buil...

Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics

Understanding the theoretical foundations of attention mechanisms remains challenging due to their complex, non-linear dynamics. This work reveals a fun...

Information Router for Mitigating Modality Dominance in Vision-Language Models

Vision-language models (VLMs) have demonstrated strong performance across a wide range of benchmarks, yet they often suffer from modality dominance, whe...

Information-geometric adaptive sampling for graph diffusion

Standard diffusion models for graph generation typically rely on uniform time-stepping, an approach that overlooks the non-homogeneous dynamics of distr...

InpaintSLat: Inpainting Structured 3D Latents via Initial Noise Optimization

We present a training-free approach for controllable 3D inpainting based on initial noise optimization. In the structured 3D latent diffusion framework,...

Joint Geometric and Trajectory Consistency Learning for One-Step Real-World Super-Resolution

Diffusion-based Real-World Image Super-Resolution (Real-ISR) achieves impressive perceptual quality but suffers from high computational costs due to ite...

Joint Multi-Camera LiDAR Extrinsic Calibration via Learned Pairwise Initialization and Geometric Refinement

Most learning-based camera-LiDAR calibration methods treat each camera-LiDAR pair independently, ignoring the rigid geometric coupling in multi-camera p...

KLIP: localized distribution shift detection via KL-divergence with diffusion priors in Inverse Problems

Diffusion models have shown promising performance as data-driven priors for computational imaging, as well as some capacity to detect out-of-distributio...

LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis

Recent work has shown that neural networks can perform 3D tasks such as Novel View Synthesis (NVS) without explicit 3D reconstruction. Even so, we argue...

Large Multimodal Models as General In-Context Classifiers

Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (...

LaST-R1: Reinforcing Action via Adaptive Physical Latent Reasoning for VLA Models

Vision-Language-Action (VLA) models have increasingly incorporated reasoning mechanisms for complex robotic manipulation. However, existing approaches s...

Learning Coarse-to-Fine Osteoarthritis Representations under Noisy Hierarchical Labels

Knee osteoarthritis (OA) assessment involves a natural but often underused label hierarchy: a coarse binary OA decision and a fine-grained Kellgren-Law...

Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction

Dynamic scene reconstruction from monocular video remains a fundamental challenge in computer vision. Existing feed-forward methods predict 3D Gaussians...

Let ViT Speak: Generative Language-Image Pre-training

In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining frame...

Linear Scaling Video VLMs for Long Video Understanding

Video vision-language models (VLMs) are increasingly used in long-horizon and streaming settings, yet most video encoders still rely on spatiotemporal s...

LineGraph2Road: Structural Graph Reasoning on Line Graphs for Road Network Extraction

The accurate and automatic extraction of roads from satellite imagery is critical for applications in navigation and urban planning, significantly reduc...

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity...

LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained contr...

Make Your LVLM KV Cache More Lightweight

Key-Value (KV) cache has become a de facto component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency...

Manifold-Preserving Superpixel Hierarchies and Embeddings for the Exploration of High-Dimensional Images

High-dimensional images, or images with a high-dimensional attribute vector per pixel, are commonly explored with coordinated views of a low-dimensional...

ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation

In recent times, large datasets hinder efficient model training while also containing redundant concepts. Dataset distillation aims to synthesize compac...

Map2World: Segment Map Conditioned Text to 3D World Generation

3D world generation is essential for applications such as immersive content creation or autonomous driving simulation. Recent advances in 3D world gener...

MediX-R1: Open Ended Medical Reinforcement Learning

We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically...

MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints

Video generative models show emerging reasoning behaviors. It is essential to ensure that generated events remain causally consistent across frames for...

MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons

Recent methods for arbitrary-skeleton motion capture from monocular video follow a factorized pipeline, where a Video-to-Pose network predicts joint pos...

Mode Seeking meets Mean Seeking for Fast Long Video Generation

Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form d...

Modeling and Measuring Redundancy in Multisource Multimodal Data for Autonomous Driving

Next-generation autonomous vehicles (AVs) rely on large volumes of multisource and multimodal ($M^2$) data to support real-time decision-making. In prac...

Modeling Subjective Urban Perception with Human Gaze

Urban perception describes how people subjectively evaluate urban environments, shaping how cities are experienced and understood. Existing computationa...

Module 08: Multimodal Models

Understanding how modern AI systems process images, audio, and text together - from CLIP to diffusion to production pipelines.

MOO: A Multi-view Oriented Observations Dataset for Viewpoint Analysis in Cattle Re-Identification

Animal re-identification (ReID) faces critical challenges due to viewpoint variations, particularly in Aerial-Ground (AG-ReID) settings where models mus...

MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction

With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, pe...

Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

Robotic manipulation requires understanding both the 3D spatial structure of the environment and its temporal evolution, yet most existing policies over...

Multimodal Large Language Models as Image Classifiers

Multimodal Large Language Model (MLLM) classification performance depends critically on evaluation protocol and ground truth quality. Studies comparing...

Multimodal Optimal Transport for Unsupervised Temporal Segmentation in Surgical Robotics

Recognizing surgical phases and steps from video is a fundamental problem in computer-assisted interventions. Recent approaches increasingly rely on lar...

Multimodal RAG

How to build retrieval-augmented generation systems that can retrieve and reason over images, PDFs with figures, slides, and mixed-media documents.

MuSteerNet: Human Reaction Generation from Videos via Observation-Reaction Mutual Steering

Video-driven human reaction generation aims to synthesize 3D human motions that directly react to observed video sequences, which is crucial for buildin...

MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy

Modern microscopy routinely produces gigapixel images that contain structures across multiple spatial scales, from fine cellular morphology to broader t...

NEGATE: Constrained Semantic Guidance for Linguistic Negation in Text-to-Video Diffusion

Negation is a fundamental linguistic operator, yet it remains inadequately modeled in diffusion-based generative systems. In this work, we present a for...

NOIR: Neural Operator mapping for Implicit Representations

This paper presents NOIR, a framework that reframes core medical imaging tasks as operator learning between continuous function spaces, challenging the...

nuReasoning: A Reasoning-Centric Dataset and Benchmark for Long-Tail Autonomous Driving

Reasoning is essential for autonomous driving (AD) in long-tail scenarios, where vehicles must apply commonsense knowledge, understand spatial relations...

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture...

OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction

Human-robot collaboration has been studied primarily in dyadic or sequential settings. However, real homes require multiadic collaboration, where multip...

Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model

We present Online3R, a new sequential reconstruction framework that is capable of adapting to new scenes through online learning, effectively resolving...

Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models

Evolutions in the world, such as water pouring or ice melting, happen regardless of being observed. Video world models generate worlds via 2D frame ob...

Panoramic Multimodal Semantic Occupancy Prediction for Quadruped Robots

Panoramic imagery provides holistic 360° visual coverage for perception in quadruped robots. However, existing occupancy prediction methods are mainly d...

Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge device...

Perceive What Matters: Relevance-Driven Scheduling for Multimodal Streaming Perception

In modern human-robot collaboration (HRC) applications, multiple perception modules jointly extract visual, auditory, and contextual cues to achieve com...

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a 'Visual Signal Dilution' p...

PGVMS: A Prompt-Guided Unified Framework for Virtual Multiplex IHC Staining with Pathological Semantic Learning

Immunohistochemical (IHC) staining enables precise molecular profiling of protein expression, with over 200 clinically available antibody-based tests in...

PhyCo: Learning Controllable Physical Priors for Generative Motion

Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebou...

PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning

Image editing instructions are heterogeneous: a color swap, an object insertion, and a physical-action edit all demand different spatial coverage and di...

PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization

Recent progress in text-conditioned human motion generation has been largely driven by diffusion models trained on large-scale human motion data. Buildi...

Plug-and-Play Diffusion Meets ADMM: Dual-Variable Coupling for Robust Medical Image Reconstruction

Plug-and-Play diffusion prior (PnPDP) frameworks have emerged as a powerful paradigm for solving imaging inverse problems by treating pretrained generat...

Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection

Constructing computer-aided design (CAD) models is labor-intensive but essential for engineering and manufacturing. Recent advances in Large Language Mo...

Posterior Augmented Flow Matching

Flow matching (FM) trains a time-dependent vector field that transports samples from a simple prior to a complex data distribution. However, for high-di...

PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction

Three-dimensional medical image data and computer-aided decision making, particularly using deep learning, are becoming increasingly important in the me...

PRIMA: Pre-training with Risk-integrated Image-Metadata Alignment for Medical Diagnosis via LLM

Medical diagnosis requires the effective synthesis of visual manifestations and clinical metadata. However, existing methods often treat metadata as iso...

PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforc...

Production Multimodal Systems

Build and operate multimodal AI pipelines at production scale - image preprocessing, cost control, VLM hallucination mitigation, caching, security, and observability for vision-language workloads.

ProtoFlow: Mitigating Forgetting in Class-Incremental Remote Sensing Segmentation via Low-Curvature Prototype Flow

Remote sensing segmentation in real deployment is inherently continual: new semantic categories emerge, and acquisition conditions shift across seasons,...

Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives

Recent significant advances in 3D scene representation have been driven by 3D Gaussian Splatting (3DGS), which has enabled real-time rendering with phot...