3D-ReGen: A Unified 3D Geometry Regeneration Framework
We consider the problem of regenerating 3D objects from 2D images and initial 3D shapes. Most 3D generators operate in a one-shot fashion, converting te...
We consider the problem of regenerating 3D objects from 2D images and initial 3D shapes. Most 3D generators operate in a one-shot fashion, converting te...
A dataset server must often distribute the same large payload to many clients, incurring massive communication costs. Since clients frequently operate o...
Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representati...
Despite advances in machine learning-based medical image classifiers, the safety and reliability of these systems remain major concerns in practical set...
Academic integrity continues to face the persistent challenge of examination cheating. Traditional invigilation relies on human observation, which is in...
Effective human behavior modeling requires a representation of the human body movement that capitalizes on its compositionality. We propose a hierarchic...
Large vision--language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of inp...
We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS featur...
The expansion of retrieval-augmented generation (RAG) into multimodal domains has intensified the challenge for processing complex visual documents, suc...
As forgery types continue to emerge consistently, Incremental Face Forgery Detection (IFFD) has become a crucial paradigm. However, existing methods typ...
Counting immunopositive cells on biological tissues generally requires either manual annotation or (when available) automatic rough systems, for scannin...
Synthesizing physically plausible articulated human-object interactions (HOI) without 3D/4D supervision remains a fundamental challenge. While recent ze...
Orofacial clefts are among the most common congenital craniofacial abnormalities, yet accurate prenatal detection remains challenging due to the scarcit...
Precise and real-time visual localization is critical for applications like AR/VR and robotics, especially on resource-constrained edge devices such as...
When humans perceive the world, they naturally integrate multiple audio-visual tasks within dynamic, real-world scenes. However, current works such as e...
Deep learning in cardiac MRI (CMR) is fundamentally constrained by both data scarcity and privacy regulations. This study systematically benchmarks thre...
Predicting the manufacturability of CAD designs early, in terms of both feasibility and required effort, is a key goal of Design for Manufacturing (DFM)...
The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic understan...
We introduce **CRYSTAL** (*__C__lear __R__easoning via __Y__ielded __S__teps, __T__raceability and __L__ogic*), a diagnostic benchmark with 6,372 instan...
Modern visual world modeling systems increasingly rely on high-capacity architectures and large-scale data to produce plausible motion, yet they often f...
Density aggregation is a central problem in machine learning, for instance when combining predictions from a Deep Ensemble. The choice of aggregation re...
Night Photography Rendering (NPR) poses a significant challenge due to the extreme contrast between dark and illuminated areas in scenes, stemming from...
It is currently difficult to distill discrete diffusion models. In contrast, continuous diffusion literature has many distillation approaches methods th...
Conditional image editing aims to modify a source image according to textual prompts and optional reference guidance. Such editing is crucial in scenari...
Automated building facade inspection is a critical component of urban resilience and smart city maintenance. Traditionally, this field has relied on spe...
Conventional fine-tuning on domain-specific datasets can inadvertently alter a model's pretrained multimodal priors, leading to reduced generalization....
Interactive segmentation enables clinicians to guide annotation, but existing zero-shot models like nnInteractive fail to consistently reach expert-leve...
Time-to-Collision (TTC) forecasting is a critical task in collision prevention, requiring precise temporal prediction and comprehending both local and g...
Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretra...
Compositional generalization, the ability to recognize familiar parts in novel contexts, is a defining property of intelligent systems. Although modern...
Topological features play an essential role in ensuring geometric plausibility and structural consistency in image analysis tasks such as segmentation a...
Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text...
Generative models trained on sensitive image datasets risk memorizing and reproducing individual training examples, making strong privacy guarantees ess...
Consumer LiDARs in mobile devices and robots typically output a single depth value per pixel. Yet internally, they record full time-resolved histograms...
Panoramic radiography is a fundamental diagnostic tool in dentistry, offering a comprehensive view of the entire dentition with minimal radiation exposu...
Many segmentation tasks, such as medical image segmentation or future state prediction, are inherently ambiguous, meaning that multiple predictions are...
Brain tumor classification from magnetic resonance imaging, which is also known as MRI, plays a sensitive role in computer-assisted diagnosis systems. I...
Diffusion-based image compression has recently shown outstanding perceptual fidelity, yet its practicality is hindered by prohibitive sampling overhead...
Vision-language models encode continuous geometry that their text pathway fails to express: a 6,000-parameter linear probe extracts hand joint angles at...
Reasoning in vision-language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks....
In this study, we proposed a deep Swin-Vision Transformer-based transfer learning architecture for robust multi-cancer histopathological image classific...
Predicting product quality from multimodal item information is critical in cold-start scenarios, where user interaction history is unavailable and predi...
Generative world models have shown promise for simulating dynamic environments, yet egocentric video remains challenging due to rapid viewpoint changes,...
Egocentric video understanding is inherently complex due to the dynamic 4D nature of the environment, where camera motion and object displacements neces...
Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household ta...
Attributing authorship to paintings is a historically complex task, and one of its main challenges is the limited availability of real artworks for trai...
Atmospheric haze significantly degrades wildlife imagery, impeding computer vision applications critical for conservation, such as animal detection, tra...
Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt comple...
Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains,...
Single-point supervised infrared small target detection (IRSTD) drastically reduces dense annotation costs. Current state-of-the-art (SOTA) methods achi...
Face de-identification (FDeID) aims to remove personally identifiable information from facial images while preserving task-relevant utility attributes s...
Reinforcement learning has advanced video reasoning in large multi-modal models, yet dominant pipelines either rely on on-policy self-exploration, which...
UAV vision-language navigation (VLN) requires an agent to navigate complex 3D environments from an egocentric perspective while following ambiguous mult...
Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward...
Decoupled dataset distillation (DD) compresses large corpora into a few synthetic images by matching a frozen teacher's statistics. However, current res...
The ability to understand long videos is vital for embodied intelligent agents, because their effectiveness depends on how well they can accumulate, org...
Aerosol Optical Depth (AOD) retrieval is essential for Earth observation, supporting applications from air quality monitoring to climate studies. Conven...
What is a diffusion model actually doing when it turns noise into a photograph? We show that the deterministic DDIM reverse chain operates as a Partitio...
Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are u...
Understanding affective dynamics in real-world social systems is fundamental to modeling and analyzing human-human interactions in complex environments....
Reconstructing 3D scenes from sparse, unposed images remains challenging under real-world conditions with varying illumination and transient occlusions....
Reconstructing photorealistic and animatable 4D head avatars from a single portrait image remains a fundamental challenge in computer vision. While diff...
Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (Vid...
Gaze estimation methods commonly use facial appearances to predict the direction of a person gaze. However, previous studies show three major challenges...
We introduce Helios, the first 14B video generation model that runs at 19.5 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while m...
Driving world models serve as a pivotal technology for autonomous driving by simulating environmental dynamics. However, existing approaches predominant...
Underwater images often suffer from severe degradation, such as color distortion, low contrast, and blurred details, due to light absorption and scatter...
Humans perceive actions through key transitions that structure actions across multiple abstraction levels, whereas machines, relying on visual features,...
Batch effects arising from technical variations in histopathology staining protocols, scanners, and acquisition pipelines pose a persistent challenge fo...
In many CLIP adaptation methods, a blending ratio hyperparameter controls the trade-off between general pretrained CLIP knowledge and the limited, datas...
We present a method for generating a full 360° orbit video around a person from a single input image. Existing methods typically adapt image-based diffu...
Non-contrast chest CTs offer a rich opportunity for both conventional pulmonary and opportunistic extra-pulmonary screening. While Multi-Task Learning (...
In this work, we propose Image-to-Image Rectified Flow Reformulation (I2I-RFR), a practical plug-in reformulation that recasts standard I2I regression n...
Geometric high-fidelity mesh reconstruction from LiDAR-inertial scans remains challenging in large, complex indoor environments -- such as cultural buil...
Understanding the theoretical foundations of attention mechanisms remains challenging due to their complex, non-linear dynamics. This work reveals a fun...
Vision Language models (VLMs) have demonstrated strong performance across a wide range of benchmarks, yet they often suffer from modality dominance, whe...
Standard diffusion models for graph generation typically rely on uniform time-stepping, an approach that overlooks the non-homogeneous dynamics of distr...
We present a training-free approach for controllable 3D inpainting based on initial noise optimization. In the structured 3D latent diffusion framework,...
Diffusion-based Real-World Image Super-Resolution (Real-ISR) achieves impressive perceptual quality but suffers from high computational costs due to ite...
Recent work has shown that neural networks can perform 3D tasks such as Novel View Synthesis (NVS) without explicit 3D reconstruction. Even so, we argue...
Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (...
Vision-Language-Action (VLA) models have increasingly incorporated reasoning mechanisms for complex robotic manipulation. However, existing approaches s...
Knee osteoarthritis (OA) assessment involves a natural but often underused label hierarchy: a coarse binary OA decision and a fine-grained Kellgren--Law...
In this paper, we present \textbf{Gen}erative \textbf{L}anguage-\textbf{I}mage \textbf{P}re-training (GenLIP), a minimalist generative pretraining frame...
The accurate and automatic extraction of roads from satellite imagery is critical for applications in navigation and urban planning, significantly reduc...
Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained contr...
Key-Value (KV) cache has become a de facto component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency...
High-dimensional images, or images with a high-dimensional attribute vector per pixel, are commonly explored with coordinated views of a low-dimensional...
In recent times, large datasets hinder efficient model training while also containing redundant concepts. Dataset distillation aims to synthesize compac...
3D world generation is essential for applications such as immersive content creation or autonomous driving simulation. Recent advances in 3D world gener...
We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically...
Video generative models show emerging reasoning behaviors. It is essential to ensure that generated events remain causally consistent across frames for...
Recent methods for arbitrary-skeleton motion capture from monocular video follow a factorized pipeline, where a Video-to-Pose network predicts joint pos...
Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form d...
Next-generation autonomous vehicles (AVs) rely on large volumes of multisource and multimodal ($M^2$) data to support real-time decision-making. In prac...
Urban perception describes how people subjectively evaluate urban environments, shaping how cities are experienced and understood. Existing computationa...
Animal re-identification (ReID) faces critical challenges due to viewpoint variations, particularly in Aerial-Ground (AG-ReID) settings where models mus...
With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, pe...
Robotic manipulation requires understanding both the 3D spatial structure of the environment and its temporal evolution, yet most existing policies over...
Multimodal Large Language Models (MLLM) classification performance depends critically on evaluation protocol and ground truth quality. Studies comparing...
Recognizing surgical phases and steps from video is a fundamental problem in computer-assisted interventions. Recent approaches increasingly rely on lar...
Video-driven human reaction generation aims to synthesize 3D human motions that directly react to observed video sequences, which is crucial for buildin...
Modern microscopy routinely produces gigapixel images that contain structures across multiple spatial scales, from fine cellular morphology to broader t...
Negation is a fundamental linguistic operator, yet it remains inadequately modeled in diffusion-based generative systems. In this work, we present a for...
This paper presents NOIR, a framework that reframes core medical imaging tasks as operator learning between continuous function spaces, challenging the...
While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture...
Human-robot collaboration has been studied primarily in dyadic or sequential settings. However, real homes require multiadic collaboration, where multip...
We present Online3R, a new sequential reconstruction framework that is capable of adapting to new scenes through online learning, effectively resolving...
Evolutions in the world, such as water pouring or ice melting, happen regardless of being observed. Video world models generate worlds via 2D frame ob...
Panoramic imagery provides holistic 360° visual coverage for perception in quadruped robots. However, existing occupancy prediction methods are mainly d...
Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge device...
In modern human-robot collaboration (HRC) applications, multiple perception modules jointly extract visual, auditory, and contextual cues to achieve com...
While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a 'Visual Signal Dilution' p...
Immunohistochemical (IHC) staining enables precise molecular profiling of protein expression, with over 200 clinically available antibody-based tests in...
Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebou...
Image editing instructions are heterogeneous: a color swap, an object insertion, and a physical-action edit all demand different spatial coverage and di...
Recent progress in text-conditioned human motion generation has been largely driven by diffusion models trained on large-scale human motion data. Buildi...
Plug-and-Play diffusion prior (PnPDP) frameworks have emerged as a powerful paradigm for solving imaging inverse problems by treating pretrained generat...
Constructing computer-aided design (CAD) models is labor-intensive but essential for engineering and manufacturing. Recent advances in Large Language Mo...
Flow matching (FM) trains a time-dependent vector field that transports samples from a simple prior to a complex data distribution. However, for high-di...
Three-dimensional medical image data and computer-aided decision making, particularly using deep learning, are becoming increasingly important in the me...
Medical diagnosis requires the effective synthesis of visual manifestations and clinical metadata. However, existing methods often treat metadata as iso...
The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforc...
Remote sensing segmentation in real deployment is inherently continual: new semantic categories emerge, and acquisition conditions shift across seasons,...
Recent significant advances in 3D scene representation have been driven by 3D Gaussian Splatting (3DGS), which has enabled real-time rendering with phot...
Edge detection refers to identifying points in a digital image where intensity changes sharply, indicating object boundaries or structural features. Cor...
Pathology report generation remains a relatively under-explored downstream task, primarily due to the gigapixel scale and complex morphological heteroge...
Vision transformers have recently made a breakthrough in computer vision showing excellent performance in terms of precision for numerous applications....
Recovering camera parameters from images and rendering scenes from novel viewpoints have long been treated as separate tasks in computer vision and grap...
Fully immersive experiences that tightly integrate 6-DoF visual and auditory interaction are essential for virtual and augmented reality. While such exp...
We show that Fréchet Distance (FD), long considered impractical as a training objective, can in fact be effectively optimized in the representation spac...
Machine learning approaches to spatiotemporal physical systems have primarily focused on next-frame prediction, with the goal of learning an accurate em...
We introduce LaviGen, a framework that repurposes 3D generative models for 3D layout generation. Unlike previous methods that infer object layouts from...
From CLIP to GPT-4V to Gemini. Read the 9 most important multimodal AI papers in order — understanding how vision and language were unified.
Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling s...
Universal image restoration (UIR) aims to recover clean images from diverse and unknown degradations using a unified model. Existing UIR methods primari...
With advances in imitation learning (IL) and large-scale driving datasets, end-to-end autonomous driving (E2E-AD) has made great progress recently. Curr...
Distilling video generation models to extremely low inference budgets (e.g., 2--4 NFEs) is crucial for real-time deployment, yet remains challenging. Tr...
Evaluation of AI systems often requires synthetic test cases, particularly for rare or safety-critical conditions that are difficult to observe in opera...
The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior ste...
Cross-modal Thermal Geo-localization (TG) provides a robust, all-weather solution for Unmanned Aerial Vehicles (UAVs) in Global Navigation Satellite Sys...
Incremental Few-Shot (IFS) segmentation aims to learn new categories over time from only a few annotations. Although widely studied in 2D, it remains un...
Diffusion models trained on noisy datasets often reproduce high-frequency training artifacts, significantly degrading generation quality. To address thi...
Few-Shot Medical Image Segmentation (FSMIS) aims to segment novel object classes in medical images using only minimal annotated examples, addressing the...
Prompt learning is a parameter-efficient approach for vision-language models, yet its robustness under label noise is less investigated. Visual content...
We identify occlusion reasoning as a fundamental yet overlooked aspect for 3D layout-conditioned generation. It is essential for synthesizing partially...
Strong semantic representations improve the convergence and generation quality of diffusion and flow models. Existing approaches largely rely on externa...
Diffusion models achieve state-of-the-art video generation quality, but their inference remains expensive due to the large number of sequential denoisin...
Bio-inspired event cameras have recently attracted significant research due to their asynchronous and low-latency capabilities. These features provide a...
Object detection in unmanned aerial vehicle (UAV) images remains a highly challenging task, primarily caused by the complexity of background noise and t...
Robotic harvesting in dense crop canopies requires effective interventions that depend not only on geometry, but also on explicit, direction-conditioned...
Human motion prediction combines the tasks of trajectory forecasting and human pose prediction. For each of the two tasks, specialized models have been...
Human action recognition plays an important role when developing intelligent interactions between humans and machines. While there is a lot of active re...
Diffuse direct time-of-flight LiDARs report per-pixel depth histograms formed by aggregating photon returns over a wide instantaneous field of view, vio...
Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redund...
Real-world systems must continuously adapt to novel concepts from limited data without forgetting previously acquired knowledge. While Few-Shot Class-In...
Temporal Video Grounding (TVG) aims to localize temporal moments in an untrimmed video that semantically correspond to given natural language queries. R...
Bronchoscopic navigation relies on registering endoscopic video to a preoperative CT scan, but respiratory motion deforms the airway by 5-20 mm, creatin...
One of the dominant paradigms in self-supervised learning (SSL), illustrated by MoCo or DINO, aims to produce robust representations by capturing featur...
Surgeons don't just see -- they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it...
We introduce SurgFormer, a multiresolution gated transformer for data driven soft tissue simulation on volumetric meshes. High fidelity biomechanical so...
Token pruning has emerged as a mainstream approach for developing efficient Video Large Language Models (Video LLMs). This work revisits and advances th...
Traditional vision-language models struggle with contrastive fine-grained taxonomic reasoning, particularly when distinguishing between visually similar...
This study explores artificial visual creativity, focusing on ChatGPT's ability to generate new images intentionally pastiching original artworks such a...
Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance--as it does in vi...
This paper reviews the NTIRE 2026 challenge on efficient single-image super-resolution with a focus on the proposed solutions and results. The aim of th...
Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large l...
Earth observation (EO) missions traditionally rely on transmitting raw or minimally processed imagery from satellites to ground stations for computation...
Vision-language models (VLMs) show promise in drafting radiology reports, yet they frequently suffer from logical inconsistencies, generating diagnostic...
Concept Bottleneck Models (CBMs) are interpretable models that route predictions through a layer of human-interpretable concepts. While widely studied i...
In real scenarios, videos can span several minutes or even hours. However, existing research on spatio-temporal video grounding (STVG), given a textual...
Spatio-temporal scene graphs provide a principled representation for modeling evolving object interactions, yet existing methods remain fundamentally fr...
Dense 4D reconstruction from unposed images remains a critical challenge, with current methods relying on slow test-time optimization or fragmented, tas...
Despite their capabilities, Multimodal Large Language Models (MLLMs) may produce plausible but erroneous outputs, hindering reliable deployment. Accurat...
Large-scale Vision-Language Foundation Models (VLFMs), such as CLIP, now underpin a wide range of computer vision research and applications. VLFMs are o...
The recent success of reinforcement learning (RL) in large reasoning models has inspired the growing adoption of RL for post-training Multimodal Large L...
We present UniScale, a unified, scale-aware multi-view 3D reconstruction framework for robotic applications that flexibly integrates geometric priors th...
Unsupervised deraining has attracted attention for its ability to learn the real-world distribution of rain without paired supervision. However, the lac...
With the development of deep learning, medical image processing has been widely used to assist clinical research. This paper focuses on the denoising pr...
Sparse regularization plays a central role in solving inverse problems arising from incomplete or corrupted measurements. Different regularizers corresp...
We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requi...
Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still heavily rely on greedy parsing over densely...
Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contrib...
Visual Retrieval-Augmented Generation (VRAG) empowers Vision-Language Models to retrieve and reason over visually rich documents. To tackle complex quer...
Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still str...
Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representat...
Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certain...
Most of the recent generative image super-resolution (SR) methods rely on adapting large text-to-image (T2I) diffusion models pretrained on web-scale te...
Image geolocalization has traditionally been addressed through retrieval-based place recognition or geometry-based visual localization pipelines. Recent...
Predicting future states in uncertain environments, such as wildfire spread, medical diagnosis, or autonomous driving, requires models that can consider...
Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and $π^3$ have a computational cost t...