"Taking Stock at FAccT": Using Participatory Design to Co-Create a Vision for the Fairness, Accountability and Transparency Community
As a relatively new forum, ACM FAccT has become a key space for activists and scholars to critically examine emerging AI and ML technologies. It brings...
$τ$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge
Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on retrieving and applying domain-specif...
A 1/R Law for Kurtosis Contrast in Balanced Mixtures
Kurtosis-based Independent Component Analysis (ICA) weakens in wide, balanced mixtures. We prove a sharp redundancy law: for a standardized projection w...
A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms....
A Dual-Helix Governance Approach Towards Reliable Agentic AI for WebGIS Development
WebGIS development requires rigor, yet agentic AI frequently fails due to five large language model (LLM) limitations: context constraints, cross-sessio...
A Minimal Agent for Automated Theorem Proving
We propose a minimal agentic baseline that enables systematic comparison across different AI-based theorem prover architectures. This design implements...
A Mixed Diet Makes DINO An Omnivorous Vision Encoder
Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representati...
A Quantitative Characterization of Forgetting in Post-Training
Continual post-training of generative models is widely used, yet a principled understanding of when and why forgetting occurs remains limited. We develo...
A Reference Architecture of Reinforcement Learning Frameworks
The surge in reinforcement learning (RL) applications gave rise to diverse supporting technology, such as RL frameworks. However, the architectural patt...
A Systematic Security Evaluation of OpenClaw and Its Variants
Tool-augmented AI agents substantially extend the practical capabilities of large language models, but they also introduce security risks that cannot be...
A Two-Stage, Object-Centric Deep Learning Framework for Robust Exam Cheating Detection
Academic integrity continues to face the persistent challenge of examination cheating. Traditional invigilation relies on human observation, which is in...
Abductive Reasoning with Syllogistic Forms in Large Language Models
Research in AI using Large-Language Models (LLMs) is rapidly evolving, and the comparison of their performance with human reasoning has become a key con...
Adaptive Greedy Frame Selection for Long Video Understanding
Large vision--language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of inp...
Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention
Transformer attention is typically implemented using softmax normalization, which enforces attention weights with unit sum normalization. While effectiv...
Agent Evaluation
Measuring LLM agent performance through trajectory analysis, benchmark suites, LLM-as-judge, failure taxonomies, and production monitoring strategies.
Agent Safety and Guardrails
Implementing defense-in-depth safety for production LLM agents - prompt injection defense, input/output guardrails, tool sandboxing, HITL confirmation, and audit logging.
AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual parti...
AI Agents Can Already Autonomously Perform Experimental High Energy Physics
Large language model-based AI agents are now able to autonomously execute substantial portions of a high energy physics (HEP) analysis pipeline with min...
AI-Assisted Unit Test Writing and Test-Driven Code Refactoring: A Case Study
Many software systems originate as prototypes or minimum viable products (MVPs), developed with an emphasis on delivery speed and responsiveness to chan...
AIFIND: Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection
As forgery types continue to emerge consistently, Incremental Face Forgery Detection (IFFD) has become a crucial paradigm. However, existing methods typ...
Amortized Optimal Transport from Sliced Potentials
We propose a novel amortized optimization method for predicting optimal transport (OT) plans across multiple pairs of measures by leveraging Kantorovich...
An Agentic Multi-Agent Architecture for Cybersecurity Risk Management
Getting a real cybersecurity risk assessment for a small organization is expensive -- a NIST CSF-aligned engagement runs $15,000 on the low end, takes w...
An Efficient Unsupervised Federated Learning Approach for Anomaly Detection in Heterogeneous IoT Networks
Federated learning (FL) is an effective paradigm for distributed environments such as the Internet of Things (IoT), where data from diverse devices with...
An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models
Direct Preference Optimization (DPO) is widely used after supervised fine-tuning (SFT) to align language models, yet empirical behavior under small back...
An Independent Safety Evaluation of Kimi K2.5
Kimi K2.5 is an open-weight LLM that rivals closed models across coding, multimodal, and agentic benchmarks, but was released without an accompanying sa...
ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models
Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with...
ARGUS: Seeing the Influence of Narrative Features on Persuasion in Argumentative Texts
Can narratives make arguments more persuasive? And to this end, which narrative features matter most? Although stories are often seen as powerful tools...
Artificial Intelligence for Detecting Fetal Orofacial Clefts and Advancing Medical Education
Orofacial clefts are among the most common congenital craniofacial abnormalities, yet accurate prenatal detection remains challenging due to the scarcit...
ASMR-Bench: Auditing for Sabotage in ML Research
As AI systems are increasingly used to conduct research autonomously, misaligned systems could introduce subtle flaws that produce misleading results wh...
AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency
Large language models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex tasks. Yet ensuring that the reasoning trace both co...
Augmented Lagrangian Multiplier Network for State-wise Safety in Reinforcement Learning
Safety is a primary challenge in real-world reinforcement learning (RL). Formulating safety requirements as state-wise constraints has become a prominen...
BAGEL: Benchmarking Animal Knowledge Expertise in Language Models
Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well language models...
BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation
Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In...
BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations
The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic understan...
Beyond Distribution Sharpening: The Importance of Task Rewards
Frontier models have demonstrated exceptional capabilities following the integration of task-reward-based reinforcement learning (RL) into their trainin...
Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation
We introduce **CRYSTAL** (*__C__lear __R__easoning via __Y__ielded __S__teps, __T__raceability and __L__ogic*), a diagnostic benchmark with 6,372 instan...
Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations
Large language models are increasingly deployed in settings where reliability matters, yet output-level uncertainty signals such as token probabilities,...
Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation
Large language models (LLMs) encode vast world knowledge in their parameters, yet they remain fundamentally limited by static knowledge, finite context...
Bitwise Systolic Array Architecture for Runtime-Reconfigurable Multi-precision Quantized Multiplication on Hardware Accelerators
Neural network accelerators have been widely applied to edge devices for complex tasks like object tracking, image recognition, etc. Previous works have...
Boosting deep Reinforcement Learning using pretraining with Logical Options
Deep reinforcement learning agents are often misaligned, as they over-exploit early reward signals. Recently, several symbolic approaches have addressed...
BoSS: A Best-of-Strategies Selector as an Oracle for Deep Active Learning
Active learning (AL) aims to reduce annotation costs while maximizing model performance by iteratively selecting valuable instances. While foundation mo...
Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection
Scientific discovery increasingly relies on AI systems to select candidates for expensive experimental validation, yet no principled, budget-aware evalu...
Can Coding Agents Reproduce Findings in Computational Materials Science?
Large language models are increasingly deployed as autonomous coding agents and have achieved remarkably strong performance on software engineering benc...
Can LLMs Understand the Impact of Trauma? Costs and Benefits of LLMs Coding the Interviews of Firearm Violence Survivors
Firearm violence is a pressing public health issue, yet research into survivors' lived experiences remains underfunded and difficult to scale. Qualitati...
Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision
Evidence-grounded reasoning requires more than attaching retrieved text to a prediction: a model should make decisions that depend on whether the provid...
Causality Elicitation from Large Language Models
Large language models (LLMs) are trained on enormous amounts of data and encode knowledge in their parameters. We propose a pipeline to elicit causal re...
Chain-of-Adaptation: Surgical Vision-Language Adaptation with Reinforcement Learning
Conventional fine-tuning on domain-specific datasets can inadvertently alter a model's pretrained multimodal priors, leading to reduced generalization....
Characterising LLM-Generated Competency Questions: a Cross-Domain Empirical Study using Open and Closed Models
Competency Questions (CQs) are a cornerstone of requirement elicitation in ontology engineering. CQs represent requirements as a set of natural language...
Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models
The recent advancements in Vision Language Models (VLMs) have demonstrated progress toward true intelligence requiring robust reasoning capabilities. Be...
ChemGraph-XANES: An Agentic Framework for XANES Simulation and Analysis
Computational X-ray absorption near-edge structure (XANES) is widely used to probe local coordination environments, oxidation states, and electronic str...
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks f...
CLoPA: Continual Low Parameter Adaptation of Interactive Segmentation for Medical Image Annotation
Interactive segmentation enables clinicians to guide annotation, but existing zero-shot models like nnInteractive fail to consistently reach expert-leve...
Clustering Astronomical Orbital Synthetic Data Using Advanced Feature Extraction and Dimensionality Reduction Techniques
The dynamics of Saturn's satellite system offer a rich framework for studying orbital stability and resonance interactions. Traditional methods for anal...
COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics
Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches face a funda...
CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning
Mobile Agents can autonomously execute user instructions, which requires hybrid-capabilities reasoning, including screen summary, subtask planning, acti...
Competition-Aware CPC Forecasting with Near-Market Coverage
Cost-per-click (CPC) in paid search is a volatile auction outcome generated by a competitive landscape that is only partially observable from any single...
Computing Equilibrium beyond Unilateral Deviation
Most familiar equilibrium concepts, such as Nash and correlated equilibrium, guarantee only that no single player can improve their utility by deviating...
Conformalized Neural Networks for Federated Uncertainty Quantification under Dual Heterogeneity
Federated learning (FL) faces challenges in uncertainty quantification (UQ). Without reliable UQ, FL systems risk deploying overconfident models at unde...
Controllable Reasoning Models Are Private Thinkers
AI agents powered by reasoning models require access to sensitive user data. However, their reasoning traces are difficult to control, which can result...
Coupled Control, Structured Memory, and Verifiable Action in Agentic AI (SCRAT -- Stochastic Control with Retrieval and Auditable Trajectories): A Comparative Perspective from Squirrel Locomotion and Scatter-Hoarding
Agentic AI is increasingly judged not by fluent output alone but by whether it can act, remember, and verify under partial observability, delay, and str...
Crab: A Semantics-Aware Checkpoint/Restore Runtime for Agent Sandboxes
Autonomous agents act through sandboxed containers and microVMs whose state spans filesystems, processes, and runtime artifacts. Checkpoint and restore...
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong p...
CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays
Chest X-ray plays a central role in thoracic diagnosis, and its interpretation inherently requires multi-step, evidence-grounded reasoning. However, lar...
DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benc...
Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving
Large Language Model (LLM) adapters enable low-cost model specialization, but introduce complex caching and scheduling challenges in distributed serving...
daVinci-Env: Open SWE Environment Synthesis at Scale
Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for...
Decentralized Ranking Aggregation: Gossip Algorithms for Borda and Copeland Consensus
The concept of ranking aggregation plays a central role in preference analysis, and numerous algorithms for calculating median rankings, often originati...
DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures
Transformer models are widely deployed in critical AI applications, yet faults in their attention mechanisms, projections, and other internal components...
Design-OS: A Specification-Driven Framework for Engineering System Design with a Control-Systems Design Case
Engineering system design -- whether mechatronic, control, or embedded -- often proceeds in an ad hoc manner, with requirements left implicit and tracea...
Developing and evaluating a chatbot to support maternal health care
The ability to provide trustworthy maternal health information using phone-based chatbots can have a significant impact, particularly in low-resource se...
Developing the PsyCogMetrics AI Lab to Evaluate Large Language Models and Advance Cognitive Science -- A Three-Cycle Action Design Science Study
This study presents the development of the PsyCogMetrics AI Lab (psycogmetrics.ai), an integrated, cloud-based platform that operationalizes psychometri...
Directed Social Regard: Surfacing Targeted Advocacy, Opposition, Aid, Harms, and Victimization in Online Media
The language in online platforms, influence operations, and political rhetoric frequently directs a mix of pro-social sentiment (e.g., advocacy, helpful...
Dissecting Quantization Error: A Concentration-Alignment Perspective
Quantization can drastically increase the efficiency of large language and vision models, but typically incurs an accuracy drop. Recently, function-pres...
Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement
Vision-language models encode continuous geometry that their text pathway fails to express: a 6,000-parameter linear probe extracts hand joint angles at...
Do LLMs Benefit From Their Own Words?
Multi-turn interactions with large language models typically retain the assistant's own past responses in the conversation history. In this work, we rev...
Do Sparse Autoencoders Capture Concept Manifolds?
Sparse autoencoders (SAEs) are widely used to extract interpretable features from neural network representations, often under the implicit assumption th...
Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts
Automated annotation of pedagogical dialogue is a high-stakes task where LLMs often fail without sufficient domain grounding. We present a domain-adapte...
Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-st...
E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning
While Large Language Models (LLMs) have demonstrated significant potential in Tool-Integrated Reasoning (TIR), existing training paradigms face signific...
EASE: Federated Multimodal Unlearning via Entanglement-Aware Anchor Closure
Federated Multimodal Learning (FML) trains multimodal models across decentralized clients while keeping their image-text pairs private. However, joint e...
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-...
Efficient Discovery of Approximate Causal Abstractions via Neural Mechanism Sparsification
Neural networks are hypothesized to implement interpretable causal mechanisms, yet verifying this requires finding a causal abstraction -- a simpler, hi...
Efficient Refusal Ablation in LLM through Optimal Transport
Safety-aligned language models refuse harmful requests through learned refusal behaviors encoded in their internal representations. Recent activation-ba...
Empowering Heterogeneous Graph Foundation Models via Decoupled Relation Alignment
While Graph Foundation Models (GFMs) have achieved remarkable success in homogeneous graphs, extending them to multi-domain heterogeneous graphs (MDHGs)...
Enhancing Hyperspace Analogue to Language (HAL) Representations via Attention-Based Pooling for Text Classification
The Hyperspace Analogue to Language (HAL) model relies on global word co-occurrence matrices to construct distributional semantic representations. While...
Enhancing Robustness of Federated Learning via Server Learning
This paper explores the use of server learning for enhancing the robustness of federated learning against malicious attacks even when clients' training...
Envisioning the Future, One Step at a Time
Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains,...
ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation
As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal requiremen...
Evaluating Stochasticity in Deep Research Agents
Deep Research Agents (DRAs) are promising agentic systems that gather and synthesize information to support research across domains such as financial de...
Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction
Leader-follower interaction is an important paradigm in human-robot interaction (HRI). Yet, assigning roles in real time remains challenging for resourc...
Evolving Jailbreaks: Automated Multi-Objective Long-Tail Attacks on Large Language Models
Large Language Models (LLMs) have been widely deployed, especially through free Web-based applications that expose them to diverse user-generated inputs...
FaultXformer: A Transformer-Encoder Based Fault Classification and Location Identification model in PMU-Integrated Active Electrical Distribution System
Accurate fault detection and localization in electrical distribution systems is crucial, especially with the increasing integration of distributed energ...
Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models
Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward...
FlashOptim: Optimizers for Memory Efficient Training
Standard mixed-precision training of neural networks requires many bytes of accelerator memory for each model parameter. These bytes reflect not just th...
FlexiTac: A Low-Cost, Open-Source, Scalable Tactile Sensing Solution for Robotic Systems
We present FlexiTac, a low-cost, open-source, and scalable piezoresistive tactile sensing solution designed for robotic end-effectors. FlexiTac is a pra...
Fly360: Omnidirectional Obstacle Avoidance within Drone View
Obstacle avoidance in unmanned aerial vehicles (UAVs), as a fundamental capability, has gained increasing attention with the growing focus on spatial in...
From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text
The complexity of Vietnam's legal texts presents a significant barrier to public access to justice. While Large Language Models offer a promising soluti...
From Experiments to Expertise: Scientific Knowledge Consolidation for AI-Driven Computational Research
While large language models (LLMs) have transformed AI agents into proficient executors of computational materials science, performing a hundred simulat...
From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering
Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are u...
From Shallow Bayesian Neural Networks to Gaussian Processes: General Convergence, Identifiability and Scalable Inference
In this work, we study scaling limits of shallow Bayesian neural networks (BNNs) via their connection to Gaussian processes (GPs), with an emphasis on s...
Generalization and Scaling Laws for Mixture-of-Experts Transformers
We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates \emph{active} per-input capacity from...
Generalization Properties of Score-matching Diffusion Models for Intrinsically Low-dimensional Data
Despite the remarkable empirical success of score-based diffusion models, their statistical guarantees remain underdeveloped. Existing analyses often pr...
Generalized Rapid Action Value Estimation in Memory-Constrained Environments
Generalized Rapid Action Value Estimation (GRAVE) has been shown to be a strong variant within the Monte-Carlo Tree Search (MCTS) family of algorithms f...
GeoChemAD: Benchmarking Unsupervised Geochemical Anomaly Detection for Mineral Exploration
Geochemical anomaly detection plays a critical role in mineral exploration as deviations from regional geochemical baselines may indicate mineralization...
GeoContra: From Fluent GIS Code to Verifiable Spatial Analysis with Geography-Grounded Repair
Reliable spatial analysis in GIScience requires preserving coordinate semantics, topology, units, and geographic plausibility. Current LLM-based GIS sys...
Geometry-Guided Camera Motion Understanding in VideoLLMs
Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (Vid...
Gradient Boosting within a Single Attention Layer
Transformer attention computes a single softmax-weighted average over values -- a one-pass estimate that cannot correct its own errors. We introduce \em...
Improving Generalization on Cybersecurity Tasks with Multi-Modal Contrastive Learning
The use of ML in cybersecurity has long been impaired by generalization issues: Models that work well in controlled scenarios fail to maintain performan...
InCoder-32B-Thinking: Industrial Code World Model for Thinking
Industrial software development across chip design, GPU optimization, and embedded systems lacks expert reasoning traces showing how engineers reason ab...
InpaintSLat: Inpainting Structured 3D Latents via Initial Noise Optimization
We present a training-free approach for controllable 3D inpainting based on initial noise optimization. In the structured 3D latent diffusion framework,...
Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists
Existing research infrastructure is fundamentally document-centric, providing citation links between papers but lacking explicit representations of meth...
Invariant Transformation and Resampling based Epistemic-Uncertainty Reduction
An artificial intelligence (AI) model can be viewed as a function that maps inputs to outputs in high-dimensional spaces. Once designed and well trained...
Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation
Error Span Detection (ESD) is a crucial subtask in Machine Translation (MT) evaluation, aiming to identify the location and severity of translation erro...
Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization
We propose HILBERT (HIerarchical Long-sequence Balanced Embedding with Reciprocal contrastive Training), a cross-attentive multimodal framework for lear...
JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models
Adapter-based methods have become a cost-effective approach to continual learning (CL) for Large Language Models (LLMs), by sequentially learning a low-...
L2GTX: From Local to Global Time Series Explanations
Deep learning models achieve high accuracy in time series classification, yet understanding their class-level decision behaviour remains challenging. Ex...
LangChain Deep Dive
A thorough guide to LangChain's core abstractions, LCEL composable pipelines, LangGraph stateful workflows, LangSmith observability, and when to use LangChain vs direct API calls.
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely by...
Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection
Multi-turn prompt injection follows a known attack path -- trust-building, pivoting, escalation but text-level defenses miss covert attacks where indivi...
Learnability and Privacy Vulnerability are Entangled in a Few Critical Weights
Prior approaches for membership privacy preservation usually update or retrain all weights in neural networks, which is costly and can lead to unnecessa...
Learning Dynamic Belief Graphs for Theory-of-mind Reasoning
Theory of Mind (ToM) reasoning with Large Language Models (LLMs) requires inferring how people's implicit, evolving beliefs shape what they seek and how...
Learning Flexible Job Shop Scheduling under Limited Buffers and Material Kitting Constraints
The Flexible Job Shop Scheduling Problem (FJSP) originates from real production lines, while some practical constraints are often ignored or idealized i...
Learning from Child-Directed Speech in Two-Language Scenarios: A French-English Case Study
Research on developmentally plausible language models has largely focused on English, leaving open questions about multilingual settings. We present a s...
Learning Rate Transfer in Normalized Transformers
The Normalized Transformer, or nGPT (arXiv:2410.01131) achieves impressive training speedups and does not require weight decay or learning rate warmup....
Learning to Reason with Insight for Informal Theorem Proving
Although most of the automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with large language...
LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics
We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on st...
LiveSense: A Real-Time Wi-Fi Sensing Platform for Range-Doppler on COTS Laptop
We present LiveSense - a cross-platform that transforms a commercial off-the-shelf (COTS) Wi-Fi Network Interface Card (NIC) on a laptop into a centimet...
LlamaIndex Deep Dive
A comprehensive guide to LlamaIndex's data-centric architecture - indices, query engines, workflows, multi-document agents, and how it compares to LangChain for RAG applications.
LLM as Clinical Graph Structure Refiner: Enhancing Representation Learning in EEG Seizure Diagnosis
Electroencephalogram (EEG) signals are vital for automated seizure detection, but their inherent noise makes robust representation learning challenging....
LLM Constitutional Multi-Agent Governance
Large Language Models (LLMs) can generate persuasive influence strategies that shift cooperative behavior in multi-agent populations, but a critical que...
LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable hu...
LoASR-Bench: Evaluating Large Speech Language Models on Low-Resource Automatic Speech Recognition Across Language Families
Large language models (LLMs) have driven substantial advances in speech language models (SpeechLMs), yielding strong performance in automatic speech rec...
Low-Rank Compression of Pretrained Models via Randomized Subspace Iteration
The massive scale of pretrained models has made efficient compression essential for practical deployment. Low-rank decomposition based on the singular v...
Low-Resource Guidance for Controllable Latent Audio Diffusion
Generative audio requires fine-grained controllable outputs, yet most existing methods require model retraining on specific controls or inference-time c...
LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained contr...
Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment
Although Automatic Speech Recognition (ASR) in Bengali has seen significant progress, processing long-duration audio and performing robust speaker diari...
Make Your LVLM KV Cache More Lightweight
Key-Value (KV) cache has become a de facto component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency...
Many-Tier Instruction Hierarchy in LLM Agents
Large language model agents receive instructions from many sources-system messages, user prompts, tool outputs, and more-each carrying different levels...
Mapping the Methodological Space of Classroom Interaction Research: Scale, Duration, and Modality in an Age of AI
Research on classroom interaction has long been divided between large-scale observation and in-depth ethnographic work. We propose a framework mapping t...
Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (e.g., DeepSeek-R1 acknowledges hints 39% of the time), implying tha...
Memory Caching: RNNs with Growing Memory
Transformers have been established as the de-facto backbones for most recent advances in sequence modeling, mainly due to their growing memory capacity...
Memory Systems: Short-Term and Long-Term
Designing memory systems for LLM agents - from in-context working memory to episodic retrieval, semantic knowledge bases, and procedural memory.
Meritocratic Fairness in Budgeted Combinatorial Multi-armed Bandits via Shapley Values
We propose a new framework for meritocratic fairness in budgeted combinatorial multi-armed bandits with full-bandit feedback (BCMAB-FBF). Unlike semi-ba...
MM-StanceDet: Retrieval-Augmented Multi-modal Multi-agent Stance Detection
Multimodal Stance Detection (MSD) is crucial for understanding public discourse, yet effectively fusing text and image, especially with conflicting sign...
Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs
Multimodal LLMs can process speech and images, but they cannot hear a speaker's voice or see an object's texture. We show this is not a failure of encod...
Model Agreement via Anchoring
Numerous lines of aim to control $ extit{model disagreement}$ -- the extent to which two machine learning models disagree in their predictions. We adop...
MoDora: Tree-Based Semi-Structured Document Analysis System
Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irre...
Module 5: LLM Agents - Overview
LLM agents as autonomous systems that reason, plan, and act using tools, memory, and multi-agent coordination.
MOO: A Multi-view Oriented Observations Dataset for Viewpoint Analysis in Cattle Re-Identification
Animal re-identification (ReID) faces critical challenges due to viewpoint variations, particularly in Aerial-Ground (AG-ReID) settings where models mus...
MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction
With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, pe...
Multi-Agent Architectures
Building systems where multiple specialized LLM agents collaborate through orchestrator-worker, pipeline, and peer-to-peer patterns using LangGraph and CrewAI.
Multimodal Optimal Transport for Unsupervised Temporal Segmentation in Surgical Robotics
Recognizing surgical phases and steps from video is a fundamental problem in computer-assisted interventions. Recent approaches increasingly rely on lar...
MXNorm: Reusing MXFP block scales for efficient tensor normalisation
Matrix multiplication performance has long been the major bottleneck to scaling deep learning workloads, which has stimulated the design of new accelera...
Neuro-Symbolic ODE Discovery with Latent Grammar Flow
Understanding natural and engineered systems often relies on symbolic formulations, such as differential equations, which provide interpretability and t...
NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches
We introduce NOBLE (Nonlinear lOw-rank Branch for Linear Enhancement), an architectural augmentation that adds nonlinear low-rank branches to transforme...
Normativity and Productivism: Ableist Intelligence? A Degrowth Analysis of AI Sign Language Translation Tools for Deaf People
Sign languages, of any geographical or accentual variation, understandably face continuous scrutiny under the ever present popularity of verbal dictatio...
ODEBrain: Continuous-Time EEG Graph for Modeling Dynamic Brain Networks
Modeling neural population dynamics is crucial for foundational neuroscientific research and various clinical applications. Conventional latent variable...
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a 'Visual Signal Dilution' p...
PhyCo: Learning Controllable Physical Priors for Generative Motion
Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebou...
PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization
Recent progress in text-conditioned human motion generation has been largely driven by diffusion models trained on large-scale human motion data. Buildi...
Planning and Reasoning
How LLM agents handle complex multi-step tasks through plan-and-execute, hierarchical planning, self-reflection, and LangGraph-based workflows.
PONTE: Personalized Orchestration for Natural Language Trustworthy Explanations
Explainable Artificial Intelligence (XAI) seeks to enhance the transparency and accountability of machine learning systems, yet most methods follow a on...
Position: agentic AI orchestration should be Bayes-consistent
LLMs excel at predictive tasks and complex reasoning tasks, but many high-value deployments rely on decisions under uncertainty, for example, which tool...
PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction
Three-dimensional medical image data and computer-aided decision making, particularly using deep learning, are becoming increasingly important in the me...
Predictive Coding Graphs are a Superset of Feedforward Neural Networks
Predictive coding graphs (PCGs) are a recently introduced generalization to predictive coding networks, a neuroscience-inspired probabilistic latent var...
Preference Packing: Efficient Preference Optimization for Large Language Models
Resource-efficient training optimization techniques are becoming increasingly important as the size of large language models (LLMs) continues to grow. I...
PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning
The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforc...
Process Reward Agents for Steering Knowledge-Intensive Reasoning
Reasoning in knowledge-intensive domains remains challenging as intermediate steps are often not locally verifiable: unlike math or code, evaluating ste...
Prosodic Boundary-Aware Streaming Generation for LLM-Based TTS with Streaming Text Input
Streaming TTS that receives streaming text is essential for interactive systems, yet this scheme faces two major challenges: unnatural prosody due to mi...
Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody
While second language (L2) learners may acquire target syntactic word order, mapping this syntax onto appropriate prosodic structures remains a persiste...
RAMoEA-QA: Hierarchical Specialization for Robust Respiratory Audio Question Answering
Conversational generative AI is rapidly entering healthcare, where general-purpose models must integrate heterogeneous patient signals and support diver...
RANGER: Sparsely-Gated Mixture-of-Experts with Adaptive Retrieval Re-ranking for Pathology Report Generation
Pathology report generation remains a relatively under-explored downstream task, primarily due to the gigapixel scale and complex morphological heteroge...
Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
Recovering camera parameters from images and rendering scenes from novel viewpoints have long been treated as separate tasks in computer vision and grap...
ReAct Agent Pattern
Building LLM agents that interleave reasoning traces and actions in a ReAct loop to solve multi-step tasks with tool grounding.
ReAct: Synergizing Reasoning and Acting in Language Models
Engineering breakdown of the ReAct paper (Yao et al., 2022) - the foundation of every AI agent built today. Plain English, production viability rating, implementation notes.
RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval
We propose RecaLLM, a set of reasoning language models post-trained to make effective use of long-context information. In-context retrieval, which ident...
Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the complex reasoning capabilities of Large Reaso...
Reflective Context Learning: Studying the Optimization Primitives of Context Space
Generally capable agents must learn from experience in ways that generalize across tasks and environments. The fundamental problems of learning, includi...
Reinforcement Learning with Markov Risk Measures and Multipattern Risk Approximation
For a risk-averse finite-horizon Markov Decision Problem, we introduce a special class of Markov coherent risk measures, called mini-batch measures. We...
Reliability Gated Multi-Teacher Distillation for Low Resource Abstractive Summarization
We study multiteacher knowledge distillation for low resource abstractive summarization from a reliability aware perspective. We introduce EWAD (Entropy...
Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding
Large language models (LLMs) have revolutionized Text-to-SQL generation, allowing users to query structured data using natural language with growing eas...
Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling
Recent research has shown that filtering massive English web corpora into high-quality subsets significantly improves training efficiency. However, for...
Research Roadmap: The Evolution of AI Agents
From Chain-of-Thought to production agent architectures. Read the 9 most important agent papers in order — with full engineering context between each one.
Resilient Strategies for Stochastic Systems: How Much Does It Take to Break a Winning Strategy?
We study the problem of resilient strategies in the presence of uncertainty. Resilient strategies enable an agent to make decisions that are robust agai...
Resources for Automated Evaluation of Assistive RAG Systems that Help Readers with News Trustworthiness Assessment
Many readers today struggle to assess the trustworthiness of online news because reliable reporting coexists with misinformation. The TREC 2025 DRAGUN (...
Rethinking Forward Processes for Score-Based Data Assimilation in High Dimensions
Data assimilation is the process of estimating the time-evolving state of a dynamical system by integrating model predictions and noisy observations. It...
RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models
Reward models are central to aligning large language models (LLMs) with human preferences. Yet most approaches rely on pointwise reward estimates that o...
Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving
With advances in imitation learning (IL) and large-scale driving datasets, end-to-end autonomous driving (E2E-AD) has made great progress recently. Curr...
RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots
Recent advances in robot learning have accelerated progress toward generalist robots that can perform everyday tasks in human environments. Yet it remai...
Robustness of Agentic AI Systems via Adversarially-Aligned Jacobian Regularization
As Large Language Models (LLMs) transition into autonomous multi-agent ecosystems, robust minimax training becomes essential yet remains prone to instab...
SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning
Safety guarantees are a prerequisite to the deployment of reinforcement learning (RL) agents in safety-critical tasks. Often, deployment environments ex...
SafeGen-LLM: Enhancing Safety Generalization in Task Planning for Robotic Systems
Safety-critical task planning in robotic systems remains challenging: classical planners suffer from poor scalability, Reinforcement Learning (RL)-based...
SafeMind: A Risk-Aware Differentiable Control Framework for Adaptive and Safe Quadruped Locomotion
Learning-based quadruped controllers achieve impressive agility but typically lack formal safety guarantees under model uncertainty, perception noise, a...
SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement
Recursive self-improvement is moving from theory to practice: modern systems can critique, revise, and evaluate their own outputs, yet iterative self-mo...
Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments
Large-scale commercial search systems optimize for relevance to drive successful sessions that help users find what they are looking for. To maximize re...
Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise
Prompt learning is a parameter-efficient approach for vision-language models, yet its robustness under label noise is less investigated. Visual content...
SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
We identify occlusion reasoning as a fundamental yet overlooked aspect for 3D layout-conditioned generation. It is essential for synthesizing partially...
Semantic Invariance in Agentic AI
Large Language Models (LLMs) increasingly serve as autonomous reasoning agents in decision support, scientific problem-solving, and multi-agent coordina...
Semantic Rate-Distortion for Bounded Multi-Agent Communication: Capacity-Derived Semantic Spaces and the Communication Cost of Alignment
When two agents of different computational capacities interact with the same environment, they need not compress a common semantic alphabet differently;...
Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models
Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks. However, the truthfulness of their outputs is not guarantee...
SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport
The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the wor...
SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and exec...
Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents
Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redund...
Spectral Alignment in Forward-Backward Representations via Temporal Abstraction
Forward-backward (FB) representations provide a powerful framework for learning the successor representation (SR) in continuous spaces by enforcing a lo...
Splitting Argumentation Frameworks with Collective Attacks and Supports
This work proposes novel splitting techniques for argumentation formalisms that incorporate supports between defeasible elements. We base our studies on...
SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints
We present SpotIt+, an open-source tool for evaluating Text-to-SQL systems via bounded equivalence verification. Given a generated SQL query and the gro...
SPRINT: Semi-supervised Prototypical Representation for Few-Shot Class-Incremental Tabular Learning
Real-world systems must continuously adapt to novel concepts from limited data without forgetting previously acquired knowledge. While Few-Shot Class-In...
Steve-Evolving: Open-World Embodied Self-Evolution via Fine-Grained Diagnosis and Dual-Track Knowledge Distillation
Open-world embodied agents must solve long-horizon tasks where the main bottleneck is not single-step planning quality but how interaction experience is...
Strategic Algorithmic Monoculture:Experimental Evidence from Coordination Games
AI agents increasingly operate in multi-agent environments where outcomes depend on coordination. We distinguish primary algorithmic monoculture -- base...
Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation
Long conversations with an AI agent create a simple problem for one user: the history is useful, but carrying it verbatim is expensive. We study persona...
Stylistic-STORM (ST-STORM) : Perceiving the Semantic Nature of Appearance
One of the dominant paradigms in self-supervised learning (SSL), illustrated by MoCo or DINO, aims to produce robust representations by capturing featur...
SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning
Surgeons don't just see -- they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it...
SurvHTE-Bench: A Benchmark for Heterogeneous Treatment Effect Estimation in Survival Analysis
Estimating heterogeneous treatment effects (HTEs) from right-censored survival data is critical in high-stakes applications such as precision medicine a...
Synthetic Computers at Scale for Long-Horizon Productivity Simulation
Realistic long-horizon productivity work is strongly conditioned on user-specific computer environments, where much of the work context is stored and or...
Synthetic data in cryptocurrencies using generative models
Data plays a fundamental role in consolidating markets, services, and products in the digital financial ecosystem. However, the use of real data, especi...
Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation
Modern optimizers like Adam and Muon are central to training large language models, but their reliance on first- and second-order momenta introduces sig...
Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis
Large language models (LLMs) with reasoning capabilities have fueled a compelling narrative that reasoning universally improves performance across langu...
Task-Centric Acceleration of Small-Language Models
Small language models (SLMs) have emerged as efficient alternatives to large language models for task-specific applications. However, they are often emp...
Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek
This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT) for Ancient Greek (AG)...
The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-Calculus
LLMs are increasingly used as general-purpose reasoners, but long inputs remain bottlenecked by a fixed context window. Recursive Language Models (RLMs)...
The EpisTwin: A Knowledge Graph-Grounded Neuro-Symbolic Architecture for Personal AI
Personal Artificial Intelligence is currently hindered by the fragmentation of user data across isolated silos. While Retrieval-Augmented Generation off...
The logic of KM belief update is contained in the logic of AGM belief revision
For each axiom of KM belief update we provide a corresponding axiom in a modal logic containing three modal operators: a unimodal belief operator $B$, a...
The Robot's Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning
Conventional robot social behavior generation has been limited in flexibility and autonomy, relying on predefined motions or human feedback. This study...
To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling
Agentic AI architectures augment LLMs with external tools, unlocking strong capabilities. However, tool use is not always beneficial; some calls may be...
Tool Use and Function Calling
Enabling LLMs to invoke external tools and APIs through structured function calling, covering JSON schema design, Anthropic vs OpenAI formats, parallel tool calls, and production safety.
Tool Use from Python
Building LLM tool use systems in Python -- function calling, tool schemas, execution loops, error handling, and multi-step agent patterns.
TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering
Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by extracting information or simple aggregation....
Toward Expert Investment Teams:A Multi-Agent LLM System with Fine-Grained Trading Tasks
The advancement of large language models (LLMs) has accelerated the development of autonomous financial trading systems. While mainstream approaches dep...
Toward Guarantees for Clinical Reasoning in Vision Language Models via Formal Verification
Vision-language models (VLMs) show promise in drafting radiology reports, yet they frequently suffer from logical inconsistencies, generating diagnostic...
Towards Improving Speaker Distance Estimation through Generative Impulse Response Augmentation
The Room Acoustics and Speaker Distance Estimation (SDE) Challenge at ICASSP 2025 explores the effectiveness of augmented room impulse response (RIR) da...
U-Cast: A Surprisingly Simple and Efficient Frontier Probabilistic AI Weather Forecaster
AI-based weather forecasting now rivals traditional physics-based ensembles, but state-of-the-art (SOTA) models rely on specialized architectures and ma...
Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume
Despite their capabilities, Multimodal Large Language Models (MLLMs) may produce plausible but erroneous outputs, hindering reliable deployment. Accurat...
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
The recent success of reinforcement learning (RL) in large reasoning models has inspired the growing adoption of RL for post-training Multimodal Large L...
Understanding Usage and Engagement in AI-Powered Scientific Research Tools: The Asta Interaction Dataset
AI-powered scientific research tools are rapidly being integrated into research workflows, yet the field lacks a clear lens into how researchers use the...
Unsupervised Denoising of Real Clinical Low Dose Liver CT with Perceptual Attention Networks
With the development of deep learning, medical image processing has been widely used to assist clinical research. This paper focuses on the denoising pr...
Using Large Language Models and Knowledge Graphs to Improve the Interpretability of Machine Learning Models in Manufacturing
Explaining Machine Learning (ML) results in a transparent and user-friendly manner remains a challenging task of Explainable Artificial Intelligence (XA...
Utilizing LLMs for Industrial Process Automation
A growing number of publications address the best practices to use Large Language Models (LLMs) for software engineering in recent years. However, most...
Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control
We present a method to identify a valence-arousal (VA) subspace within large language model representations. From 211k emotion-labeled texts, we derive...
Var-JEPA: A Variational Formulation of the Joint-Embedding Predictive Architecture -- Bridging Predictive and Generative Self-Supervised Learning
The Joint-Embedding Predictive Architecture (JEPA) is often seen as a non-generative alternative to likelihood-based self-supervised learning, emphasizi...
VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still heavily rely on greedy parsing over densely...
VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images
Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contrib...
VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning
Visual Retrieval-Augmented Generation (VRAG) empowers Vision-Language Models to retrieve and reason over visually rich documents. To tackle complex quer...
Visual-ERM: Reward Modeling for Visual Equivalence
Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representat...
VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning
Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certain...
What Does Flow Matching Bring To TD Learning?
Recent work shows that flow matching can be effective for scalar Q-value function estimation in reinforcement learning (RL), but it remains unclear why...
When RAG Chatbots Expose Their Backend: An Anonymized Case Study of Privacy and Security Risks in Patient-Facing Medical AI
Background: Patient-facing medical chatbots based on retrieval-augmented generation (RAG) are increasingly promoted to deliver accessible, grounded heal...
When Right Meets Wrong: Bilateral Context Conditioning with Reward-Confidence Correction for GRPO
Group Relative Policy Optimization (GRPO) has emerged as an effective method for training reasoning models. While it computes advantages based on group...
Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-righ...
World Properties without World Models: Recovering Spatial and Temporal Structure from Co-occurrence Statistics in Static Word Embeddings
Recent work interprets the linear recoverability of geographic and temporal variables from large language model (LLM) hidden states as evidence for worl...
XFED: Non-Collusive Model Poisoning Attack Against Byzantine-Robust Federated Classifiers
Model poisoning attacks pose a significant security threat to Federated Learning (FL). Most existing model poisoning attacks rely on collusion, requirin...
ZipMap: Linear-Time Stateful 3D Reconstruction with Test-Time Training
Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and $π^3$ have a computational cost t...