345 docs tagged with "nlp"

$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners

Test-time scaling for complex reasoning tasks shows that leveraging inference-time compute, by methods such as independently sampling and aggregating mu...

$τ$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge

Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on retrieving and applying domain-specif...

A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms....

A Learning-based Multi-Frame Visual Feature Framework for Real-Time Driver Fatigue Detection.

A Learning-based Multi-Frame Visual Feature Framewor... - published at NAACL 2025.

A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effe...

A Novel Hierarchical Multi-Agent System for Payments Using LLMs

Large language model (LLM) agents, such as OpenAI's Operator and Claude's Computer Use, can automate workflows but are unable to handle payment tasks. Exist...

A Practical Analysis of Human Alignment with *PO.

A Practical Analysis of Human Alignment with *PO. - published at NAACL 2025.

A Semantic-Aware Layer-Freezing Approach to Computation-Efficient Fine-Tuning of Language Models.

A Semantic-Aware Layer-Freezing Approach to Computat... - published at ACL 2025.

A Training-free LLM-based Approach to General Chinese Character Error Correction.

A Training-free LLM-based Approach to General Chines... - published at ACL 2025.

Abductive Reasoning with Syllogistic Forms in Large Language Models

Research in AI using Large-Language Models (LLMs) is rapidly evolving, and the comparison of their performance with human reasoning has become a key con...

AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research.

AbGen: Evaluating Large Language Models in Ablation... - published at ACL 2025.

Across the Levels of Analysis: Explaining Predictive Processing in Humans Requires More Than Machine-Estimated Probabilities

Under the lens of Marr's levels of analysis, we critique and extend two claims about language models (LMs) and language processing: first, that predicti...

Active Few-Shot Learning for Text Classification.

How to intelligently select which examples to annotate when you only have a handful of labeled samples per class. Combines active learning with few-shot text classification to minimize annotation cost - directly applicable to intent detection, content moderation, and domain-specific NLP tasks.

Adaptive Greedy Frame Selection for Long Video Understanding

Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of inp...

Adaptive Querying with AI Persona Priors

We study adaptive querying for learning user-dependent quantities of interest, such as responses to held-out items and psychometric indicators, within t...

Advancing Language Models through Instruction Tuning: Recent Progress and Challenges.

Advancing Language Models through Instruction Tuning... - published at EMNLP 2025.

AERA Chat: An Interactive Platform for Automated Explainable Student Answer Assessment.

AERA Chat: An Interactive Platform for Automated Exp... - published at EMNLP 2025.

Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention

Transformer attention is typically implemented using softmax normalization, which enforces attention weights with unit sum normalization. While effectiv...

AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning.

AgentCPM-GUI: Building Mobile-Use Agents with Reinfo... - published at EMNLP 2025.

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual parti...

Agentic Jackal: Live Execution and Semantic Value Grounding for Text-to-JQL

Translating natural language into Jira Query Language (JQL) requires resolving ambiguous field references, instance-specific categorical values, and com...

AgenticOCR: Parsing Only What You Need for Efficient Retrieval-Augmented Generation

The expansion of retrieval-augmented generation (RAG) into multimodal domains has intensified the challenge for processing complex visual documents, suc...

AgentIR: Reasoning-Aware Retrieval for Deep Research Agents

Deep Research agents are rapidly emerging as primary consumers of modern retrieval systems. Unlike human users who issue and refine queries without docu...

AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning

We present a winning three-stage system for SemEval 2026 Task 12: Abductive Event Reasoning that combines graph-based retrieval, LLM-driven abductive re...

AIPOM: Agent-aware Interactive Planning for Multi-Agent Systems.

AIPOM: Agent-aware Interactive Planning for Multi-Ag... - published at EMNLP 2025.

Aligning What LLMs Do and Say: Towards Self-Consistent Explanations.

Aligning What LLMs Do and Say: Towards Self-Consiste... - published at ACL 2026.

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models.

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Ben... - published at ACL 2026.

An Address Intelligence Framework for E-commerce Deliveries.

An Address Intelligence Framework for E-commerce Del... - published at EMNLP 2025.

An Agentic Approach to Generating XAI-Narratives

Explainable AI (XAI) research has experienced substantial growth in recent years. Existing XAI methods, however, have been criticized for being technica...

An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models

Direct Preference Optimization (DPO) is widely used after supervised fine-tuning (SFT) to align language models, yet empirical behavior under small back...

An Independent Safety Evaluation of Kimi K2.5

Kimi K2.5 is an open-weight LLM that rivals closed models across coding, multimodal, and agentic benchmarks, but was released without an accompanying sa...

Analysing LLM Persona Generation and Fairness Interpretation in Polarised Geopolitical Contexts.

Analysing LLM Persona Generation and Fairness Interp... - published at EACL 2026.

Are Full Rollouts Necessary for On-Policy Distillation?

On-policy distillation (OPD) provides dense teacher feedback along rollouts generated by the student and has emerged as a promising post-training paradi...

ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models

Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with...

Argumentation and Judgement Factors: LLM-based Discovery and Application in Insurance Disputes.

Argumentation and Judgement Factors: LLM-based Disco... - published at EACL 2026.

ARGUS: Seeing the Influence of Narrative Features on Persuasion in Argumentative Texts

Can narratives make arguments more persuasive? And to this end, which narrative features matter most? Although stories are often seen as powerful tools...

ASRank: Zero-Shot Re-Ranking with Answer Scent for Document Retrieval.

ASRank: Zero-Shot Re-Ranking with Answer Scent for D... - published at NAACL 2025.

Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent

The rapid advancement of large language models (LLMs) has enabled powerful authorship inference capabilities, raising growing concerns about unintended...

AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency

Large language models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex tasks. Yet ensuring that the reasoning trace both co...

Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM

This paper studies Automated Instruction Revision (AIR), a rule-induction-based method for adapting large language models (LLMs) to downstream tasks usi...

Automatically Discovering How Misogyny is Framed on Social Media.

Automatically Discovering How Misogyny is Framed on... - published at NAACL 2025.

AUTOSUMM: A Comprehensive Framework for LLM-Based Conversation Summarization.

AUTOSUMM: A Comprehensive Framework for LLM-Based Co... - published at ACL 2025.

BAGEL: Benchmarking Animal Knowledge Expertise in Language Models

Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well language models...

BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

Large language models (LLMs) often produce confident but incorrect answers in settings where abstention would be safer. Standard evaluation protocols, h...

Benchmarking and Building Zero-Shot Hindi Retrieval Model with Hindi-BEIR and NLLB-E5.

Benchmarking and Building Zero-Shot Hindi Retrieval... - published at NAACL 2025.

BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali

Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LL...

BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation

Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In...

Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback.

Beyond "Not Novel Enough": Enriching Schol... - published at EACL 2026.

Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

Large language models (LLMs) are becoming increasingly capable mathematical collaborators, but static benchmarks are no longer sufficient for evaluating...

Beyond Grid Search: Leveraging Bayesian Optimization for Accelerating RAG Pipeline Optimization.

Beyond Grid Search: Leveraging Bayesian Optimization... - published at EACL 2026.

Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation

Evaluating the factuality of long-form output generated by large language models (LLMs) remains challenging, particularly when responses are open-ended...

Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing

Recent advances in multimodal Retrieval-Augmented Generation (RAG) enable Large Language Models (LLMs) to analyze enterprise spreadsheet workbooks conta...

Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations

Large language models are increasingly deployed in settings where reliability matters, yet output-level uncertainty signals such as token probabilities,...

Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation

Large language models (LLMs) encode vast world knowledge in their parameters, yet they remain fundamentally limited by static knowledge, finite context...

BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation

Large language models with web search are increasingly used in scientific publishing agents, yet they still produce BibTeX entries with pervasive field-...

BOOKCOREF: Coreference Resolution at Book Scale.

BOOKCOREF: Coreference Resolution at Book Scale. - published at ACL 2025.

BracketRank: Large Language Model Document Ranking via Reasoning-based Competitive Elimination.

BracketRank: Large Language Model Document Ranking v... - published at ACL 2026.

Bridging Attribution and Open-Set Detection using Graph-Augmented Instance Learning in Synthetic Speech.

Bridging Attribution and Open-Set Detection using Gr... - published at EACL 2026.

Can Coding Agents Reproduce Findings in Computational Materials Science?

Large language models are increasingly deployed as autonomous coding agents and have achieved remarkably strong performance on software engineering benc...

Can LLMs Understand the Impact of Trauma? Costs and Benefits of LLMs Coding the Interviews of Firearm Violence Survivors

Firearm violence is a pressing public health issue, yet research into survivors' lived experiences remains underfunded and difficult to scale. Qualitati...

Cards Against Contamination: TCG-Bench for Difficulty-Scalable Multilingual LLM Reasoning.

Cards Against Contamination: TCG-Bench for Difficult... - published at EACL 2026.

Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision

Evidence-grounded reasoning requires more than attaching retrieved text to a prediction: a model should make decisions that depend on whether the provid...

Causality Elicitation from Large Language Models

Large language models (LLMs) are trained on enormous amounts of data and encode knowledge in their parameters. We propose a pipeline to elicit causal re...

CFSP: An Efficient Structured Pruning Framework for LLMs with Coarse-to-Fine Activation Information.

CFSP: An Efficient Structured Pruning Framework for... - published at COLING 2025.

Characterizing the Expressivity of Local Attention in Transformers

The transformer is the most popular neural architecture for language modeling. The cornerstone of the transformer is its global attention mechanism, whi...

CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery

Large language models (LLMs) have created new opportunities to enhance the efficiency of scholarly activities; however, challenges persist in the ethica...

CLARIN-PT-LDB: An Open LLM Leaderboard for Portuguese to assess Language, Culture and Civility

This paper reports on the development of a leaderboard of Open Large Language Models (LLM) for European Portuguese (PT-PT), and on its associated benchm...

Clinical NLP and EHR Systems

Building NLP pipelines on Electronic Health Records - named entity recognition for clinical text, negation detection, de-identification for HIPAA compliance, and fine-tuning BERT variants on medical corpora.

Code Fingerprints: Disentangled Attribution of LLM-Generated Code

The rapid adoption of Large Language Models (LLMs) has transformed modern software development by enabling automated code generation at scale. While the...

CodeGenWrangler: Data Wrangling task automation using Code-Generating Models.

CodeGenWrangler: Data Wrangling task automation usin... - published at NAACL 2025.

CodeTaxo: Enhancing Taxonomy Expansion with Limited Examples via Code Language Prompts.

CodeTaxo: Enhancing Taxonomy Expansion with Limited... - published at ACL 2025.

Cognitive Kernel: An Open-source Agent System towards Generalist Autopilots.

Cognitive Kernel: An Open-source Agent System toward... - published at NAACL 2025.

COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics

Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches face a funda...

CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning

Mobile Agents can autonomously execute user instructions, which requires hybrid-capabilities reasoning, including screen summary, subtask planning, acti...

Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations.

Compress to Impress: Unleashing the Potential of Com... - published at COLING 2025.

Consolidating Rewarded Perturbations for LLM Post-Training

Post-training of language models is commonly framed as a sample-score-update loop implemented by gradient descent. A recent line of work, exemplified by...

Continual Adaptation for Pacific Indigenous Speech Recognition

Speech foundation models struggle with low-resource Pacific Indigenous languages because of severe data scarcity. Furthermore, full fine-tuning risks ca...

Contract Analysis and NLP

Clause extraction, obligation detection, risk identification, and building NLP systems for commercial contract analysis at law firm and enterprise scale.

Controllable Reasoning Models Are Private Thinkers

AI agents powered by reasoning models require access to sensitive user data. However, their reasoning traces are difficult to control, which can result...

Controllable Style Arithmetic with Language Models.

Controllable Style Arithmetic with Language Models. - published at ACL 2025.

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on kno...

Craw4LLM: Efficient Web Crawling for LLM Pretraining.

Craw4LLM: Efficient Web Crawling for LLM Pretraining. - published at ACL 2025.

CUFE@NLU of Devanagari Script Languages 2025: Language Identification using fastText.

CUFE@NLU of Devanagari Script Languages 2025: Langua... - published at COLING 2025.

CUFE@VarDial 2025 NorSID: Multilingual BERT for Norwegian Dialect Identification and Intent Detection.

CUFE@VarDial 2025 NorSID: Multilingual BERT for Norw... - published at COLING 2025.

Current LLMs still cannot 'talk much' about grammar modules: Evidence from syntax

We aim to examine the extent to which Large Language Models (LLMs) can 'talk much' about grammar modules, providing evidence from syntax core properties...

DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benc...

DASR: Distributed Adaptive Scene Recognition - A Multi-Agent Cloud-Edge Framework for Language-Guided Scene Detection.

DASR: Distributed Adaptive Scene Recognition - A Mul... - published at EMNLP 2025.

Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving

Large Language Model (LLM) adapters enable low-cost model specialization, but introduce complex caching and scheduling challenges in distributed serving...

daVinci-Env: Open SWE Environment Synthesis at Scale

Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for...

DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling.

DEMO: Reframing Dialogue Interaction with Fine-grain... - published at ACL 2025.

Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents

Large language models and deep research agents supply citation URLs to support their claims, yet the reliability of these citations has not been systema...

Detecting and Suppressing Reward Hacking with Gradient Fingerprints

Reinforcement learning with verifiable rewards (RLVR) typically optimizes for outcome rewards without imposing constraints on intermediate reasoning. Th...

Developing and evaluating a chatbot to support maternal health care

The ability to provide trustworthy maternal health information using phone-based chatbots can have a significant impact, particularly in low-resource se...

Different Time, Different Language: Revisiting the Bias Against Non-Native Speakers in GPT Detectors.

Different Time, Different Language: Revisiting the B... - published at EACL 2026.

Directed Social Regard: Surfacing Targeted Advocacy, Opposition, Aid, Harms, and Victimization in Online Media

The language in online platforms, influence operations, and political rhetoric frequently directs a mix of pro-social sentiment (e.g., advocacy, helpful...

Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection

Human disagreement is ubiquitous and well-known in labeling. However, variation in explanations, captured through token-level human rationales, remains...

Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems

Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems. Conventional ASR-LLM-TTS pipelines follow a...

DIVINE : Coordinating Multimodal Disentangled Representations for Oro-Facial Neurological Disorder Assessment.

DIVINE: Coordinating Multimodal Disentangled Repres... - published at EACL 2026.

Do Image-Text Metrics Respect Semantic Invariances?

Do Image-Text Metrics Respect Semantic Invariances? - published at ACL 2026.

Do LLMs Benefit From Their Own Words?

Multi-turn interactions with large language models typically retain the assistant's own past responses in the conversation history. In this work, we rev...

Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

Reasoning in vision-language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks....

Does Generative AI speak Nigerian-Pidgin?: Issues about Representativeness and Bias for Multilingualism in LLMs.

Does Generative AI speak Nigerian-Pidgin?: Issues ab... - published at NAACL 2025.

Does RAG Introduce Unfairness in LLMs? Evaluating Fairness in Retrieval-Augmented Generation Systems.

Does RAG Introduce Unfairness in LLMs? Evaluating Fa... - published at COLING 2025.

Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts

Automated annotation of pedagogical dialogue is a high-stakes task where LLMs often fail without sufficient domain grounding. We present a domain-adapte...

Driving Chinese Spelling Correction from a Fine-Grained Perspective.

Driving Chinese Spelling Correction from a Fine-Grai... - published at COLING 2025.

DS$^2$-Instruct: Domain-Specific Data Synthesis for Large Language Models Instruction Tuning

Adapting Large Language Models (LLMs) to specialized domains requires high-quality instruction tuning datasets, which are expensive to create through hu...

Dual Debiasing for Noisy In-Context Learning for Text Generation.

Dual Debiasing for Noisy In-Context Learning for Tex... - published at ACL 2025.

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-st...

Ease of dependency distance minimization in star-like structures

The syntactic structure of a sentence can be represented as a tree where edges indicate syntactic dependencies between words. When that structure is a s...

EasyDistill: A Comprehensive Toolkit for Effective Knowledge Distillation of Large Language Models.

EasyDistill: A Comprehensive Toolkit for Effective K... - published at EMNLP 2025.

Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers.

Efficiency-Effectiveness Reranking FLOPs for LLM-bas... - published at EMNLP 2025.

EGREFINE: An Execution-Grounded Optimization Framework for Text-to-SQL Schema Refinement

Text-to-SQL enables non-expert users to query databases in natural language, yet real-world schemas often suffer from ambiguous, abbreviated, or inconsi...

Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs.

Emergent Misalignment via In-Context Learning: Narro... - published at ACL 2026.

Empathy Prediction from Diverse Perspectives.

Empathy Prediction from Diverse Perspectives. - published at ACL 2025.

Enhancing Hyperspace Analogue to Language (HAL) Representations via Attention-Based Pooling for Text Classification

The Hyperspace Analogue to Language (HAL) model relies on global word co-occurrence matrices to construct distributional semantic representations. While...

Enhancing Open-Domain Task-Solving Capability of LLMs via Autonomous Tool Integration from GitHub.

Enhancing Open-Domain Task-Solving Capability of LLM... - published at ACL 2025.

Enhancing Reliability in Community Question Answering with an Expert-Oriented RAG System.

Enhancing Reliability in Community Question Answerin... - published at EACL 2026.

EnsemW2S: Enhancing Weak-to-Strong Generalization with Large Language Model Ensembles.

EnsemW2S: Enhancing Weak-to-Strong Generalization wi... - published at ACL 2026.

ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal requiremen...

Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models

In contested domains, instruction-tuned language models must balance user-alignment pressures against faithfulness to the in-context evidence. To evalua...

Evaluating Factual Density in Multi-Source RAG: A Study in Medical AI Accuracy

Retrieval-Augmented Generation (RAG) is the current industry standard for grounding AI in real-world facts. Traditional retrieval methods rely on keywor...

Evaluation of Deontic Conditional Reasoning in Large Language Models: The Case of Wason's Selection Task

As large language models (LLMs) advance in linguistic competence, their reasoning abilities are gaining increasing attention. In humans, reasoning often...

Evaluation of Deontic Conditional Reasoning in Large Language Models: The Case of Wason's Selection Task.

Evaluation of Deontic Conditional Reasoning in Large... - published at EACL 2026.

Explicit Trait Inference for Multi-Agent Coordination.

Explicit Trait Inference for Multi-Agent Coordination. - published at ACL 2026.

Exploration Hacking: Can LLMs Learn to Resist RL Training?

Reinforcement learning (RL) has become essential to the post-training of large language models (LLMs) for reasoning, agentic capabilities and alignment....

Exploring Two-Phase Continual Instruction Fine-tuning for Multilingual Adaptation in Large Language Models.

Exploring Two-Phase Continual Instruction Fine-tunin... - published at ACL 2026.

FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models.

FACT-AUDIT: An Adaptive Multi-Agent Framework for Dy... - published at ACL 2025.

False Friends or Cognates? A Cross-lingual Semantic Ambiguity Evaluation for Galician, Portuguese and Spanish.

False Friends or Cognates? A Cross-lingual Semantic... - published at ACL 2026.

FedMental: Evaluating Federated Learning for Mental Health Detection from Social Media Data.

FedMental: Evaluating Federated Learning for Mental... - published at ACL 2026.

Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations....

FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios

Large language models (LLMs) are increasingly applied in financial scenarios. However, they may produce harmful outputs, including facilitating illegal...

Frequency-Ordered Tokenization for Better Text Compression

We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequenc...

From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text

The complexity of Vietnam's legal texts presents a significant barrier to public access to justice. While Large Language Models offer a promising soluti...

From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes.

From Feedback to Checklists: Grounded Evaluation of... - published at EMNLP 2025.

From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding.

From Long Videos to Engaging Clips: A Human-Inspired... - published at EMNLP 2025.

From Paper to Structured JSON: An Agentic AI Workflow for Compliant BMR Digital Transformation.

From Paper to Structured JSON: An Agentic AI Workflo... - published at EACL 2026.

From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring

Large language models (LLMs) have recently reshaped Automated Essay Scoring (AES), yet prior studies typically examine individual techniques in isolatio...

From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

Reinforcement learning (RL) for large language models (LLMs) increasingly relies on sparse, outcome-level rewards, yet determining which actions withi...

GADFA: Generator-Assisted Decision-Focused Approach for Opinion Expressing Timing Identification.

GADFA: Generator-Assisted Decision-Focused Approach... - published at COLING 2025.

Generating Multi-Aspect Queries for Conversational Search.

Generating Multi-Aspect Queries for Conversational S... - published at EACL 2026.

Goal-Driven Data Story, Narrations and Explanations.

Goal-Driven Data Story, Narrations and Explanations. - published at NAACL 2025.

GRAM: Generative Recommendation via Semantic-aware Multi-granular Late Fusion.

GRAM: Generative Recommendation via Semantic-aware M... - published at ACL 2025.

H-RAG at SemEval-2026 Task 8: Hierarchical Parent-Child Retrieval for Multi-Turn RAG Conversations

We present H-RAG, our submission to SemEval-2026 Task 8 (MTRAGEval), addressing both Task A (Retrieval) and Task C (Generation with Retrieved Passages)....

H-STAR: LLM-driven Hybrid SQL-Text Adaptive Reasoning on Tables.

H-STAR: LLM-driven Hybrid SQL-Text Adaptive Reasonin... - published at NAACL 2025.

HMS-BERT: Hybrid Multi-Task Self-Training for Multilingual and Multi-Label Cyberbullying Detection

Cyberbullying on social media is inherently multilingual and multi-faceted, where abusive behaviors often overlap across multiple categories. Existing m...

How Context Shapes Truth: Geometric Transformations of Statement-level Truth Representations in LLMs.

How Context Shapes Truth: Geometric Transformations... - published at ACL 2026.

How Credible Is an Answer From Retrieval-Augmented LLMs? Investigation and Evaluation With Multi-Hop QA.

How Credible Is an Answer From Retrieval-Augmented L... - published at COLING 2025.

Hybrid Graphs for Table-and-Text based Question Answering using LLMs.

Hybrid Graphs for Table-and-Text based Question Answ... - published at NAACL 2025.

I know you are different! Towards Persona Driven Knowledge-infused Dialogue Assistant.

I know you are different! Towards Persona Driven Kno... - published at EACL 2026.

If LLMs Have Human-Like Attributes, Then So Does Age of Empires II

Much research has been carried out on large language models (LLMs) and LLM-powered agentic workflows. However, many works within the field state emergen...

InCoder-32B-Thinking: Industrial Code World Model for Thinking

Industrial software development across chip design, GPU optimization, and embedded systems lacks expert reasoning traces showing how engineers reason ab...

InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

Reducing the hardware footprint of large language models (LLMs) during decoding is critical for efficient long-sequence generation. A key bottleneck is...

Interpretable Semantic Gradients in SSD: A PCA Sweep Approach and a Case Study on AI Discourse

Supervised Semantic Differential (SSD) is a mixed quantitative-interpretive method that models how text meaning varies with continuous individual-differ...

InTriage: Intelligent Telephone Triage in Pre-Hospital Emergency Care.

InTriage: Intelligent Telephone Triage in Pre-Hospit... - published at EMNLP 2025.

IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models.

IrokoBench: A New Benchmark for African Languages in... - published at NAACL 2025.

Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation

Error Span Detection (ESD) is a crucial subtask in Machine Translation (MT) evaluation, aiming to identify the location and severity of translation erro...

Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder

Training Transformer language models is expensive, as performance typically improves with increasing dataset size and computational budget. Although sca...

JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models

Adapter-based methods have become a cost-effective approach to continual learning (CL) for Large Language Models (LLMs), by sequentially learning a low-...

KCLarity at SemEval-2026 Task 6: Encoder and Zero-Shot Approaches to Political Evasion Detection

This paper describes the KCLarity team's participation in CLARITY, a shared task at SemEval 2026 on classifying ambiguity and evasion techniques in poli...

Language Models Learn Constructional Semantics, Not To Mention Syntax: Investigating LM Understanding of Paired-Focus Constructions

Grasping the semantics of rare constructions (form-meaning pairings) has been shown to be a challenging problem that has currently only been solved by t...

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely by...

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-...

Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning

Latent reasoning offers a more efficient alternative to explicit reasoning by compressing intermediate reasoning into continuous representations and sub...

Learning from Child-Directed Speech in Two-Language Scenarios: A French-English Case Study

Research on developmentally plausible language models has largely focused on English, leaving open questions about multilingual settings. We present a s...

Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

Large language model (LLM) agents require long-term user memory for consistent personalization, but limited context windows hinder tracking evolving pre...

Learning the Signature of Memorization in Autoregressive Language Models

All prior membership inference attacks for fine-tuned language models use hand-crafted heuristics (e.g., loss thresholding, Min-K%, reference calibrati...

Learning to Reason with Insight for Informal Theorem Proving

Although most of the automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with large language...

LEMUR: Robust Fine-Tuning for Multilingual Embedding Models for Retrieval.

LEMUR: Robust Fine-Tuning for Multilingual Embedding... - published at EACL 2026.

Leveraging Language-based Representations for Better Solving Symbol-related Problems with Large Language Models.

Leveraging Language-based Representations for Better... - published at COLING 2025.

Leveraging LLM-GNN Integration for Open-World Question Answering over Knowledge Graphs.

Leveraging LLM-GNN Integration for Open-World Questi... - published at EACL 2026.

Like a Therapist, But Not: Reddit Narratives of AI in Mental Health Contexts.

Like a Therapist, But Not: Reddit Narratives of AI i... - published at ACL 2026.

LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable hu...

LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models.

LLM-Coordination: Evaluating and Analyzing Multi-age... - published at NAACL 2025.

LLMInit: A Free Lunch from Large Language Models for Selective Initialization of Recommendation.

LLMInit: A Free Lunch from Large Language Models for... - published at EMNLP 2025.

LoASR-Bench: Evaluating Large Speech Language Models on Low-Resource Automatic Speech Recognition Across Language Families

Large language models (LLMs) have driven substantial advances in speech language models (SpeechLMs), yielding strong performance in automatic speech rec...

Loki: An Open-Source Tool for Fact Verification.

Loki: An Open-Source Tool for Fact Verification. - published at COLING 2025.

Long-form RewardBench: Evaluating Reward Models for Long-form Generation

The widespread adoption of reinforcement learning-based alignment highlights the growing importance of reward models. Various benchmarks have been built...

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive dist...

MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events.

MADE: A Living Benchmark for Multi-Label Text Classi... - published at ACL 2026.

Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment

Although Automatic Speech Recognition (ASR) in Bengali has seen significant progress, processing long-duration audio and performing robust speaker diari...

Many Ways to Be Fake: Benchmarking Fake News Detection Under Strategy-Driven AI Generation

Recent advances in large language models (LLMs) have enabled the large-scale generation of highly fluent and deceptive news-like content. While prior wo...

Many-Tier Instruction Hierarchy in LLM Agents

Large language model agents receive instructions from many sources (system messages, user prompts, tool outputs, and more), each carrying different levels...

Mapping the Methodological Space of Classroom Interaction Research: Scale, Duration, and Modality in an Age of AI

Research on classroom interaction has long been divided between large-scale observation and in-depth ethnographic work. We propose a framework mapping t...

McMining: Automated Discovery of Misconceptions in Student Code.

McMining: Automated Discovery of Misconceptions in S... - published at EACL 2026.

MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models.

MCPEval: Automatic MCP-based Deep Evaluation for AI... - published at EMNLP 2025.

Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (e.g., DeepSeek-R1 acknowledges hints 39% of the time), implying tha...

Measuring research data reuse in scholarly publications using generative artificial intelligence: Open Science Indicator development and preliminary results

Numerous metascience studies and other initiatives have begun to monitor the prevalence of open science practices when it is more important to understan...

Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

Large language model (LLM) agents are fundamentally bottlenecked by finite context windows on long-horizon tasks. As trajectories grow, retaining tool o...

Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation

Large Language Models (LLMs) have demonstrated remarkable capability in machine translation on high-resource language pairs, yet their performance on lo...

Meta-Reasoning Improves Tool Use in Large Language Models.

Meta-Reasoning Improves Tool Use in Large Language M... - published at NAACL 2025.

Mind the Gap: Pitfalls of LLM Alignment with Asian Public Opinion

Large Language Models (LLMs) are increasingly being deployed in multilingual, multicultural settings, yet their reliance on predominantly English-centri...

Mirror in the Model: Ad Banner Image Generation via Reflective Multi-LLM and Multi-modal Agents.

Mirror in the Model: Ad Banner Image Generation via... - published at EMNLP 2025.

Mitigating Copy Bias in In-Context Learning through Neuron Pruning.

Mitigating Copy Bias in In-Context Learning through... - published at EACL 2026.

ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models

As Large Language Models (LLMs) are increasingly deployed in cross-linguistic contexts, ensuring safety in diverse regulatory and cultural environments...

MM-StanceDet: Retrieval-Augmented Multi-modal Multi-agent Stance Detection

Multimodal Stance Detection (MSD) is crucial for understanding public discourse, yet effectively fusing text and image, especially with conflicting sign...

Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs

Multimodal LLMs can process speech and images, but they cannot hear a speaker's voice or see an object's texture. We show this is not a failure of encod...

Models Recall What They Violate: Constraint Adherence in Multi-Turn LLM Ideation

When researchers iteratively refine ideas with large language models, do the models preserve fidelity to the original objective? We introduce DriftBench...

MoDora: Tree-Based Semi-Structured Document Analysis System

Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irre...

MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation.

MORPHOGEN: A Multilingual Benchmark for Evaluating G... - published at ACL 2026.

MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games

We present a scalable methodology for evaluating language models in multi-turn interactions, using a suite of collaborative games that require effective...

MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations

We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval augmented generation, a popular use of large language models. We...

MULSUM: A Multimodal Summarization System with Vis-Aligner and Diversity-Aware Image Selection.

MULSUM: A Multimodal Summarization System with Vis-A... - published at EACL 2026.

Multi-Aspect Knowledge Distillation for Language Model with Low-rank Factorization

Knowledge distillation is an effective technique for pre-trained language model compression. However, existing methods only focus on the knowledge distr...

Multi-Task Pre-Finetuning of Lightweight Transformer Encoders for Text Classification and NER.

Multi-Task Pre-Finetuning of Lightweight Transformer... - published at EMNLP 2025.

Multilingual Self-Taught Faithfulness Evaluators.

Multilingual Self-Taught Faithfulness Evaluators. - published at EACL 2026.

Narrative Media Framing in Political Discourse.

Narrative Media Framing in Political Discourse. - published at ACL 2025.

Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning.

Nemotron-CrossThink: Scaling Self-Learning beyond Ma... - published at EACL 2026.

Neuron-Aware Data Selection In Instruction Tuning For Large Language Models

Instruction Tuning (IT) has been proven to be an effective approach to unlock the powerful capabilities of large language models (LLMs). Recent studies...

No Universal Courtesy: A Cross-Linguistic, Multi-Model Study of Politeness Effects on LLMs Using the PLUM Corpus

This paper explores the response of Large Language Models (LLMs) to user prompts with different degrees of politeness and impoliteness. The Politeness T...

NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches

We introduce NOBLE (Nonlinear lOw-rank Branch for Linear Enhancement), an architectural augmentation that adds nonlinear low-rank branches to transforme...

NormAL LoRA: What is the perfect size?

NormAL LoRA: What is the perfect size? - published at EMNLP 2025.

Odysseus Navigates the Sirens' Song: Dynamic Focus Decoding for Factual and Diverse Open-Ended Text Generation.

Odysseus Navigates the Sirens' Song: Dynamic Fo... - published at ACL 2025.

On the Proper Treatment of Units in Surprisal Theory

Surprisal theory links human processing effort to the predictability of an upcoming linguistic unit, but empirical work often leaves the notion of a uni...

On the Rejection Criterion for Proxy-based Test-time Alignment

Recent works proposed test-time alignment methods that rely on a small aligned model as a proxy that guides the generation of a larger base (unaligned)...

One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers.

One Tokenizer To Rule Them All: Emergent Language Pl... - published at ACL 2026.

Open Political Corpora: Structuring, Searching, and Analyzing Political Text Collections with PoliCorp.

Open Political Corpora: Structuring, Searching, and... - published at EMNLP 2025.

Optimizing Korean-Centric LLMs via Token Pruning

This paper presents a systematic benchmark of state-of-the-art multilingual large language models (LLMs) adapted via token pruning - a compression techn...

pEBR: A Probabilistic Approach to Embedding Based Retrieval.

pEBR: A Probabilistic Approach to Embedding Based Re... - published at EMNLP 2025.

Persona-SQ: A Personalized Suggested Question Generation Framework For Real-world Documents.

Persona-SQ: A Personalized Suggested Question Genera... - published at NAACL 2025.

PledgeTracker: A System for Monitoring the Fulfilment of Pledges.

PledgeTracker: A System for Monitoring the Fulfilmen... - published at EMNLP 2025.

PO-KGQA: Preference Optimization for Low-Resource Complex Knowledge Graph Question Answering.

PO-KGQA: Preference Optimization for Low-Resource Co... - published at ACL 2026.

Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection

Constructing computer-aided design (CAD) models is labor-intensive but essential for engineering and manufacturing. Recent advances in Large Language Mo...

PONTE: Personalized Orchestration for Natural Language Trustworthy Explanations

Explainable Artificial Intelligence (XAI) seeks to enhance the transparency and accountability of machine learning systems, yet most methods follow a on...

Position-Aware Depth Decay Decoding (D³): Boosting Large Language Model Inference Efficiency.

Position-Aware Depth Decay Decoding (D³): Boosting L... - published at ACL 2025.

Position: Vector Prompt Interfaces Should Be Exposed to Enable Customization of Large Language Models

As large language models (LLMs) transition from research prototypes to real-world systems, customization has emerged as a central bottleneck. While text...

Predicting States of Understanding in Explanatory Interactions Using Cognitive Load-Related Linguistic Cues

We investigate how verbal and nonverbal linguistic features, exhibited by speakers and listeners in dialogue, can contribute to predicting the listener'...

Preference Packing: Efficient Preference Optimization for Large Language Models

Resource-efficient training optimization techniques are becoming increasingly important as the size of large language models (LLMs) continues to grow. I...

Preference-Aware Rubric Learning for Personalized Evaluation

As Large Language Models (LLMs) evolve from general-purpose assistants to user-centric agents, personalization has become central to aligning model beha...

PRISM: LLM-Guided Semantic Clustering for High-Precision Topics

In this paper, we propose Precision-Informed Semantic Modeling (PRISM), a structured topic modeling framework combining the benefits of rich representat...

PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforc...

Problem-Solving Logic Guided Curriculum In-Context Learning for LLMs Complex Reasoning.

Problem-Solving Logic Guided Curriculum In-Context L... - published at ACL 2025.

Proceedings of Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning @ COLING 2025.

Proceedings of Bridging Neurons and Symbols for Natu... - published at COLING 2025.

Proceedings of Context and Meaning: Navigating Disagreements in NLP Annotation.

Proceedings of Context and Meaning: Navigating Disag... - published at COLING 2025.

Proceedings of the 5th Celtic Language Technology Workshop.

Proceedings of the 5th Celtic Language Technology Wo... - published at COLING 2025.

Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal).

Proceedings of the Joint Workshop of the 9th Financi... - published at COLING 2025.

PromptLab: A Collaborative Platform for Prompt Engineering and Dataset Curation.

PromptLab: A Collaborative Platform for Prompt Engin... - published at EACL 2026.

Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody

While second language (L2) learners may acquire target syntactic word order, mapping this syntax onto appropriate prosodic structures remains a persiste...

Rad-Flamingo: A Multimodal Prompt driven Radiology Report Generation Framework with Patient-Centric Explanations.

Rad-Flamingo: A Multimodal Prompt driven Radiology R... - published at EACL 2026.

ReAct: Synergizing Reasoning and Acting in Language Models

Engineering breakdown of the ReAct paper (Yao et al., 2022) - the foundation of every AI agent built today. Plain English, production viability rating, implementation notes.

REaR : Retrieve, Expand and Refine for Effective Multitable Retrieval.

REaR : Retrieve, Expand and Refine for Effective Mul... - published at ACL 2026.

Reasoning Gets Harder for LLMs Inside A Dialogue

Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that diffe...

Reasoning Knowledge Filter for Logical Table-to-Text Generation.

Reasoning Knowledge Filter for Logical Table-to-Text... - published at COLING 2025.

Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Governance.

Reasoning-Enhanced Domain-Adaptive Pretraining of Mu... - published at EMNLP 2025.

RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval

We propose RecaLLM, a set of reasoning language models post-trained to make effective use of long-context information. In-context retrieval, which ident...

RECIPE-TKG: From Sparse History to Structured Reasoning for LLM-based Temporal Knowledge Graph Completion.

RECIPE-TKG: From Sparse History to Structured Reason... - published at EACL 2026.

Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the complex reasoning capabilities of Large Reaso...

Red Queen: Exposing Latent Multi-Turn Risks in Large Language Models.

Red Queen: Exposing Latent Multi-Turn Risks in Large... - published at ACL 2025.

RedOne: Revealing Domain-specific LLM Post-Training in Social Networking Services.

RedOne: Revealing Domain-specific LLM Post-Training... - published at EMNLP 2025.

Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation.

Registering Source Tokens to Target Language Spaces... - published at ACL 2025.

Reinforcement Learning for Aligning Large Language Models Agents with Interactive Environments: Quantifying and Mitigating Prompt Overfitting.

Reinforcement Learning for Aligning Large Language M... - published at NAACL 2025.

Reliability Gated Multi-Teacher Distillation for Low Resource Abstractive Summarization

We study multiteacher knowledge distillation for low resource abstractive summarization from a reliability aware perspective. We introduce EWAD (Entropy...

Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding

Large language models (LLMs) have revolutionized Text-to-SQL generation, allowing users to query structured data using natural language with growing eas...

Reliable Multilingual Orthopedic Decision Support from Clinical Narratives: Language-Aware Adaptation and Verification-Guided Deferral

Multilingual orthopedic decision support remains challenging in low-resource healthcare settings, where clinical narratives contain specialized terminol...

Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling

Recent research has shown that filtering massive English web corpora into high-quality subsets significantly improves training efficiency. However, for...

Representing the Under-Represented: Cultural and Core Capability Benchmarks for Developing Thai Large Language Models.

Representing the Under-Represented: Cultural and Cor... - published at COLING 2025.

Research Roadmap: The Evolution of RAG

Read the 8 most important RAG papers in the right order. From the original Lewis et al. through GraphRAG. Full engineering context between each paper.

Rethinking Multiple-Choice Questions for RLVR: Unlocking Potential via Distractor Design

Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capabilities of Large Language Models. When applied to RLVR,...

Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG

Retrieval-augmented generation (RAG) is a common way to ground language models in external documents and up-to-date information. Classical retrieval sys...

RevieWeaver: Weaving Together Review Insights by Leveraging LLMs and Semantic Similarity.

RevieWeaver: Weaving Together Review Insights by Lev... - published at NAACL 2025.

ReViSQL: Achieving Human-Level Text-to-SQL

Translating natural language to SQL (Text-to-SQL) is a critical challenge in both database research and data analytics applications. Recent efforts have...

RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models

Reward models are central to aligning large language models (LLMs) with human preferences. Yet most approaches rely on pointwise reward estimates that o...

RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation.

RichRAG: Crafting Rich Responses for Multi-faceted Q... - published at COLING 2025.

RouterKGQA: Specialized--General Model Routing for Constraint-Aware Knowledge Graph Question Answering

Knowledge graph question answering (KGQA) is a promising approach for mitigating LLM hallucination by grounding reasoning in structured and verifiable k...

RTSM: Knowledge Distillation with Diverse Signals for Efficient Real-Time Semantic Matching in E-Commerce.

RTSM: Knowledge Distillation with Diverse Signals fo... - published at NAACL 2025.

RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution

Humans solve problems by executing targeted plans, yet large language models (LLMs) remain unreliable for structured workflow execution. We propose RunA...

Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification.

Safe: Enhancing Mathematical Reasoning in Large Lang... - published at ACL 2025.

SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement

Recursive self-improvement is moving from theory to practice: modern systems can critique, revise, and evaluate their own outputs, yet iterative self-mo...

SC-Taxo: Hierarchical Taxonomy Generation under Semantic Consistency Constraints using Large Language Models

Scientific literature is expanding at an unprecedented pace, making it increasingly challenging to efficiently organize and access domain knowledge. A h...

Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning

The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior ste...

SciClaims: An End-to-End Generative System for Biomedical Claim Analysis.

SciClaims: An End-to-End Generative System for Biome... - published at EMNLP 2025.

Script-Agnosticism and its Impact on Language Identification for Dravidian Languages.

Script-Agnosticism and its Impact on Language Identi... - published at NAACL 2025.

sDPO: Don't Use Your Data All at Once.

sDPO: Don't Use Your Data All at Once. - published at COLING 2025.

SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages.

SeaLLMs 3: Open Foundation and Chat Multilingual Lar... - published at NAACL 2025.

Self-Distilled RLVR

On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide...

Semantic Invariance in Agentic AI

Large Language Models (LLMs) increasingly serve as autonomous reasoning agents in decision support, scientific problem-solving, and multi-agent coordina...

Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models

Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks. However, the truthfulness of their outputs is not guarantee...

Semantic Triplet Restoration: A Novel Protocol for Hierarchical Table Understanding in Large Language Models

Table question answering requires models to recover semantic relations encoded implicitly by two-dimensional layout, merged cells, and hierarchical head...

Semi-automatic Sequential Sentence Classification in the Discourse Analysis Tool Suite.

Semi-automatic Sequential Sentence Classification in... - published at NAACL 2025.

Sens-Merging: Sensitivity-Guided Parameter Balancing for Merging Large Language Models.

Sens-Merging: Sensitivity-Guided Parameter Balancing... - published at ACL 2025.

Sentiment Analysis of German Sign Language Fairy Tales

We present a dataset and a model for sentiment analysis of German sign language (DGS) fairy tales. First, we perform sentiment analysis for three levels...

SlackAgents: Scalable Collaboration of AI Agents in Workspaces.

SlackAgents: Scalable Collaboration of AI Agents in... - published at EMNLP 2025.

SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs.

SoftCoT: Soft Chain-of-Thought for Efficient Reasoni... - published at ACL 2025.

SongSong: A Time Phonograph for Chinese SongCi Music from Thousand of Years Away

Recently, there have been significant advancements in music generation. However, existing models primarily focus on creating modern pop songs, making it...

SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and exec...

Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning

Automatic speech recognition (ASR) has benefited from advances in pretrained speech and language models, yet most systems remain constrained to monoling...

Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

Large Language Models (LLMs) are increasingly used as proxies for human perception in urban analysis, yet it remains unclear whether persona prompting p...

StoryScope: Investigating idiosyncrasies in AI fiction

As AI-generated fiction becomes increasingly prevalent, questions of authorship and originality are becoming central to how written work is evaluated. W...

Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation

Long conversations with an AI agent create a simple problem for one user: the history is useful, but carrying it verbatim is expensive. We study persona...

Structured Tender Entities Extraction from Complex Tables with Few-short Learning.

Structured Tender Entities Extraction from Complex T... - published at COLING 2025.

SwanNLP at SemEval-2026 Task 5: An LLM-based Framework for Plausibility Scoring in Narrative Word Sense Disambiguation

Recent advances in language models have substantially improved Natural Language Understanding (NLU). Although widely used benchmarks suggest that Large...

Synthetic Computers at Scale for Long-Horizon Productivity Simulation

Realistic long-horizon productivity work is strongly conditioned on user-specific computer environments, where much of the work context is stored and or...

TableCoder: Table Extraction from Text via Reliable Code Generation.

TableCoder: Table Extraction from Text via Reliable... - published at ACL 2025.

Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations.

Take Out Your Calculators: Estimating the Real Diffi... - published at ACL 2026.

Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation

Modern optimizers like Adam and Muon are central to training large language models, but their reliance on first- and second-order momenta introduces sig...

Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis

Large language models (LLMs) with reasoning capabilities have fueled a compelling narrative that reasoning universally improves performance across langu...

Task-Centric Acceleration of Small-Language Models

Small language models (SLMs) have emerged as efficient alternatives to large language models for task-specific applications. However, they are often emp...

TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning

Traditional vision-language models struggle with contrastive fine-grained taxonomic reasoning, particularly when distinguishing between visually similar...

TelAgentBench: A Multi-faceted Benchmark for Evaluating LLM-based Agents in Telecommunications.

TelAgentBench: A Multi-faceted Benchmark for Evaluat... - published at EMNLP 2025.

Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek

This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT) for Ancient Greek (AG)...

Text-Attributed Graph Learning with Coupled Augmentations.

Text-Attributed Graph Learning with Coupled Augmenta... - published at COLING 2025.

The Art That Poses Back: Assessing AI Pastiches after Contemporary Artworks

This study explores artificial visual creativity, focusing on ChatGPT's ability to generate new images intentionally pastiching original artworks such a...

The Company You Keep: How LLMs Respond to Dark Triad Traits

Large Language Models (LLMs) often exhibit highly agreeable and reinforcing conversational styles, also known as AI-sycophancy. Although this behavior i...

The EpisTwin: A Knowledge Graph-Grounded Neuro-Symbolic Architecture for Personal AI

Personal Artificial Intelligence is currently hindered by the fragmentation of user data across isolated silos. While Retrieval-Augmented Generation off...

The Invalsi Benchmarks: measuring the Linguistic and Mathematical understanding of Large Language Models in Italian.

The Invalsi Benchmarks: measuring the Linguistic and... - published at COLING 2025.

The LLM Language Network: A Neuroscientific Approach for Identifying Causally Task-Relevant Units.

The LLM Language Network: A Neuroscientific Approach... - published at NAACL 2025.

The Role of Handling Attributive Nouns in Improving Chinese-To-English Machine Translation.

The Role of Handling Attributive Nouns in Improving... - published at COLING 2025.

Thinking with DistilQwen: A Tale of Four Distilled Reasoning and Reward Model Series.

Thinking with DistilQwen: A Tale of Four Distilled R... - published at EMNLP 2025.

TIPA: Typologically Informed Parameter Aggregation.

TIPA: Typologically Informed Parameter Aggregation. - published at EACL 2026.

TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering

Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by extracting information or simple aggregation....

Toward Automatic Filling of Case Report Forms: A Case Study on Data from an Italian Emergency Department

Case Report Forms (CRFs) collect data about patients and are at the core of well-established practices to conduct research in clinical settings. With th...

Toward Guarantees for Clinical Reasoning in Vision Language Models via Formal Verification

Vision-language models (VLMs) show promise in drafting radiology reports, yet they frequently suffer from logical inconsistencies, generating diagnostic...

Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings.

Towards Mitigating Hallucinations in Large Vision-La... - published at ACL 2026.

Transparent AI for Mathematics: Transformer-Based Large Language Models for Mathematical Entity Relationship Extraction with XAI

Mathematical text understanding is a challenging task due to the presence of specialized entities and complex relationships between them. This study for...

TRUSTEVAL: A Dynamic Evaluation Toolkit on Trustworthiness of Generative Foundation Models.

TRUSTEVAL: A Dynamic Evaluation Toolkit on Trustwort... - published at NAACL 2025.

TT-SI: Self-Improving LLM Agents with Test-Time Training.

TT-SI: Self-Improving LLM Agents with Test-Time Trai... - published at ACL 2026.

Typology-Aware Multilingual Morphosyntactic Parsing with Joint Abstract Node Modeling.

Typology-Aware Multilingual Morphosyntactic Parsing... - published at ACL 2026.

UbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African Languages.

UbuntuGuard: A Culturally-Grounded Policy Benchmark... - published at ACL 2026.

UIPress: Bringing Optical Token Compression to UI-to-Code Generation

UI-to-Code generation requires vision-language models (VLMs) to produce thousands of tokens of structured HTML/CSS from a single screenshot, making visu...

Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume

Despite their capabilities, Multimodal Large Language Models (MLLMs) may produce plausible but erroneous outputs, hindering reliable deployment. Accurat...

UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

Semantic speech tokenizers have become a widely used interface for Audio-LLMs, owing to their compact single-codebook design and strong linguistic align...

Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often s...

Universal statistical laws governing culinary design

Cooking is a cultural expression of human creativity that transcends geography and time through the orchestration of ingredients and techniques, much li...

UrBLiMP: A Benchmark for Evaluating the Linguistic Competence of Large Language Models in Urdu.

UrBLiMP: A Benchmark for Evaluating the Linguistic C... - published at ACL 2026.

Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control

We present a method to identify a valence-arousal (VA) subspace within large language model representations. From 211k emotion-labeled texts, we derive...

VCRMNER: Visual Cue Refinement in Multimodal NER using CLIP Prompts.

VCRMNER: Visual Cue Refinement in Multimodal NER usi... - published at COLING 2025.

VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking

Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still heavily rely on greedy parsing over densely...

Vision-Language Models Struggle to Align Entities across Modalities.

Vision-Language Models Struggle to Align Entities ac... - published at ACL 2025.

Vision-Language Models Suppress Female Representations Under Ambiguous Input

Alignment teaches vision-language models (VLMs) to avoid expressing demographic biases, and when gender is clearly visible they largely succeed. Far les...

VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contrib...

VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certain...

VoxpopuliTTS: a large-scale multilingual TTS corpus for zero-shot speech generation.

VoxpopuliTTS: a large-scale multilingual TTS corpus... - published at COLING 2025.

Watching the AI Watchdogs: A Fairness and Robustness Analysis of AI Safety Moderation Classifiers.

Watching the AI Watchdogs: A Fairness and Robustness... - published at NAACL 2025.

What Am I Missing? Question-Answering as Hidden State Probing

Test-time reasoning has become a significant field of study since the introduction of chain-of-thought reasoning in large language models (LLMs). Howeve...

What Gets Unmasked First? Trajectory Analysis of Diffusion Models for Graph-to-Text Generation

We present the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation. We analyze MDLM generation trajectories...

What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning.

What Makes for Good Visual Instructions? Synthesizin... - published at COLING 2025.

When Contextual Inference Fails: Cancelability in Interactive Instruction Following

We investigate the separation of literal interpretation from contextual inference in a collaborative block-building task where a builder must resolve un...

When Do Language Models Endorse Limitations on Human Rights Principles?

As Large Language Models (LLMs) increasingly mediate global information access with the potential to shape public discourse, their alignment with univer...

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithf...

When RAG Chatbots Expose Their Backend: An Anonymized Case Study of Privacy and Security Risks in Patient-Facing Medical AI

Background: Patient-facing medical chatbots based on retrieval-augmented generation (RAG) are increasingly promoted to deliver accessible, grounded heal...

Where Do LLMs Compose Meaning? A Layerwise Analysis of Compositional Robustness.

Where Do LLMs Compose Meaning? A Layerwise Analysis... - published at EACL 2026.

Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-righ...

Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective.

Why Do LLM-based Web Agents Fail? A Hierarchical Pla... - published at ACL 2026.

World Properties without World Models: Recovering Spatial and Temporal Structure from Co-occurrence Statistics in Static Word Embeddings

Recent work interprets the linear recoverability of geographic and temporal variables from large language model (LLM) hidden states as evidence for worl...

XGUARD: A Graded Benchmark for Evaluating Safety Failures of Large Language Models on Extremist Content.

XGUARD: A Graded Benchmark for Evaluating Safety Fai... - published at ACL 2026.

You Can't Fight in Here! This is BBS!

Norm, the formal theoretical linguist, and Claudette, the computational language scientist, have a lovely time discussing whether modern language models...