368 docs tagged with "llm"

$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners

Test-time scaling for complex reasoning tasks shows that leveraging inference-time compute, by methods such as independently sampling and aggregating mu...

$τ$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge

Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on retrieving and applying domain-specif...

A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms....

A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effe...

A Novel Hierarchical Multi-Agent System for Payments Using LLMs

Large language model (LLM) agents, such as OpenAI's Operator and Claude's Computer Use, can automate workflows but are unable to handle payment tasks. Exist...

Abductive Reasoning with Syllogistic Forms in Large Language Models

Research in AI using Large-Language Models (LLMs) is rapidly evolving, and the comparison of their performance with human reasoning has become a key con...

Across the Levels of Analysis: Explaining Predictive Processing in Humans Requires More Than Machine-Estimated Probabilities

Under the lens of Marr's levels of analysis, we critique and extend two claims about language models (LMs) and language processing: first, that predicti...

Adaptive Greedy Frame Selection for Long Video Understanding

Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of inp...

Adaptive Querying with AI Persona Priors

We study adaptive querying for learning user-dependent quantities of interest, such as responses to held-out items and psychometric indicators, within t...

Advanced RAG Patterns

Go beyond naive RAG - master query transformation, HyDE, multi-query retrieval, Self-RAG, Corrective RAG, and iterative retrieval patterns for complex questions.

Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention

Transformer attention is typically implemented using softmax normalization, which enforces attention weights with unit sum normalization. While effectiv...

Agent Evaluation

Measuring LLM agent performance through trajectory analysis, benchmark suites, LLM-as-judge, failure taxonomies, and production monitoring strategies.

Agent Safety and Guardrails

Implementing defense-in-depth safety for production LLM agents - prompt injection defense, input/output guardrails, tool sandboxing, HITL confirmation, and audit logging.

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual parti...

Agentic Jackal: Live Execution and Semantic Value Grounding for Text-to-JQL

Translating natural language into Jira Query Language (JQL) requires resolving ambiguous field references, instance-specific categorical values, and com...

Agentic RAG

Build agents that control their own retrieval - multi-step reasoning, router agents, ReAct loops, LangGraph stateful pipelines, and production patterns for agentic retrieval systems.

AgenticOCR: Parsing Only What You Need for Efficient Retrieval-Augmented Generation

The expansion of retrieval-augmented generation (RAG) into multimodal domains has intensified the challenge for processing complex visual documents, suc...

AgentIR: Reasoning-Aware Retrieval for Deep Research Agents

Deep Research agents are rapidly emerging as primary consumers of modern retrieval systems. Unlike human users who issue and refine queries without docu...

AI Safety Evaluations

Safety benchmarks, capability evaluations, LLM judges, uplift assessments, and how labs like Anthropic use evaluation-gated deployment through Responsible Scaling Policies.

AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning

We present a winning three-stage system for SemEval 2026 Task 12: Abductive Event Reasoning that combines graph-based retrieval, LLM-driven abductive re...

An Agentic Approach to Generating XAI-Narratives

Explainable AI (XAI) research has experienced substantial growth in recent years. Existing XAI methods, however, have been criticized for being technica...

An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models

Direct Preference Optimization (DPO) is widely used after supervised fine-tuning (SFT) to align language models, yet empirical behavior under small back...

An Independent Safety Evaluation of Kimi K2.5

Kimi K2.5 is an open-weight LLM that rivals closed models across coding, multimodal, and agentic benchmarks, but was released without an accompanying sa...

Are Full Rollouts Necessary for On-Policy Distillation?

On-policy distillation (OPD) provides dense teacher feedback along rollouts generated by the student and has emerged as a promising post-training paradi...

ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models

Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with...

ARGUS: Seeing the Influence of Narrative Features on Persuasion in Argumentative Texts

Can narratives make arguments more persuasive? And to this end, which narrative features matter most? Although stories are often seen as powerful tools...

Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent

The rapid advancement of large language models (LLMs) has enabled powerful authorship inference capabilities, raising growing concerns about unintended...

AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency

Large language models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex tasks. Yet ensuring that the reasoning trace both co...

Attention Is All You Need

The 2017 Vaswani et al. paper that replaced recurrent networks with pure attention - and why it changed everything.

Audio-Language Models

How modern AI systems process speech and audio - from Whisper's spectrogram-based ASR to end-to-end audio understanding in GPT-4o and Gemini.

Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM

This paper studies Automated Instruction Revision (AIR), a rule-induction-based method for adapting large language models (LLMs) to downstream tasks usi...

Autoregressive Decoding

Understand how LLMs generate tokens one at a time, why decoding is memory-bandwidth bound, and how to reason about inference latency with the roofline model.

BAGEL: Benchmarking Animal Knowledge Expertise in Language Models

Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well language models...

BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

Large language models (LLMs) often produce confident but incorrect answers in settings where abstention would be safer. Standard evaluation protocols, h...

Benchmarks: MMLU, HumanEval, and HELM

Navigate the LLM benchmark ecosystem - what each benchmark actually measures, saturation, contamination, and how to build benchmarks that can't be gamed.

BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali

Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LL...

BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation

Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In...

Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

Large language models (LLMs) are becoming increasingly capable mathematical collaborators, but static benchmarks are no longer sufficient for evaluating...

Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation

Evaluating the factuality of long-form output generated by large language models (LLMs) remains challenging, particularly when responses are open-ended...

Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing

Recent advances in multimodal Retrieval-Augmented Generation (RAG) enable Large Language Models (LLMs) to analyze enterprise spreadsheet workbooks conta...

Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations

Large language models are increasingly deployed in settings where reliability matters, yet output-level uncertainty signals such as token probabilities,...

Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation

Large language models (LLMs) encode vast world knowledge in their parameters, yet they remain fundamentally limited by static knowledge, finite context...

BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation

Large language models with web search are increasingly used in scientific publishing agents, yet they still produce BibTeX entries with pervasive field-...

BLEU, ROUGE, and Generation Metrics

Master reference-based generation metrics - BLEU, ROUGE, BERTScore, BLEURT - and know exactly when each one lies to you.

Caching Strategies

Four caching layers for LLM applications - exact match, semantic similarity, provider prefix caching, and KV cache - with implementation patterns and production tradeoffs.

Calling LLM APIs

Production patterns for calling LLM APIs - authentication, retry logic, rate limiting, error handling, async calls, and the Anthropic and OpenAI Python SDKs.

Can Coding Agents Reproduce Findings in Computational Materials Science?

Large language models are increasingly deployed as autonomous coding agents and have achieved remarkably strong performance on software engineering benc...

Can LLMs Understand the Impact of Trauma? Costs and Benefits of LLMs Coding the Interviews of Firearm Violence Survivors

Firearm violence is a pressing public health issue, yet research into survivors' lived experiences remains underfunded and difficult to scale. Qualitati...

Case Studies: Production LLM Systems

Five detailed production LLM architectures - GitHub Copilot, Notion AI, customer support bots, enterprise RAG, and code review agents - with real architecture decisions, scale numbers, and lessons learned.

Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision

Evidence-grounded reasoning requires more than attaching retrieved text to a prediction: a model should make decisions that depend on whether the provid...

Causal Language Modeling and GPT

Learn how GPT-style autoregressive models work, the evolution from GPT-1 to GPT-4, sampling strategies, and why causal LM became the dominant paradigm for LLMs.

Causality Elicitation from Large Language Models

Large language models (LLMs) are trained on enormous amounts of data and encode knowledge in their parameters. We propose a pipeline to elicit causal re...

Chain-of-Thought Prompting

Learn how to unlock multi-step reasoning in LLMs by making them think out loud - and why this simple technique dramatically improves accuracy on complex tasks.

Chain-of-Thought Reasoning at Inference Time

How chain-of-thought prompting transforms model reasoning - from the Wei et al. 2022 breakthrough to self-consistency, process supervision, and the faithfulness problem.

Characterizing the Expressivity of Local Attention in Transformers

The transformer is the most popular neural architecture for language modeling. The cornerstone of the transformer is its global attention mechanism, whi...

CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery

Large language models (LLMs) have created new opportunities to enhance the efficiency of scholarly activities; however, challenges persist in the ethica...

CLARIN-PT-LDB: An Open LLM Leaderboard for Portuguese to assess Language, Culture and Civility

This paper reports on the development of a leaderboard of Open Large Language Models (LLM) for European Portuguese (PT-PT), and on its associated benchm...

CLIP and Contrastive Learning

How CLIP uses contrastive learning on 400M image-text pairs to build a shared semantic embedding space - enabling zero-shot classification without labeled data.

Code and Math Specialized Models

How domain-specific pre-training and fine-tuning on code and math data produces models that outperform general LLMs on programming and reasoning tasks - and when to use them in production.

Code Fingerprints: Disentangled Attribution of LLM-Generated Code

The rapid adoption of Large Language Models (LLMs) has transformed modern software development by enabling automated code generation at scale. While the...

COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics

Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches face a funda...

CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning

Mobile Agents can autonomously execute user instructions, which requires hybrid-capabilities reasoning, including screen summary, subtask planning, acti...

Consolidating Rewarded Perturbations for LLM Post-Training

Post-training of language models is commonly framed as a sample-score-update loop implemented by gradient descent. A recent line of work, exemplified by...

Constitutional AI

How Anthropic replaced human feedback with AI feedback guided by explicit principles - the Constitutional AI technique, RLAIF, and how it enables scalable alignment.

Constrained Decoding - How It Works

The mathematics of constrained decoding - finite-state machines, token masking, context-free grammars, and how the Outlines library achieves guaranteed JSON schema conformance at generation time.

Context Compression Techniques

How LLMLingua, AutoCompressors, GIST tokens, and selective compression reduce long contexts to fewer tokens while preserving the information needed to answer queries.

Context Window Extension - YaRN, LongRoPE, LongLoRA

How position interpolation, NTK-aware scaling, YaRN, and LongLoRA extend pretrained models to context windows far beyond their original training length.

Context Window Management

Engineering strategies for managing context windows in production LLM applications - history truncation, compression, RAG ordering, and prompt caching design.

Continual Adaptation for Pacific Indigenous Speech Recognition

Speech foundation models struggle with low-resource Pacific Indigenous languages because of severe data scarcity. Furthermore, full fine-tuning risks ca...

Continuous Batching

Learn how continuous batching eliminates GPU idle time by replacing finished sequences immediately rather than waiting for the longest request in a batch to complete.

Controllable Reasoning Models Are Private Thinkers

AI agents powered by reasoning models require access to sensitive user data. However, their reasoning traces are difficult to control, which can result...

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on kno...

Current LLMs still cannot 'talk much' about grammar modules: Evidence from syntax

We aim to examine the extent to which Large Language Models (LLMs) can 'talk much' about grammar modules, providing evidence from syntax core properties...

DARE - Delta Weight Sparsification

How DARE randomly drops delta weights and rescales the remainder to dramatically reduce interference when merging multiple fine-tuned models.

DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benc...

Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving

Large Language Model (LLM) adapters enable low-cost model specialization, but introduce complex caching and scheduling challenges in distributed serving...

daVinci-Env: Open SWE Environment Synthesis at Scale

Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for...

DeepSeek MoE Architecture

DeepSeek's innovations in mixture of experts - fine-grained experts, shared experts, DeepSeek-V2 and V3, multi-token prediction, and training for $6M.

DeepSeek-R1 - Open Source Reasoning

How DeepSeek built an open-weights reasoning model using pure RL with GRPO, the R1-Zero experiment, distillation into smaller models, and what open-source reasoning means for the research community.

Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents

Large language models and deep research agents supply citation URLs to support their claims, yet the reliability of these citations has not been systema...

Detecting and Suppressing Reward Hacking with Gradient Fingerprints

Reinforcement learning with verifiable rewards (RLVR) typically optimizes for outcome rewards without imposing constraints on intermediate reasoning. Th...

Developing and evaluating a chatbot to support maternal health care

The ability to provide trustworthy maternal health information using phone-based chatbots can have a significant impact, particularly in low-resource se...

Diffusion Models

How denoising diffusion models learn to reverse a Gaussian noise process to generate high-quality images from text prompts.

Directed Social Regard: Surfacing Targeted Advocacy, Opposition, Aid, Harms, and Victimization in Online Media

The language in online platforms, influence operations, and political rhetoric frequently directs a mix of pro-social sentiment (e.g., advocacy, helpful...

Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection

Human disagreement is ubiquitous and well-known in labeling. However, variation in explanations, captured through token-level human rationales, remains...

Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems

Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems. Conventional ASR-LLM-TTS pipelines follow a...

Do LLMs Benefit From Their Own Words?

Multi-turn interactions with large language models typically retain the assistant's own past responses in the conversation history. In this work, we rev...

Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

Reasoning in vision-language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks....

Document Chunking Strategies

Master the art and science of splitting documents into chunks that maximize retrieval precision - the most underestimated decision in RAG system design.

Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts

Automated annotation of pedagogical dialogue is a high-stakes task where LLMs often fail without sufficient domain grounding. We present a domain-adapte...

DPO and Modern Alignment Techniques

Direct Preference Optimization and its successors - how DPO eliminates the need for a separate reward model and RL training, plus IPO, KTO, SimPO, and ORPO.

DPO: Direct Preference Optimization

Master DPO - the elegant insight that you can optimize LLMs for human preferences without training a reward model or running RL, derived directly from the optimal RLHF policy.

DS$^2$-Instruct: Domain-Specific Data Synthesis for Large Language Models Instruction Tuning

Adapting Large Language Models (LLMs) to specialized domains requires high-quality instruction tuning datasets, which are expensive to create through hu...

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-st...

Ease of dependency distance minimization in star-like structures

The syntactic structure of a sentence can be represented as a tree where edges indicate syntactic dependencies between words. When that structure is a s...

EGREFINE: An Execution-Grounded Optimization Framework for Text-to-SQL Schema Refinement

Text-to-SQL enables non-expert users to query databases in natural language, yet real-world schemas often suffer from ambiguous, abbreviated, or inconsi...

Embedding Models - The Landscape

A comprehensive survey of the embedding model ecosystem - SBERT, contrastive learning, SimCSE, E5, BGE, GTE, OpenAI, Voyage AI, Cohere, and the MTEB leaderboard.

Embedding Models Deep Dive

Master embedding model selection for retrieval - MTEB benchmarks, model families, Matryoshka embeddings, bi-encoders vs cross-encoders, and fine-tuning strategies.

Embedding Quantization

Reducing embedding storage and search costs - float32 to float16, int8, and binary quantization, Hamming distance search, the rescoring trick, and implementation with FAISS and Qdrant.

Embedding Spaces

How token embeddings form a dense vector space that captures semantic meaning - geometry, anisotropy, weight tying, and visualization.

Embeddings in Production

Build, deploy, and operate production-grade embedding pipelines - caching, incremental indexing, staleness management, vector DB selection, and cost optimization at scale.

Encoder vs Decoder vs Encoder-Decoder

Comparing encoder-only, decoder-only, and encoder-decoder transformer architectures - when to use each and why decoder-only won.

Enhancing Hyperspace Analogue to Language (HAL) Representations via Attention-Based Pooling for Text Classification

The Hyperspace Analogue to Language (HAL) model relies on global word co-occurrence matrices to construct distributional semantic representations. While...

ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal requiremen...

EU AI Act and Global AI Regulation

The EU AI Act, US executive orders, UK AI policy, China AI regulations, and practical compliance implications for AI engineers building and deploying language models.

Evaluating Embedding Models

MTEB benchmark deep dive, nDCG@10, Recall@K, MRR, MAP, building domain-specific evaluation sets, running MTEB locally, and avoiding the contamination problem.

Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models

In contested domains, instruction-tuned language models must balance user-alignment pressures against faithfulness to the in-context evidence. To evalua...

Evaluating Factual Density in Multi-Source RAG: A Study in Medical AI Accuracy

Retrieval-Augmented Generation (RAG) is the current industry standard for grounding AI in real-world facts. Traditional retrieval methods rely on keywor...

Evaluating Reasoning Models

The benchmark landscape for reasoning models - AIME, MATH-500, Codeforces, ARC-AGI, GPQA Diamond, process vs. outcome evaluation, and contamination concerns.

Evaluation of Deontic Conditional Reasoning in Large Language Models: The Case of Wason's Selection Task

As large language models (LLMs) advance in linguistic competence, their reasoning abilities are gaining increasing attention. In humans, reasoning often...

Exploration Hacking: Can LLMs Learn to Resist RL Training?

Reinforcement learning (RL) has become essential to the post-training of large language models (LLMs) for reasoning, agentic capabilities and alignment....

Feed-Forward Layers

The role of position-wise feed-forward networks in transformers - from the basic FFN to SwiGLU and Mixture of Experts.

Few-Shot Prompting

Master in-context learning by providing carefully selected examples that demonstrate the exact behavior you want - without any model fine-tuning.

Fine-Tuning Embedding Models for Your Domain

Contrastive fine-tuning with triplet loss, hard negative mining, in-batch negatives, synthetic data generation, TSDAE, GPL, and a full worked example on domain adaptation.

Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations....

FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios

Large language models (LLMs) are increasingly applied in financial scenarios. However, they may produce harmful outputs, including facilitating illegal...

Frankenmodels and Limitations of Model Merging

Layer grafting, depth upscaling, Solar 10.7B, and the fundamental limits of what model merging can and cannot achieve.

Frequency-Ordered Tokenization for Better Text Compression

We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequenc...

From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text

The complexity of Vietnam's legal texts presents a significant barrier to public access to justice. While Large Language Models offer a promising soluti...

From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring

Large language models (LLMs) have recently reshaped Automated Essay Scoring (AES), yet prior studies typically examine individual techniques in isolatio...

From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

Reinforcement learning (RL) for large language models (LLMs) increasingly relies on sparse, outcome-level rewards, yet determining which actions withi...

Full Fine-Tuning vs PEFT: Decision Framework

A practical decision framework for choosing between full fine-tuning, LoRA, QLoRA, prompt tuning, and other PEFT methods based on your model size, data, and quality requirements.

Graph RAG

Master Microsoft's Graph RAG - build knowledge graphs from documents, use community detection for global queries, and understand when graph structure beats flat vector search.

Guardrails and Safety Systems

Build layered defense-in-depth safety systems for LLM applications - input filtering, toxicity detection, PII redaction, prompt injection defense, output validation, and human review escalation.

H-RAG at SemEval-2026 Task 8: Hierarchical Parent-Child Retrieval for Multi-Turn RAG Conversations

We present H-RAG, our submission to SemEval-2026 Task 8 (MTRAGEval), addressing both Task A (Retrieval) and Task C (Generation with Retrieved Passages)....

HMS-BERT: Hybrid Multi-Task Self-Training for Multilingual and Multi-Label Cyberbullying Detection

Cyberbullying on social media is inherently multilingual and multi-faceted, where abusive behaviors often overlap across multiple categories. Existing m...

HuggingFace Hub and Model Cards

Master the HuggingFace Hub as your primary interface for finding, evaluating, and deploying open-source models. Learn to read model cards, use the Hub API, and navigate 800k+ models efficiently.

Human Evaluation

Design rigorous human evaluation studies for LLMs - from annotation protocols to inter-annotator agreement to Chatbot Arena methodology.

Hybrid Architectures - Jamba and Beyond

How combining attention and Mamba layers creates models that outperform pure architectures - Jamba's design, the attention-to-Mamba ratio, MoE integration, and the emerging hybrid landscape.

Hybrid Search: Dense and Sparse

Combine BM25 sparse retrieval with dense vector search for best-of-both-worlds performance - understand SPLADE, fusion methods, and when hybrid beats pure dense.

If LLMs Have Human-Like Attributes, Then So Does Age of Empires II

Much research has been carried out on large language models (LLMs) and LLM-powered agentic workflows. However, many works within the field state emergen...

InCoder-32B-Thinking: Industrial Code World Model for Thinking

Industrial software development across chip design, GPU optimization, and embedded systems lacks expert reasoning traces showing how engineers reason ab...

Inference Cost Optimization

Learn how to systematically reduce LLM inference costs using model selection, quantization, caching, request routing, prompt compression, and infrastructure strategies.

Inference Optimization for MoE Models

Production techniques for serving MoE models efficiently - expert caching, CPU offloading, vLLM support, tensor vs. expert parallelism, batch size sensitivity, and quantization strategies.

InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

Reducing the hardware footprint of large language models (LLMs) during decoding is critical for efficient long-sequence generation. A key bottleneck is...

Instruction Tuning

How instruction tuning transforms base LLMs into general-purpose assistants that can follow diverse instructions, reason step by step, and generalize to new tasks.

Instructor - Structured Outputs with Pydantic

A complete guide to Jason Liu's Instructor library - Pydantic-based structured extraction, automatic retry on validation failure, multi-provider support, streaming, and production extraction patterns.

Interpretable Semantic Gradients in SSD: A PCA Sweep Approach and a Case Study on AI Discourse

Supervised Semantic Differential (SSD) is a mixed quantitative-interpretive method that models how text meaning varies with continuous individual-differ...

Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation

Error Span Detection (ESD) is a crucial subtask in Machine Translation (MT) evaluation, aiming to identify the location and severity of translation erro...

Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder

Training Transformer language models is expensive, as performance typically improves with increasing dataset size and computational budget. Although sca...

Jailbreaks and Adversarial Prompts

How safety training gets bypassed - jailbreak taxonomy, GCG attacks, many-shot jailbreaking, prompt injection, defenses, and why the arms race is hard to win.

JSON Mode and Tool/Function Schemas

A complete guide to native JSON mode, OpenAI Structured Outputs, tool calling for structured data, Anthropic tool use, parallel tool calls, and schema design best practices.

JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models

Adapter-based methods have become a cost-effective approach to continual learning (CL) for Large Language Models (LLMs), by sequentially learning a low-...

KCLarity at SemEval-2026 Task 6: Encoder and Zero-Shot Approaches to Political Evasion Detection

This paper describes the KCLarity team's participation in CLARITY, a shared task at SemEval 2026 on classifying ambiguity and evasion techniques in poli...

KV Cache

Learn how the key-value cache eliminates redundant attention computation in LLM inference, and how PagedAttention solves the memory fragmentation problem.

LangChain Deep Dive

A thorough guide to LangChain's core abstractions, LCEL composable pipelines, LangGraph stateful workflows, LangSmith observability, and when to use LangChain vs direct API calls.

Language Modeling Objectives

Learn the training objectives that teach LLMs to understand language - causal language modeling, masked language modeling, cross-entropy loss, and perplexity.

Language Models Learn Constructional Semantics, Not To Mention Syntax: Investigating LM Understanding of Paired-Focus Constructions

Grasping the semantics of rare constructions (form-meaning pairings) has been shown to be a challenging problem that has currently only been solved by t...

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely by...

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-...

Latency and Cost Tradeoffs

How to decompose LLM latency and cost, choose the right optimization strategies, and define SLOs that balance quality, speed, and budget.

Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning

Latent reasoning offers a more efficient alternative to explicit reasoning by compressing intermediate reasoning into continuous representations and sub...

Layer Normalization and Residual Connections

How layer normalization and residual connections stabilize gradient flow in deep transformers and enable training of 100+ layer networks.

Learning from Child-Directed Speech in Two-Language Scenarios: A French-English Case Study

Research on developmentally plausible language models has largely focused on English, leaving open questions about multilingual settings. We present a s...

Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

Large language model (LLM) agents require long-term user memory for consistent personalization, but limited context windows hinder tracking evolving pre...

Learning the Signature of Memorization in Autoregressive Language Models

All prior membership inference attacks for fine-tuned language models use hand-crafted heuristics (e.g., loss thresholding, Min-K%, reference calibrati...

Learning to Reason with Insight for Informal Theorem Proving

Although most of the automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with large language...

Limitations of Attention at Scale

Why the quadratic complexity of self-attention creates real production bottlenecks - memory, latency, and cost - and why sparse attention approximations only partially solve the problem.

Linear Interpolation and Model Soup

How weight averaging of fine-tuned models can produce better, more robust models than any individual fine-tune - and the task arithmetic framework for composing capabilities.

LLaMA Family Architecture

A deep dive into Meta's LLaMA model family - from LLaMA 1 through LLaMA 3.3 - covering RoPE embeddings, SwiGLU activation, RMSNorm, grouped query attention, and when to choose each variant.

LlamaIndex Deep Dive

A comprehensive guide to LlamaIndex's data-centric architecture - indices, query engines, workflows, multi-document agents, and how it compares to LangChain for RAG applications.

LLM Gateway and Routing

Design and operate an LLM gateway - unified API, model routing, circuit breakers, budget enforcement, and fallback chains - using LiteLLM and custom routing logic.

LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users - i.e., enable hu...

LLM Product Architecture

The three fundamental LLM product patterns - chat, workflow automation, and autonomous agents - and how to design the production service graph for each.

LLM-as-Judge

Use powerful LLMs to automatically evaluate other models - with position bias mitigation, CoT judging, and cost analysis.

LLMs - Engineering Track

A structured, production-grade LLM curriculum - from transformer architecture to alignment and safety. 17 modules covering every layer of the LLM stack.

LMQL and Guidance - Programmatic LLM Control

How Microsoft Guidance and LMQL extend structured generation to full programmatic control - interleaving generation with code, SQL-like constraints, token healing, and when each tool wins over Outlines and Instructor.

LoASR-Bench: Evaluating Large Speech Language Models on Low-Resource Automatic Speech Recognition Across Language Families

Large language models (LLMs) have driven substantial advances in speech language models (SpeechLMs), yielding strong performance in automatic speech rec...

Long-form RewardBench: Evaluating Reward Models for Long-form Generation

The widespread adoption of reinforcement learning-based alignment highlights the growing importance of reward models. Various benchmarks have been built...

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive dist...

LoRA: Low-Rank Adaptation

Master LoRA - the parameter-efficient fine-tuning method that trains only a tiny fraction of the weights (the LoRA paper reports up to 10,000x fewer trainable parameters on GPT-3 175B) while matching full fine-tuning quality, making LLM fine-tuning feasible on a single GPU.

Lost in the Middle - How LLMs Use Long Contexts

The empirical finding that LLMs reliably recall information at the beginning and end of long contexts but miss information in the middle, and strategies to mitigate this U-shaped performance degradation.

Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment

Although Automatic Speech Recognition (ASR) in Bengali has seen significant progress, processing long-duration audio and performing robust speaker diari...

Mamba - Selective State Space Models

How Mamba's input-dependent SSM parameters, hardware-aware parallel scan, and selective gating mechanism achieved linear-time sequence modeling competitive with transformers.

Mamba vs Transformer - When Each Wins

A rigorous benchmark comparison: perplexity, throughput, recall tasks, in-context learning, and the fundamental trade-off between compressed state and full context access.

Many Ways to Be Fake: Benchmarking Fake News Detection Under Strategy-Driven AI Generation

Recent advances in large language models (LLMs) have enabled the large-scale generation of highly fluent and deceptive news-like content. While prior wo...

Many-Tier Instruction Hierarchy in LLM Agents

Large language model agents receive instructions from many sources - system messages, user prompts, tool outputs, and more - each carrying different levels...

Mapping the Methodological Space of Classroom Interaction Research: Scale, Duration, and Modality in an Age of AI

Research on classroom interaction has long been divided between large-scale observation and in-depth ethnographic work. We propose a framework mapping t...

Masked Language Modeling and BERT

Understand how BERT learns bidirectional language representations using masked language modeling, its architecture, and how to fine-tune it for downstream tasks.

Matryoshka Representation Learning (MRL)

Nested embeddings where any prefix of dimensions is informative - training MRL, adaptive retrieval, 10x FLOP reduction, and how OpenAI's text-embedding-3 uses MRL internally.

Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (e.g., DeepSeek-R1 acknowledges hints 39% of the time), implying tha...

Measuring research data reuse in scholarly publications using generative artificial intelligence: Open Science Indicator development and preliminary results

Numerous metascience studies and other initiatives have begun to monitor the prevalence of open science practices when it is more important to understan...

Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

Large language model (LLM) agents are fundamentally bottlenecked by finite context windows on long-horizon tasks. As trajectories grow, retaining tool o...

Memory Systems: Short-Term and Long-Term

Designing memory systems for LLM agents - from in-context working memory to episodic retrieval, semantic knowledge bases, and procedural memory.

Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation

Large Language Models (LLMs) have demonstrated remarkable capability in machine translation on high-resource language pairs, yet their performance on lo...

MergeKit - The Practical Toolkit

How to use arcee-ai/mergekit to merge language models with YAML configuration, CPU-compatible layer-by-layer processing, and automated HuggingFace Hub upload.

Mind the Gap: Pitfalls of LLM Alignment with Asian Public Opinion

Large Language Models (LLMs) are increasingly being deployed in multilingual, multicultural settings, yet their reliance on predominantly English-centri...

Mistral and Mixtral Architecture

Mistral 7B's sliding window attention and grouped query attention innovations, and Mixtral 8x7B's Mixture of Experts design - sparse routing, expert selection, and why MoE delivers 70B quality at 13B active parameter cost.

Mixtral 8x7B - Architecture Deep Dive

Mistral AI's Mixtral 8x7B architecture - 8 experts with top-2 routing, sliding window attention, multilingual training, performance vs. Llama 2 70B, and serving requirements.

Mixture of Experts Architecture

The architecture of sparse MoE models - how expert networks replace dense FFN layers, top-k routing, and how parameter count relates to active compute.

ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models

As Large Language Models (LLMs) are increasingly deployed in cross-linguistic contexts, ensuring safety in diverse regulatory and cultural environments...

MM-StanceDet: Retrieval-Augmented Multi-modal Multi-agent Stance Detection

Multimodal Stance Detection (MSD) is crucial for understanding public discourse, yet effectively fusing text and image, especially with conflicting sign...

Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs

Multimodal LLMs can process speech and images, but they cannot hear a speaker's voice or see an object's texture. We show this is not a failure of encod...

Model Licensing and Compliance

Open-source model licenses are not all the same. Learn how Apache 2.0, LLaMA Community, RAIL, and custom licenses differ - what you can and cannot do in production, and how to build a compliance workflow.

Models Recall What They Violate: Constraint Adherence in Multi-Turn LLM Ideation

When researchers iteratively refine ideas with large language models, do the models preserve fidelity to the original objective? We introduce DriftBench...

Modern Alignment Techniques

Survey the post-RLHF alignment landscape - RLAIF, Constitutional AI, rejection sampling fine-tuning, iterative DPO, process reward models, and the open questions shaping the next generation of aligned models.

MoDora: Tree-Based Semi-Structured Document Analysis System

Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irre...

Module 01: Transformer Architecture

A complete guide to the transformer architecture - the foundation of every modern large language model.

Module 02 - Python for LLM Engineering Overview

Python patterns for building production LLM applications - API integration, streaming, prompt engineering, token management, tool use, and vector search.

Module 03: Prompt Engineering

Master the art and science of communicating with large language models - from basic zero-shot instructions to automated prompt optimization with DSPy.

Module 04: RAG Systems

Master Retrieval-Augmented Generation - the dominant pattern for grounding LLMs in external knowledge at production scale.

Module 06: LLM Evaluation

A complete guide to evaluating large language models - from perplexity to production monitoring.

Module 07: LLM Inference & Optimization

Master the systems and techniques that make large language model inference fast, efficient, and cost-effective at production scale.

Module 08: Multimodal Models

Understanding how modern AI systems process images, audio, and text together - from CLIP to diffusion to production pipelines.

Module 09: LLM System Design

Production architecture for AI-powered products - from prototype to reliable, scalable, cost-efficient systems.

Module 10: Reasoning Models

How modern LLMs learn to think - test-time compute, chain-of-thought, process reward models, and the architectures behind o1, o3, and DeepSeek-R1.

Module 11: Mixture of Experts

How sparse MoE models achieve massive capacity at lower compute cost - routing mechanisms, load balancing, Mixtral, and DeepSeek's innovations.

Module 12: State Space Models

A complete map of State Space Models - from the quadratic attention bottleneck to Mamba's selective recurrence, hybrid architectures, and production deployment.

Module 13: Structured Generation

A complete map of structured generation - from the reliability problem with free-text LLM output to constrained decoding, Outlines, Instructor, JSON mode, and production-grade extraction pipelines.

Module 14 Overview - Model Merging

How to combine multiple fine-tuned language models into a single, more capable model without any additional training.

Module 15 Overview - Long Context Strategies

How modern LLMs handle extremely long inputs - from the fundamental O(n²) attention problem to RoPE scaling, context compression, and production engineering for 128K+ context windows.

Module 16 - Alignment and Safety

A complete guide to AI alignment, RLHF, Constitutional AI, DPO, red teaming, jailbreaks, safety evaluations, and the global regulatory landscape.

Module 17 - Embeddings Engineering

A complete guide to embeddings - models, evaluation (MTEB), fine-tuning, Matryoshka embeddings, quantization, multimodal embeddings, and production pipelines.

Module 5: LLM Agents - Overview

LLM agents as autonomous systems that reason, plan, and act using tools, memory, and multi-agent coordination.

Monte Carlo Tree Search for LLM Reasoning

Adapting MCTS to language model reasoning - selection, expansion, simulation, backpropagation over reasoning steps, AlphaCode 2, Tree of Thoughts, and production trade-offs.

MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games

We present a scalable methodology for evaluating language models in multi-turn interactions, using a suite of collaborative games that require effective...

MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations

We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval augmented generation, a popular use of large language models. We...

Multi-Agent Architectures

Building systems where multiple specialized LLM agents collaborate through orchestrator-worker, pipeline, and peer-to-peer patterns using LangGraph and CrewAI.

Multi-Aspect Knowledge Distillation for Language Model with Low-rank Factorization

Knowledge distillation is an effective technique for pre-trained language model compression. However, existing methods only focus on the knowledge distr...

Multi-Head Attention

How multi-head attention enables transformers to jointly attend to information from multiple representation subspaces simultaneously.

Multimodal Embeddings

CLIP, SigLIP, ImageBind, ColPali, and CLAP - embedding images, text, audio, and documents in shared vector spaces for cross-modal search and zero-shot classification.

Multimodal Open Source Models

How open-source vision-language models work - from CLIP vision encoders and projection layers to LLaVA, InternVL2, and LLaMA 3.2 Vision - and how to deploy them for document understanding, OCR, and visual reasoning in production.

Multimodal RAG

How to build retrieval-augmented generation systems that can retrieve and reason over images, PDFs with figures, slides, and mixed-media documents.

Neuron-Aware Data Selection In Instruction Tuning For Large Language Models

Instruction Tuning (IT) has been proven to be an effective approach to unlock the powerful capabilities of large language models (LLMs). Recent studies...

No Universal Courtesy: A Cross-Linguistic, Multi-Model Study of Politeness Effects on LLMs Using the PLUM Corpus

This paper explores the response of Large Language Models (LLMs) to user prompts with different degrees of politeness and impoliteness. The Politeness T...

NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches

We introduce NOBLE (Nonlinear lOw-rank Branch for Linear Enhancement), an architectural augmentation that adds nonlinear low-rank branches to transforme...

Observability for LLM Apps

Build production observability for LLM applications - distributed tracing, quality metrics, cost attribution, prompt versioning, and drift detection using LangSmith, Langfuse, and Helicone.

On the Proper Treatment of Units in Surprisal Theory

Surprisal theory links human processing effort to the predictability of an upcoming linguistic unit, but empirical work often leaves the notion of a uni...

On the Rejection Criterion for Proxy-based Test-time Alignment

Recent works proposed test-time alignment methods that rely on a small aligned model as a proxy that guides the generation of a larger base (unaligned)...

Open Source Models

Run, fine-tune, quantize, evaluate, and deploy open source LLMs in production - the complete hands-on guide for engineers who want to own their models.

OpenAI Embeddings and API-Based Embedding Services

text-embedding-3, Matryoshka training, Voyage AI, Cohere Embed, cost analysis, batch processing patterns, and when to choose API vs self-hosted embeddings.

OpenAI o1 and o3 - Architecture and Training

What we know about OpenAI's o1 and o3 reasoning models - hidden chain-of-thought, reinforcement learning from process rewards, compute budget tokens, and ARC-AGI results.

Optimizing Korean-Centric LLMs via Token Pruning

This paper presents a systematic benchmark of state-of-the-art multilingual large language models (LLMs) adapted via token pruning - a compression techn...

Outlines - Grammar-Constrained Generation

A complete guide to the Outlines library - Pydantic schema to FSM, regex constraints, JSON schema constraints, vLLM integration, and production deployment patterns with guaranteed output conformance.

Perplexity and Language Model Metrics

Understand perplexity, cross-entropy, bits per byte, and when intrinsic metrics mislead you about model quality.

Phi and Small Language Models

Microsoft's Phi model family - the textbook-quality data hypothesis, how 1-4B models can match much larger ones on reasoning tasks, and the design principles behind efficient small language models.

Planning and Reasoning

How LLM agents handle complex multi-step tasks through plan-and-execute, hierarchical planning, self-reflection, and LangGraph-based workflows.

Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection

Constructing computer-aided design (CAD) models is labor-intensive but essential for engineering and manufacturing. Recent advances in Large Language Mo...

PONTE: Personalized Orchestration for Natural Language Trustworthy Explanations

Explainable Artificial Intelligence (XAI) seeks to enhance the transparency and accountability of machine learning systems, yet most methods follow a on...

Position: Vector Prompt Interfaces Should Be Exposed to Enable Customization of Large Language Models

As large language models (LLMs) transition from research prototypes to real-world systems, customization has emerged as a central bottleneck. While text...

Positional Encoding

How positional encodings inject sequence order information into transformers - from sinusoidal to RoPE.

Predicting States of Understanding in Explanatory Interactions Using Cognitive Load-Related Linguistic Cues

We investigate how verbal and nonverbal linguistic features, exhibited by speakers and listeners in dialogue, can contribute to predicting the listener'...

Preference Packing: Efficient Preference Optimization for Large Language Models

Resource-efficient training optimization techniques are becoming increasingly important as the size of large language models (LLMs) continues to grow. I...

Preference-Aware Rubric Learning for Personalized Evaluation

As Large Language Models (LLMs) evolve from general-purpose assistants to user-centric agents, personalization has become central to aligning model beha...

Pretraining at Scale

The infrastructure, parallelism strategies, memory optimizations, and training data choices required to pretrain large language models on thousands of GPUs.

PRISM: LLM-Guided Semantic Clustering for High-Precision Topics

In this paper, we propose Precision-Informed Semantic Modeling (PRISM), a structured topic modeling framework combining the benefits of rich representat...

PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforc...

Process Reward Models (PRMs)

How process reward models provide step-level supervision for reasoning - the Lightman et al. 2023 paper, Math-Shepherd, using PRMs for search, and their limitations.

Production Monitoring for LLMs

Build a comprehensive production monitoring stack for LLMs - latency, cost, quality drift, safety, and observability platforms compared.

Production Multimodal Systems

Build and operate multimodal AI pipelines at production scale - image preprocessing, cost control, VLM hallucination mitigation, caching, security, and observability for vision-language workloads.

Prompt Injection and Security

Understand how prompt injection attacks work, why they're hard to defend against, and how to build LLM systems that are resistant to manipulation.

Prompt Optimization and DSPy

Move beyond manual prompt engineering to automated, evaluation-driven optimization - using APE, OPRO, and DSPy to build LLM pipelines that improve themselves.

Prompt Templates in Python

Building maintainable prompt systems in Python - template engines, versioning, testing prompts, few-shot construction, and prompt injection defense.

QLoRA: Quantized Low-Rank Adaptation

Learn how QLoRA combines 4-bit quantization with LoRA to fine-tune 65B parameter models on a single 48 GB GPU, using NF4 quantization, double quantization, and paged optimizers.

Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody

While second language (L2) learners may acquire target syntactic word order, mapping this syntax onto appropriate prosodic structures remains a persiste...

Quantization: INT8 and INT4

Master LLM quantization techniques - from LLM.int8() to GPTQ and AWQ - to run large models on commodity hardware without unacceptable quality loss.

Qwen, DeepSeek, and International Models

Alibaba's Qwen and DeepSeek architectural innovations - Multi-head Latent Attention (MLA), DeepSeekMoE, multi-token prediction, and how Chinese labs are advancing open-source LLM research.

RAG Evaluation

Build rigorous RAG evaluation with RAGAS, TruLens, LLM-as-judge, golden datasets, and production monitoring - measure faithfulness, relevance, and groundedness.

RAG Evaluation Metrics

Evaluate RAG systems with precision - the RAG triad, RAGAS framework, golden datasets, and retrieval metrics for production pipelines.

RAG vs Long Context - When to Use Each

A rigorous cost, latency, and accuracy comparison of retrieval-augmented generation versus long-context stuffing, with decision frameworks for production use cases.

ReAct Agent Pattern

Building LLM agents that interleave reasoning traces and actions in a ReAct loop to solve multi-step tasks with tool grounding.

ReAct Pattern

Learn how to build LLM agents that reason and act by interleaving thought and tool calls - the architectural pattern behind many modern AI assistants.

ReAct: Synergizing Reasoning and Acting in Language Models

Engineering breakdown of the ReAct paper (Yao et al., 2022) - a foundation for many of today's AI agents. Plain English, production viability rating, implementation notes.

Reasoning Gets Harder for LLMs Inside A Dialogue

Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that diffe...

RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval

We propose RecaLLM, a set of reasoning language models post-trained to make effective use of long-context information. In-context retrieval, which ident...

Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the complex reasoning capabilities of Large Reaso...

Red Teaming LLMs

Systematic adversarial evaluation of language models - manual red teaming, automated red teaming with LLMs, failure taxonomies, and building a production red team process.

Reliability Gated Multi-Teacher Distillation for Low Resource Abstractive Summarization

We study multiteacher knowledge distillation for low resource abstractive summarization from a reliability aware perspective. We introduce EWAD (Entropy...

Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding

Large language models (LLMs) have revolutionized Text-to-SQL generation, allowing users to query structured data using natural language with growing eas...

Reliable Multilingual Orthopedic Decision Support from Clinical Narratives: Language-Aware Adaptation and Verification-Guided Deferral

Multilingual orthopedic decision support remains challenging in low-resource healthcare settings, where clinical narratives contain specialized terminol...

Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling

Recent research has shown that filtering massive English web corpora into high-quality subsets significantly improves training efficiency. However, for...

Reranking

Master the two-stage retrieval-reranking architecture - cross-encoders, ColBERT, LLM-as-reranker, Reciprocal Rank Fusion, and production latency budgets.

Research Roadmap: RLHF & Alignment

From InstructGPT to DPO to ORPO. Read the 7 most important alignment papers in order - understanding how LLMs are made to follow human intent.

Research Roadmap: The Evolution of AI Agents

From Chain-of-Thought to production agent architectures. Read the 9 most important agent papers in order - with full engineering context between each one.

Research Roadmap: The Evolution of Multimodal AI

From CLIP to GPT-4V to Gemini. Read the 9 most important multimodal AI papers in order - understanding how vision and language were unified.

Rethinking Multiple-Choice Questions for RLVR: Unlocking Potential via Distractor Design

Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capabilities of Large Language Models. When applied to RLVR,...

Retrieval Algorithms and ANN

Master the approximate nearest neighbor algorithms powering vector search - HNSW, IVF, IVF-PQ, ScaNN, and DiskANN with parameter tuning and recall-latency trade-offs.

Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG

Retrieval-augmented generation (RAG) is a common way to ground language models in external documents and up-to-date information. Classical retrieval sys...

ReViSQL: Achieving Human-Level Text-to-SQL

Translating natural language to SQL (Text-to-SQL) is a critical challenge in both database research and data analytics applications. Recent efforts have...

RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models

Reward models are central to aligning large language models (LLMs) with human preferences. Yet most approaches rely on pointwise reward estimates that o...

RLHF Deep Dive

A complete technical walkthrough of Reinforcement Learning from Human Feedback - the three-phase pipeline, reward models, PPO, KL penalty, and the limitations that led to newer approaches.

RLHF: Reinforcement Learning from Human Feedback

Understand how RLHF aligns LLMs with human preferences through three phases - SFT, reward model training, and PPO - and why it produced InstructGPT's surprising result that smaller aligned models beat larger unaligned ones.

RoPE and ALiBi - Positional Encoding for Long Context

How Rotary Position Embedding encodes relative positions through complex-plane rotations, why ALiBi achieves length extrapolation with linear biases, and why RoPE became the dominant approach for long-context models.

Router Mechanisms - How Tokens Get Assigned to Experts

The algorithms that decide which experts process which tokens - linear routing, expert choice, auxiliary load balancing loss, noisy top-k gating, and the Switch Transformer approach.

RouterKGQA: Specialized-General Model Routing for Constraint-Aware Knowledge Graph Question Answering

Knowledge graph question answering (KGQA) is a promising approach for mitigating LLM hallucination by grounding reasoning in structured and verifiable k...

RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution

Humans solve problems by executing targeted plans, yet large language models (LLMs) remain unreliable for structured workflow execution. We propose RunA...

Safety and Bias Evaluation

Evaluate LLMs for harmful outputs, social bias, hallucination, and jailbreak vulnerability - including red teaming methodology and production monitoring.

SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement

Recursive self-improvement is moving from theory to practice: modern systems can critique, revise, and evaluate their own outputs, yet iterative self-mo...

Sampling Strategies: Temperature, Top-K, Top-P

Master the sampling algorithms that control LLM output diversity - from greedy decoding to nucleus sampling - and learn when to use each in production.

SC-Taxo: Hierarchical Taxonomy Generation under Semantic Consistency Constraints using Large Language Models

Scientific literature is expanding at an unprecedented pace, making it increasingly challenging to efficiently organize and access domain knowledge. A h...

Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning

The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior ste...

Scaling Laws

Empirical power-law relationships between LLM performance and compute, data, and parameters - from Kaplan (2020) to Chinchilla (2022) and beyond.

Self-Attention Mechanism

How self-attention computes query, key, and value interactions to capture long-range dependencies between tokens.

Self-Distilled RLVR

On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide...

Semantic Invariance in Agentic AI

Large Language Models (LLMs) increasingly serve as autonomous reasoning agents in decision support, scientific problem-solving, and multi-agent coordina...

Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models

Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks. However, the truthfulness of their outputs is not guarantee...

Semantic Triplet Restoration: A Novel Protocol for Hierarchical Table Understanding in Large Language Models

Table question answering requires models to recover semantic relations encoded implicitly by two-dimensional layout, merged cells, and hierarchical head...

Sentiment Analysis of German Sign Language Fairy Tales

We present a dataset and a model for sentiment analysis of German sign language (DGS) fairy tales. First, we perform sentiment analysis for three levels...

SLERP - Spherical Linear Interpolation

How spherical linear interpolation provides smoother, more geometrically sound blending between two model weight configurations than simple linear averaging.

SongSong: A Time Phonograph for Chinese SongCi Music from Thousand of Years Away

Recently, there have been significant advancements in music generation. However, existing models primarily focus on creating modern pop songs, making it...

Sparse vs Dense Models - Trade-offs

Why MoE gives more capacity per FLOP than dense models, the memory vs. compute trade-off, training efficiency, inference complexity, and when to choose each architecture.

SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and exec...

Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning

Automatic speech recognition (ASR) has benefited from advances in pretrained speech and language models, yet most systems remain constrained to monoling...

Speculative Decoding

Learn how speculative decoding uses a small draft model to generate tokens that a large target model verifies in parallel, achieving 2-3x speedup with no quality loss.

Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

Large Language Models (LLMs) are increasingly used as proxies for human perception in urban analysis, yet it remains unclear whether persona prompting p...

State Space Model Foundations

How control theory's state space models became a competitive sequence modeling architecture - continuous-time SSMs, the S4 paper, HiPPO initialization, and the convolutional/recurrent duality.

StoryScope: Investigating idiosyncrasies in AI fiction

As AI-generated fiction becomes increasingly prevalent, questions of authorship and originality are becoming central to how written work is evaluated. W...

Streaming LLM Responses

Streaming LLM output in Python - server-sent events, async generators, FastAPI streaming endpoints, and building real-time chat UIs.

Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation

Long conversations with an AI agent create a simple problem for one user: the history is useful, but carrying it verbatim is expensive. We study persona...

Structured Generation in Production

Production-grade architecture for structured generation pipelines - reliability stacks, schema versioning, monitoring, async batching, caching, edge case handling, and complete reference implementations.

Structured Output and JSON Mode

Reliably extract structured data from LLMs using JSON mode, function calling, Pydantic validation, and constrained decoding - the backbone of production LLM pipelines.

Supervised Fine-Tuning

Learn how to adapt pretrained LLMs to specific tasks through supervised fine-tuning - data preparation, hyperparameters, catastrophic forgetting, and evaluation.

SwanNLP at SemEval-2026 Task 5: An LLM-based Framework for Plausibility Scoring in Narrative Word Sense Disambiguation

Recent advances in language models have substantially improved Natural Language Understanding (NLU). Although widely used benchmarks suggest that Large...

Synthetic Computers at Scale for Long-Horizon Productivity Simulation

Realistic long-horizon productivity work is strongly conditioned on user-specific computer environments, where much of the work context is stored and or...

System Prompts and Context Design

Master the architecture of LLM conversations - how to design system prompts, manage context windows, and build production-grade context management systems.

Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation

Modern optimizers like Adam and Muon are central to training large language models, but their reliance on first- and second-order momenta introduces sig...

Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis

Large language models (LLMs) with reasoning capabilities have fueled a compelling narrative that reasoning universally improves performance across langu...

Task-Centric Acceleration of Small-Language Models

Small language models (SLMs) have emerged as efficient alternatives to large language models for task-specific applications. However, they are often emp...

TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning

Traditional vision-language models struggle with contrastive fine-grained taxonomic reasoning, particularly when distinguishing between visually similar...

Tensor and Pipeline Parallelism

Learn how tensor parallelism splits weight matrices across GPUs and pipeline parallelism splits model layers, enabling inference and training of models too large for a single GPU.

Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek

This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT) for Ancient Greek (AG)...

Test-Time Compute - Scaling at Inference

The paradigm shift from training-time scaling to inference-time scaling - best-of-N sampling, majority voting, and how spending more compute at inference improves reasoning quality.

The Alignment Problem

Why making AI systems do what we actually want is harder than it looks - the specification problem, Goodhart's Law, reward hacking, and outer vs inner alignment.

The Art That Poses Back: Assessing AI Pastiches after Contemporary Artworks

This study explores artificial visual creativity, focusing on ChatGPT's ability to generate new images intentionally pastiching original artworks such a...

The Challenge of Attention at Long Contexts

Why attention is O(n²) in memory and compute, how the KV cache grows with context length, and how FlashAttention solves the IO bottleneck without changing the algorithm.

The Company You Keep: How LLMs Respond to Dark Triad Traits

Large Language Models (LLMs) often exhibit highly agreeable and reinforcing conversational styles, also known as AI-sycophancy. Although this behavior i...

The EpisTwin: A Knowledge Graph-Grounded Neuro-Symbolic Architecture for Personal AI

Personal Artificial Intelligence is currently hindered by the fragmentation of user data across isolated silos. While Retrieval-Augmented Generation off...

TIES Merging - Resolving Sign Conflicts

How TIES-Merging eliminates task interference by trimming small deltas, electing signs by majority vote, and merging only aligned parameters.

Token Counting and Context Management

Tiktoken, tokenization internals, context window management, sliding window strategies, and building cost-aware LLM applications.

Tokenization Deep Dive

How tokenizers convert raw text to token IDs - BPE from scratch, WordPiece, byte-level BPE, and the surprising ways tokenization breaks models.

Tool Use and Function Calling

Enabling LLMs to invoke external tools and APIs through structured function calling, covering JSON schema design, Anthropic vs OpenAI formats, parallel tool calls, and production safety.

Tool Use from Python

Building LLM tool use systems in Python - function calling, tool schemas, execution loops, error handling, and multi-step agent patterns.

TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering

Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by extracting information or simple aggregation....

Toward Automatic Filling of Case Report Forms: A Case Study on Data from an Italian Emergency Department

Case Report Forms (CRFs) collect data about patients and are at the core of well-established practices to conduct research in clinical settings. With th...

Toward Guarantees for Clinical Reasoning in Vision Language Models via Formal Verification

Vision-language models (VLMs) show promise in drafting radiology reports, yet they frequently suffer from logical inconsistencies, generating diagnostic...

Training MoE Models

How to train Mixture of Experts models at scale - expert parallelism, capacity factors, token dropping, load imbalance, training instability, and the GShard approach to 600B parameters.

Transparent AI for Mathematics: Transformer-Based Large Language Models for Mathematical Entity Relationship Extraction with XAI

Mathematical text understanding is a challenging task due to the presence of specialized entities and complex relationships between them. This study for...

Tree-of-Thought Prompting

Explore multiple reasoning paths simultaneously using Tree-of-Thought - the technique that enables LLMs to backtrack, evaluate alternatives, and solve problems that defeat linear chain-of-thought.

UIPress: Bringing Optical Token Compression to UI-to-Code Generation

UI-to-Code generation requires vision-language models (VLMs) to produce thousands of tokens of structured HTML/CSS from a single screenshot, making visu...

Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume

Despite their capabilities, Multimodal Large Language Models (MLLMs) may produce plausible but erroneous outputs, hindering reliable deployment. Accurat...

UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

Semantic speech tokenizers have become a widely used interface for Audio-LLMs, owing to their compact single-codebook design and strong linguistic align...

Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often s...

Universal statistical laws governing culinary design

Cooking is a cultural expression of human creativity that transcends geography and time through the orchestration of ingredients and techniques, much li...

Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control

We present a method to identify a valence-arousal (VA) subspace within large language model representations. From 211k emotion-labeled texts, we derive...

Vector Databases

Compare Pinecone, Qdrant, Weaviate, Milvus, Chroma, and pgvector - understand the engineering trade-offs and build a production vector store.

VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking

Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still heavily rely on greedy parsing over densely...

Vision-Language Models

How modern AI systems combine vision encoders with language models to understand and reason about images.

Vision-Language Models Suppress Female Representations Under Ambiguous Input

Alignment teaches vision-language models (VLMs) to avoid expressing demographic biases, and when gender is clearly visible they largely succeed. Far les...

VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contrib...

VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certain...

vLLM and Inference Servers

Learn how production inference servers like vLLM, TGI, TensorRT-LLM, and Ollama combine PagedAttention, continuous batching, and optimized kernels to serve LLMs at scale.

What Am I Missing? Question-Answering as Hidden State Probing

Test-time reasoning has become a significant field of study since the introduction of chain-of-thought reasoning in large language models (LLMs). Howeve...

What Are Embeddings and Why They Matter

The fundamental concept of embeddings - mapping meaning to geometric space, cosine similarity, Word2Vec, the king-queen analogy, and why dense retrieval replaced keyword search.

What Gets Unmasked First? Trajectory Analysis of Diffusion Models for Graph-to-Text Generation

We present the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation. We analyze MDLM generation trajectories...

When Contextual Inference Fails: Cancelability in Interactive Instruction Following

We investigate the separation of literal interpretation from contextual inference in a collaborative block-building task where a builder must resolve un...

When Do Language Models Endorse Limitations on Human Rights Principles?

As Large Language Models (LLMs) increasingly mediate global information access with the potential to shape public discourse, their alignment with univer...

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithf...

When RAG Chatbots Expose Their Backend: An Anonymized Case Study of Privacy and Security Risks in Patient-Facing Medical AI

Background: Patient-facing medical chatbots based on retrieval-augmented generation (RAG) are increasingly promoted to deliver accessible, grounded heal...

When to Use Reasoning Models in Production

A practical decision framework for routing tasks to reasoning models - task taxonomy, cost-benefit analysis, latency trade-offs, and hybrid routing architectures.

When to Use SSMs in Production

A practical deployment guide: use cases where SSMs win, the streaming inference pattern, model availability on HuggingFace, fine-tuning SSMs, and an outlook.

Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-righ...

Why Model Merging Exists

The catastrophic forgetting problem, why naive ensembles are too expensive, and the surprising geometric insight that makes model merging possible.

Why RAG and When Not To

Understand why LLMs hallucinate, what RAG actually solves, and the decision framework for choosing RAG vs fine-tuning vs prompt stuffing.

Why Structured Output Matters in Production

The taxonomy of LLM output failures, why prompt-based JSON extraction breaks at scale, the production impact of 5% failure rates, and the spectrum of solutions from prompt engineering to constrained decoding.

Working with 128K+ Context Windows in Production

A complete production engineering guide for building applications with long-context LLMs - model selection, cost management, prompt structure, multi-turn conversation, and memory-augmented systems.

World Properties without World Models: Recovering Spatial and Temporal Structure from Co-occurrence Statistics in Static Word Embeddings

Recent work interprets the linear recoverability of geographic and temporal variables from large language model (LLM) hidden states as evidence for worl...

You Can't Fight in Here! This is BBS!

Norm, the formal theoretical linguist, and Claudette, the computational language scientist, have a lovely time discussing whether modern language models...

Zero-Shot Prompting

Learn how to elicit reliable behavior from LLMs using only instructions - no examples required - by mastering prompt anatomy, role personas, and format control.