2689 docs tagged with "deep-dive"

__init__ and Object Construction - Two-Phase Creation at Engineering Depth

Understand how Python actually constructs objects - the difference between __new__ and __init__, two-phase creation, mutable default argument traps, super().__init__() in inheritance chains, and factory patterns with classmethods.

__init_subclass__ - The Modern Alternative to Metaclasses

Master __init_subclass__ for subclass registration, definition-time validation, plugin registries, and keyword arguments in class statements - the Pythonic replacement for most metaclass use cases.

__set_name__ - The Descriptor Naming Protocol

Understand __set_name__, Python's descriptor self-naming protocol - how it eliminates name redundancy, how type.__new__ calls it, and how Django, Pydantic, and SQLAlchemy use it to build self-configuring field systems.

01 - Agent Risk Taxonomy

Eight categories of agent risk, the confused deputy problem, severity matrices, and a Python risk assessment module.

01 - Task Decomposition

How agents break complex goals into ordered, dependency-tracked subtasks. Hierarchical decomposition, DAG representation, dynamic replanning, and full Python implementation.

02 - Minimal Footprint Principle

Least privilege, reversibility preference, scope confirmation, and a Python minimal-footprint agent wrapper.

02 - Planning with LLMs

Zero-shot, chain-of-thought, Tree of Thoughts, ReWOO, and MCTS-guided planning. When LLM plans fail and how to recover. Full Python implementation of Tree of Thoughts.

03 - Checkpointing and Recovery

How to save agent state mid-run, resume after failures, design idempotent actions, and build production-grade checkpoint systems with SQLite and S3.

03 - Prompt Injection in Agents

Indirect prompt injection attacks, real-world examples, detection and defense strategies, and a Python injection defense system.

04 - Guardrails and Action Validation

Pre- and post-action guardrails, composable validators, denylist enforcement, rate limiting, and a complete Python guardrail pipeline.

04 - Handling Ambiguity and Clarification

How agents detect ambiguous instructions, decide when to ask vs. proceed, design targeted clarification questions, and avoid the overly-cautious anti-pattern.

05 - Interruption and Human-in-the-Loop

When and how agents pause for human judgment. Action classification, async approval workflows, Slack-based HITL, and resuming after interruption.

06 - Evaluation of Long-Horizon Tasks

How to evaluate multi-step agent trajectories. Task completion, path quality, error recovery, efficiency, and LLM-as-judge. Benchmarks and trajectory scorers.

3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance

Hierarchical Vision-Language-Action (VLA) models decouple high-level planning from low-level control to improve generalization in robot manipulation. Re...

3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

Large multimodal models are increasingly used as the reasoning core of embodied agents operating in 3D environments, yet they remain prone to hallucinat...

3DTCR: A Physics-Based Generative Framework for Vortex-Following 3D Reconstruction to Improve Tropical Cyclone Intensity Forecasting

Tropical cyclone (TC) intensity forecasting remains challenging as current numerical and AI-based weather models fail to satisfactorily represent extrem...

3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis

Real-time free-viewpoint rendering requires balancing multi-camera redundancy with the latency constraints of interactive applications. We address this...

4D Human-Scene Reconstruction from Low-Overlap Captures

Existing volumetric capture of dynamic human performance achieves high fidelity with dense camera arrays. However, in real-world scenarios, only a handf...

4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-...

A 1/R Law for Kurtosis Contrast in Balanced Mixtures

Kurtosis-based Independent Component Analysis (ICA) weakens in wide, balanced mixtures. We prove a sharp redundancy law: for a standardized projection w...

A Bayesian Updating Framework for Long-term Multi-Environment Trial Data in Plant Breeding

In variety testing, multi-environment trials (MET) are essential for evaluating the genotypic performance of crop plants. A persistent challenge in the...

A Benchmark for Interactive World Models with a Unified Action Generation Framework

Achieving Artificial General Intelligence (AGI) requires agents that learn and interact adaptively, with interactive world models providing scalable env...

A Constrained RL Approach for Cost-Efficient Delivery of Latency-Sensitive Applications

Next-generation networks aim to provide performance guarantees to real-time interactive services that require timely and cost-efficient packet delivery....

A Dataset is Worth 1 MB

A dataset server must often distribute the same large payload to many clients, incurring massive communication costs. Since clients frequently operate o...

A Dirac-Frenkel-Onsager principle: Instantaneous residual minimization with gauge momentum for nonlinear parametrizations of PDE solutions

Dirac-Frenkel instantaneous residual minimization evolves nonlinear parametrizations of PDE solutions in time, but ill-conditioning can render the param...

A distributed semismooth Newton based augmented Lagrangian method for distributed optimization

This paper proposes a novel distributed semismooth Newton based augmented Lagrangian method for solving a class of optimization problems over networks,...

A Federated Many-to-One Hopfield model for associative Neural Networks

Federated learning enables collaborative training without sharing raw data, but struggles under client heterogeneity and streaming distribution shifts,...

A Foundation Model for Zero-Shot Logical Rule Induction

Inductive Logic Programming (ILP) learns interpretable logical rules from data. Existing methods are transductive: their learned parameters are bound to...

A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce a deterministic prediction that i...

A Hybrid Approach for Closing the Sim2real Appearance Gap in Game Engine Synthetic Datasets

Video game engines have been an important source for generating large volumes of visual synthetic datasets for training and evaluating computer vision a...

A Learning-based Multi-Frame Visual Feature Framework for Real-Time Driver Fatigue Detection.

A Learning-based Multi-Frame Visual Feature Framewor... - published at NAACL 2025.

A multimodal slice discovery framework for systematic failure detection and explanation in medical image classification

Despite advances in machine learning-based medical image classifiers, the safety and reliability of these systems remain major concerns in practical set...

A New Kernel Regularity Condition for Distributed Mirror Descent: Broader Coverage and Simpler Analysis

Existing convergence analyses of distributed optimization methods in non-Euclidean geometries typically rely on kernel assumptions: (i) global Lipschitz smoothne...

A Note on How to Remove the $\ln\ln T$ Term from the Squint Bound

In Orabona and Pál [2016], we introduced the shifted KT potentials, to remove the $\ln \ln T$ factor in the parameter-free learning with expert bound. I...

A note on the area under the likelihood and the fake evidence for model selection

Improper priors are not allowed for the computation of the Bayesian evidence $Z=p({\bf y})$ (a.k.a., marginal likelihood), since in this case $Z$ is not...

A Novel Computational Framework for Causal Inference: Tree-Based Discretization with ILP-Based Matching

Causal inference is essential for data-driven decision-making, as it aims to uncover causal relationships from observational data. However, identifying...

A novel hybrid approach for positive-valued DAG learning

Causal discovery from observational data remains a fundamental challenge in machine learning and statistics, particularly when variables represent inher...

A Practical Analysis of Human Alignment with *PO.

A Practical Analysis of Human Alignment with *PO. - published at NAACL 2025.

A Predictive View on Streaming Hidden Markov Models

We develop a predictive-first optimisation framework for streaming hidden Markov models. Unlike classical approaches that prioritise full posterior reco...

A Proper Scoring Rule for Virtual Staining

Generative virtual staining (VS) models for high-throughput screening (HTS) can provide an estimated posterior distribution of possible biological featu...

A Quantitative Characterization of Forgetting in Post-Training

Continual post-training of generative models is widely used, yet a principled understanding of when and why forgetting occurs remains limited. We develo...

A Quantized Native Runtime for On-Device Semantic Audio Generation

Semantic audio applications increasingly require controllable generation on commodity and embedded hardware rather than through framework-heavy datacent...

A recipe for scalable attention-based MLIPs: unlocking long-range accuracy with all-to-all node attention

Machine-learning interatomic potentials (MLIPs) have advanced rapidly, with many top models relying on strong physics-based inductive biases. However, a...

A Reference Architecture of Reinforcement Learning Frameworks

The surge in reinforcement learning (RL) applications gave rise to diverse supporting technology, such as RL frameworks. However, the architectural patt...

A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression

As model capabilities advance, research has increasingly shifted toward long-horizon, multi-turn terminal-centric agentic tasks, where raw environment f...

A Semantic-Aware Layer-Freezing Approach to Computation-Efficient Fine-Tuning of Language Models.

A Semantic-Aware Layer-Freezing Approach to Computat... - published at ACL 2025.

A Sovereign, Open-Source Foundation Model for German and English

We present Soofi S 30B-A3B, a sovereign, open-source Mixture-of-Experts (MoE) hybrid Mamba Transformer foundation model for German and English. Its hybr...

A Sparse and Truncated State Vector Simulator for Peaked Circuits

In a class of quantum circuits known as peaked circuits, the goal is to predict the most probable bit string at the output of the circuit. Since these c...

A Stein Identity for q-Gaussians with Bounded Support

Stein's identity is a fundamental tool in machine learning with applications in generative models, stochastic optimization, and other problems involving...

A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning

As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities...

A Temporally Augmented Graph Attention Network for Affordance Classification

Graph attention networks (GATs) provide one of the best frameworks for learning node representations in relational data, but existing variants such as...

A Theory of Contrastive Learning with Natural Images

Why does contrastive learning with simple images and augmentations yield useful representations for downstream tasks? We address this question by analyt...

A theory of learning data statistics in diffusion models, from easy to hard

While diffusion models have emerged as a powerful class of generative models, their learning dynamics remain poorly understood. We address this issue fi...

A Tight Theory of Error Feedback Algorithms in Distributed Optimization

Communication costs are a major bottleneck in distributed learning and first-order optimization. A common approach to alleviate this issue is to compres...

A Training-free LLM-based Approach to General Chinese Character Error Correction.

A Training-free LLM-based Approach to General Chines... - published at ACL 2025.

A Tsetlin Machine-driven Intrusion Detection System for Next-Generation IoMT Security

The rapid adoption of the Internet of Medical Things (IoMT) is transforming healthcare by enabling seamless connectivity among medical devices, systems,...

A two-step sequential approach for hyperparameter selection in finite context models

Finite-context models (FCMs) are widely used for compressing symbolic sequences such as DNA, where predictive performance depends critically on the cont...

A unified perspective on fine-tuning and sampling with diffusion and flow models

We study the problem of training diffusion and flow generative models to sample from target distributions defined by an exponential tilting of a base de...

A Variational Estimator for $L_p$ Calibration Errors

Calibration - the problem of ensuring that predicted probabilities align with observed class frequencies - is a basic desideratum for reliable ML prediction.

A^2RD: Agentic Autoregressive Diffusion for Long Video Consistency

Synthesizing consistent and coherent long video remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over...

A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to ev...

AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research.

AbGen: Evaluating Large Language Models in Ablation... - published at ACL 2025.

ABot-AgentOS: A General Robotic Agent OS with Lifelong Multi-modal Memory

Recent VLM and VLA systems have improved robotic perception and action prediction, yet long-horizon embodied agents still require a general runtime laye...

ABot-N1: Toward a General Visual Language Navigation Foundation Model

Visual Language Navigation foundation models aim to unify deep reasoning for grounded spatial decisions with broad versatility for diverse embodied task...

Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL

Reinforcement fine-tuning improves the reasoning ability of large language models, but it can also encourage them to answer unanswerable queries by gues...

Abstract Base Classes - Enforcing Interfaces at Engineering Depth

Master Python's ABC system - abc.ABC, @abstractmethod, ABCMeta, virtual subclasses via register(), collections.abc built-in protocols, using ABCs in type hints, and the ABCs vs typing.Protocol trade-off.

AcademiClaw: When Students Set Challenges for AI Agents

Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw...

Accelerating Speculative Decoding with Block Diffusion Draft Trees

Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model...

Accurate and Efficient Hybrid-Ensemble Atmospheric Data Assimilation in Latent Space with Uncertainty Quantification

Data assimilation (DA) combines model forecasts and observations to estimate the optimal state of the atmosphere with its uncertainty, providing initial...

Accurate and Reliable Uncertainty Estimates for Deterministic Predictions: Extensions to Under and Overpredictions

Computational models support high-stakes decisions across engineering and science, and practitioners increasingly seek probabilistic predictions to quan...

Accurate and scalable exchange-correlation with deep learning

Density Functional Theory (DFT) underpins much of modern computational chemistry and materials science. Yet, the reliability of DFT-derived predictions...

Accurate, Interdisciplinary and Transparent Structure-property Understanding with Deep Native Structural Reasoning

Structure-property relationships are foundational to biology, chemistry and materials science, where function, reactivity and physical response emerge f...

ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either...

Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

Recent progress in reasoning models has substantially advanced long-horizon mathematical and scientific problem solving, with several systems now reachi...

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a pro...

Action Images: End-to-End Policy Learning via Multiview Video Generation

World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model the f...

Activation Functions

Complete guide to activation functions - sigmoid saturation proofs, dying ReLU mechanics, GELU/Swish/SiLU for modern transformers, PReLU, ELU, SELU, Mish, and a full selection guide with NumPy and PyTorch implementations.

Active Bipartite Ranking with Smooth Posterior Distributions

In this article, bipartite ranking, a statistical learning problem involved in many applications and widely studied in the passive context, is approache...

Active Few-Shot Learning for Text Classification.

How to intelligently select which examples to annotate when you only have a handful of labeled samples per class. Combines active learning with few-shot text classification to minimize annotation cost - directly applicable to intent detection, content moderation, and domain-specific NLP tasks.

Active Learning

Selecting the most informative samples for labeling - uncertainty sampling, diversity strategies, query-by-committee, and LLM-based active learning for text classification.

Ad Click Prediction at Scale

End-to-end design of a production ad click prediction system - covering Wide and Deep learning, feature engineering at scale, online learning, calibration, and serving under 10ms.

AdaCubic: An Adaptive Cubic Regularization Optimizer for Deep Learning

A novel regularization technique, AdaCubic, is proposed that adapts the weight of the cubic term. The heart of AdaCubic is an auxiliary optimization pro...

Adam's Law: Textual Frequency Law on Large Language Models

While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom st...

Adaptive Combinatorial Experimental Design: Pareto Optimality for Decision-Making and Inference

In this paper, we provide the first investigation into adaptive combinatorial experimental design, focusing on the trade-off between regret minimization...

Adaptive Conditional Forest Sampling for Spectral Risk Optimisation under Decision-Dependent Uncertainty

Minimising a spectral risk objective, defined as a convex combination of expected cost and Conditional Value-at-Risk (CVaR), is challenging when the unc...

Adaptive Learning Systems

Learn how adaptive learning systems model student knowledge state and sequence educational content using IRT, CAT, spaced repetition, and multi-armed bandits to maximize learning outcomes.

Adaptive multi-fidelity optimization with fast learning rates

In multi-fidelity optimization, biased approximations of varying costs of the target function are available. This paper studies the problem of optimizin...

Adaptive Querying with AI Persona Priors

We study adaptive querying for learning user-dependent quantities of interest, such as responses to held-out items and psychometric indicators, within t...

AdaState: Self-Evolving Anchors for Streaming Video Generation

Autoregressive video diffusion models generate streaming video by producing frames sequentially, conditioning each chunk on previously generated content...

ADD for Multi-Bit Image Watermarking

As generative models enable rapid creation of high-fidelity images, societal concerns about misinformation and authenticity have intensified. A promisin...

Advanced Event Loop

Master event loop internals including selectors, callbacks, timers, custom policies, uvloop, run_in_executor, and signal handling for production async systems.

Advanced Generic Patterns

Master Self type, TypeVarTuple, recursive types, generic protocols, and generic type aliases for framework-level type-safe design including builder patterns and tensor shape typing.

Advanced PEFT Methods

Beyond LoRA - Prefix Tuning, Prompt Tuning, IA3, AdaLoRA, VeRA, and LoftQ. When to reach for each method, how they compare on parameter count and quality, and practical implementation with the PEFT library.

Advanced Prompting Techniques

Master self-refinement, Tree of Thought, ReAct, meta-prompting, and other advanced techniques for reliable, sophisticated LLM behavior in production.

Advanced RAG Patterns

Go beyond naive RAG - master query transformation, HyDE, multi-query retrieval, Self-RAG, Corrective RAG, and iterative retrieval patterns for complex questions.

Advanced Spark Performance Tuning for ML Workloads

Systematic techniques to diagnose and eliminate Spark bottlenecks - data skew, shuffle overhead, memory pressure, and suboptimal joins - reducing job time and cost by 10x.

AdvancedMathBench: A Benchmark Suite for Advanced Mathematical Proof Generation and Verification

Large language models (LLMs) have achieved remarkable performance on high-school and olympiad-style mathematics, yet their capabilities on advanced math...

Advancing Creative Physical Intelligence in Large Multimodal Models

Large multimodal models (LMMs) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities generalize to d...

Advancing Language Models through Instruction Tuning: Recent Progress and Challenges.

Advancing Language Models through Instruction Tuning... - published at EMNLP 2025.

Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series

The development of the Bielik v3 PL series, encompassing both the 7B and 11B parameter variants, represents a significant milestone in the field of lang...

Adversarial Examples

Crafting inputs that reliably cause model failures - attack techniques, transferability, and robust defense strategies for production AI systems.

AERA Chat: An Interactive Platform for Automated Explainable Student Answer Assessment.

AERA Chat: An Interactive Platform for Automated Exp... - published at EMNLP 2025.

AffectFlow-DINO: Uncertainty-Aware Multi-Task Affect Estimation via Conditional Rectified Flow

We present AffectFlow-DINO, a multi-task learning system for the 11th ABAW challenge that extends a standard deterministic architecture with a condition...

AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems

Multi-agent systems built on large language models (LLMs) require many coordination choices that are difficult to fix a priori: which skill protocol to...

Agent Communication Protocols

How agents pass information: message formats, schemas, synchronous vs async, routing, error propagation, and tracing through multi-agent systems.

Agent Evaluation

Measuring LLM agent performance through trajectory analysis, benchmark suites, LLM-as-judge, failure taxonomies, and production monitoring strategies.

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools that internal reasoning...

Agent Safety and Guardrails

Implementing defense-in-depth safety for production LLM agents - prompt injection defense, input/output guardrails, tool sandboxing, HITL confirmation, and audit logging.

Agent vs Chatbot vs Workflow

Precise technical definitions for chatbots, workflows, and AI agents - with decision criteria, cost/reliability tradeoffs, and code examples of all three for the same task.

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

Large language models are increasingly expected to serve as general-purpose agents that interact with external, stateful tool environments. The Model Co...

AgentCompass: A Unified Evaluation Infrastructure for Agent Capabilities

As Large Language Models (LLMs) evolve into autonomous agents, the need for unified evaluation infrastructure becomes critical. However, current evaluat...

AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning.

AgentCPM-GUI: Building Mobile-Use Agents with Reinfo... - published at EMNLP 2025.

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhi...

AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning

Recent progress on long-horizon agentic tasks has been driven largely by scaling up individual agents through stronger models, better tools, and more ef...

AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning

Large Language Models (LLMs) increasingly rely on agentic capabilities - iterative retrieval, tool use, and decision-making - to overcome the limits of stat...

AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

Autonomous computer use agents that are powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digita...

Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in pa...

Agentic AI Systems Should Be Designed as Marginal Token Allocators

This position paper argues that agentic AI systems should be designed and evaluated as marginal token allocation economies rather than as text generator...

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious...

Agentic Code Editing

How coding agents read, navigate, and surgically modify existing codebases: edit strategies, minimal diffs, regression prevention, and multi-file coordination.

Agentic Design Patterns

The 5 core patterns from Anthropic's research - prompt chaining, routing, parallelization, orchestrator-subagents, and evaluator-optimizer - with full Python implementations.

Agentic RAG

Build RAG systems that reason, iterate, and self-correct - covering Self-RAG, FLARE, ReAct tool-augmented RAG, RAPTOR, and Corrective RAG with full production implementations using the Anthropic SDK.

Agentic RAG

Build agents that control their own retrieval - multi-step reasoning, router agents, ReAct loops, LangGraph stateful pipelines, and production patterns for agentic retrieval systems.

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a centra...

AgentLens: Production-Assessed Trajectory Reviews for Coding Agent Evaluation

We present AgentLens, a production-assessed benchmark for interactive code agents. Most code-agent benchmarks reduce a run to a single bit -- did the ta...

Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity

LLM-based agents are assumed to integrate environmental observations into their reasoning: discovering highly relevant but unexpected information should...

AgentSearchBench: A Benchmark for AI Agent Search in the Wild

The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable a...

AgentSPEX: An Agent SPecification and EXecution Language

Language-model agent systems commonly rely on reactive prompting, in which a single instruction guides the model through an open-ended sequence of reaso...

AgentSwing: Adaptive Parallel Context Management Routing for Long-Horizon Web Agents

As large language models (LLMs) evolve into autonomous agents for long-horizon information-seeking, managing finite context capacity has become a critic...

Agnostic learning in (almost) optimal time via Gaussian surface area

The complexity of learning a concept class under Gaussian marginals in the difficult agnostic model is closely related to its $L_1$-approximability by l...

AgriIR: A Scalable Framework for Domain-Specific Knowledge Retrieval

This paper introduces AgriIR, a configurable retrieval augmented generation (RAG) framework designed to deliver grounded, domain-specific answers while...

AI Agents Can Already Autonomously Perform Experimental High Energy Physics

Large language model-based AI agents are now able to autonomously execute substantial portions of a high energy physics (HEP) analysis pipeline with min...

AI Co-Mathematician: Accelerating Mathematicians with Agentic AI

We introduce the AI co-mathematician, a workbench for mathematicians to interactively leverage AI agents to pursue open-ended research. The AI co-mathem...

AI Error Handling and Fallbacks

Graceful degradation, retry logic, circuit breakers, fallback model chains, and user-facing error messages for production AI systems.

AI Feature Flags and Rollouts

Safely rolling out AI features with canary deployments, quality-gated rollouts, A/B testing, and kill switches.

AI in Litigation Support

Timeline extraction, deposition analysis, exhibit classification, chronology building, and the AI systems that help litigators prepare and try cases.

AI Product Architecture

End-to-end architecture for a production AI product from API to database.

AI Product Design Principles

Principles for designing AI products that build trust, degrade gracefully, and solve the last-mile problem between model capability and user value.

AI Regulation and FDA Compliance

Regulatory landscape for healthcare AI - FDA SaMD classification, 510(k) vs PMA clearance, EU AI Act, HIPAA compliance for AI, bias auditing, and post-market surveillance for deployed medical AI systems.

AI Research Agents Narrow Scientific Exploration

AI research agents can now generate research ideas, design experiments, run code, and draft papers, raising the possibility of large-scale AI-assisted s...

AI Safety Evaluations

Safety benchmarks, capability evaluations, LLM judges, uplift assessments, and how labs like Anthropic use evaluation-gated deployment through Responsible Scaling Policies.

AI scientists produce results without reasoning scientifically

Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to t...

AI Security Governance

Organizational security policies, risk classification frameworks, compliance programs, lifecycle governance, model cards, incident response, and vendor risk management for responsible AI system deployment.

AI-Powered Assessment

Learn how AI systems automatically score essays, grade short answers, generate feedback, detect plagiarism, and audit for bias in educational assessment pipelines.

AIPOM: Agent-aware Interactive Planning for Multi-Agent Systems.

AIPOM: Agent-aware Interactive Planning for Multi-Ag... - published at EMNLP 2025.

Airflow for ML Pipelines

Orchestrate ML training pipelines with Airflow - data quality gates, KubernetesPodOperator training, champion/challenger evaluation, and conditional deployment.

AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

As reinforcement learning continues to scale the training of large language model-based agents, reliably verifying agent behaviors in complex environmen...

AlayaWorld: Long-Horizon and Playable Video World Generation

Game worlds have traditionally been built through labor-intensive production pipelines, making them costly to develop, difficult to customize, and e...

Alerting and Incident Response for ML

ML-specific alerting design, alert taxonomy, routing with PagerDuty and OpsGenie, on-call runbook design for ML models, post-mortem templates, and reducing MTTD and MTTR for ML incidents.

Alerting on LLM Quality Degradation

Build production alerting systems for LLM quality - threshold alerts, statistical process control, anomaly detection, deployment correlation, runbooks, and Prometheus/Grafana integration.

Aligning What LLMs Do and Say: Towards Self-Consistent Explanations.

Aligning What LLMs Do and Say: Towards Self-Consiste... - published at ACL 2026.

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we...

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models.

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Ben... - published at ACL 2026.

AlphaTransit: Learning to Design City-scale Transit Routes

Designing a transit network requires many sequential route extension decisions, but their quality is often visible only after the full network is assemb...

Amortized Optimal Transport from Sliced Potentials

We propose a novel amortized optimization method for predicting optimal transport (OT) plans across multiple pairs of measures by leveraging Kantorovich...

Ampere, Hopper, and Ada Architectures

What changed across GPU generations for AI - A100 vs H100 vs H200 vs RTX 4090, NVLink bandwidth, transformer engine, FP8 support, and architecture selection for training and inference.

An adaptive wavelet-based PINN for problems with localized high-magnitude source

In recent years, physics-informed neural networks (PINNs) have gained significant attention for solving differential equations, although they suffer fro...

An Address Intelligence Framework for E-commerce Deliveries.

An Address Intelligence Framework for E-commerce Del... - published at EMNLP 2025.

An automatic counting algorithm for the quantification and uncertainty analysis of the number of microglial cells trainable in small and heterogeneous datasets

Counting immunopositive cells on biological tissues generally requires either manual annotation or (when available) automatic rough systems, for scannin...

An Efficient Unsupervised Federated Learning Approach for Anomaly Detection in Heterogeneous IoT Networks

Federated learning (FL) is an effective paradigm for distributed environments such as the Internet of Things (IoT), where data from diverse devices with...

An Open-Source, Open Data Approach to Activity Classification from Triaxial Accelerometry in an Ambulatory Setting

The accelerometer has become an almost ubiquitous device, providing enormous opportunities in healthcare monitoring beyond step counting or other averag...

An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning

In online incremental learning, data continuously arrives with substantial distributional shifts, creating a significant challenge because previous samp...

Analysing LLM Persona Generation and Fairness Interpretation in Polarised Geopolitical Contexts.

Analysing LLM Persona Generation and Fairness Interp... - published at EACL 2026.

Anisotropic Modality Align

Training multimodal large language models has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the sha...

Annotation Pipelines

Data labeling workflows, annotation guidelines, inter-annotator agreement, conflict resolution, and quality control for training data that powers AI systems.

Anomaly Detection in Sequences

Master anomaly detection for sequential data - from statistical baselines to LSTM autoencoders. Learn why standard methods fail on time series, how to pick thresholds, and how to build production-grade systems that catch real anomalies without drowning your team in false alarms.

Anomaly Detection on Sensor Data

Learn how to detect anomalies in industrial sensor data using statistical baselines, isolation forests, LSTM autoencoders, multivariate deep learning methods, and real-time streaming architectures.

AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors

Zero-shot anomaly detection aims to detect and localise abnormal regions in the image without access to any in-domain training images. While recent appr...

Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and doma...

ANTIC: Adaptive Neural Temporal In-situ Compressor

The persistent storage requirements for high-resolution, spatiotemporally evolving fields governed by large-scale and high-dimensional partial different...

Anticipate and Learn: Unleashing Idle-Time Compute in Proactive Agents

While AI agents demonstrate remarkable capabilities in reasoning and tool use, they remain fundamentally reactive: they compute responses only after exp...

AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model

Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remains challenging for non-generative reconstruction. Existing...

Apache Airflow Architecture

Deep dive into Apache Airflow - DAGs, Scheduler internals, Executors, Operators, XCom, and production patterns for reliable pipeline orchestration.

Apache Airflow for ML

Learn how to use Apache Airflow to orchestrate production ML pipelines - DAG authoring, executors, XCom patterns, and avoiding the most common Airflow pitfalls.

Apache Flink Fundamentals

Apache Flink for stateful stream processing - DataStream API, windows, watermarks, state backends, checkpointing, and PyFlink for ML feature computation.

Apache Hudi

Hudi's copy-on-write vs merge-on-read and upsert patterns.

Apache Iceberg

Iceberg table format, ACID transactions, schema evolution, and time travel.

Apache Kafka Architecture - The Nervous System of Real-Time ML

A deep dive into Kafka's distributed commit log, partitions, replication, consumer groups, compacted topics, and the architectural decisions that make it the standard event transport for production ML systems.

Apache Spark Architecture

How Spark's distributed execution model works - RDDs, DataFrames, DAG planning, Catalyst optimization, and Tungsten execution - explained for engineers building ML data pipelines.

APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

Music popularity prediction has attracted growing research interest, with relevance to artists, platforms, and recommendation systems. However, the expl...

Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (...

Apple Silicon for AI

Apple M-series unified memory architecture for ML inference - how the ANE, GPU, and CPU share one memory pool, why this matters for local LLMs, and how to run models with MLX and llama.cpp on Apple Silicon.

Approximate Nearest Neighbor Algorithms

Deep dive into HNSW, IVF, Product Quantization, IVFPQ, LSH, and DiskANN - how each algorithm trades recall for speed and how to choose the right one for your dataset.

Approximation and learning of anisotropic and mixed smooth functions by deep ReLU neural networks

This paper studies how efficiently deep ReLU neural networks can approximate and learn smooth functions. When the error is measured in $L^p([0,1]^d)$ no...

ARDY: Autoregressive Diffusion with Hybrid Representation for Interactive Human Motion Generation

Generating realistic 3D human motions in real-time within interactive applications is key for animation, simulation, and humanoid robotics. While recent...

Are LLMs Ready for Scientific Discovery? A Capability-Oriented Benchmark for AI Scientists

Existing benchmarks for scientific data analysis evaluate LLMs primarily on code execution or workflow completion, overlooking that scientific analysis...

Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performanc...

Argumentation and Judgement Factors: LLM-based Discovery and Application in Insurance Disputes.

Argumentation and Judgement Factors: LLM-based Disco... - published at EACL 2026.

ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

This report describes ARIS (Auto-Research-in-sleep), an open-source research harness for autonomous research, including its architecture, assurance mech...

ARM vs x86 for AI Workloads

Comprehensive comparison of ARM and x86 architectures for ML workloads - ISA design, power efficiency, Apple Silicon unified memory, AWS Graviton3 inference, and performance-per-watt analysis for production AI systems.

Artifact Management & Experiment Organization

Managing ML artifacts at scale - naming conventions, tagging, parent-child relationships, archival policies, and finding the model that reached production among 2000 runs.

ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics

We present ArtifactNet, a lightweight framework that detects AI-generated music by reframing the problem as forensic physics -- extracting and analyzing...

Artificial Intelligence for Detecting Fetal Orofacial Clefts and Advancing Medical Education

Orofacial clefts are among the most common congenital craniofacial abnormalities, yet accurate prenatal detection remains challenging due to the scarcit...

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. As...

ASRank: Zero-Shot Re-Ranking with Answer Scent for Document Retrieval.

ASRank: Zero-Shot Re-Ranking with Answer Scent for D... - published at NAACL 2025.

Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent

The rapid advancement of large language models (LLMs) has enabled powerful authorship inference capabilities, raising growing concerns about unintended...

Assessing Pancreatic Ductal Adenocarcinoma Vascular Invasion: the PDACVI Benchmark

Surgical resection remains the only potentially curative treatment for pancreatic ductal adenocarcinoma (PDAC), and eligibility depends on accurate asse...

Assign and Add: A Mechanistic Study of Compositional Arithmetic

Large language models are able to compose skills in order to perform complex tasks, many of which might not have been seen during training. The details...

Asymptotic and Finite-Time Guarantees for Langevin-Based Temperature Annealing in InfoNCE

The InfoNCE loss in contrastive learning depends critically on a temperature parameter, yet its dynamics under fixed versus annealed schedules remain po...

Async Context Managers

Master async resource management with __aenter__/__aexit__, asynccontextmanager, AsyncExitStack, and production patterns for connection pools and sessions.

Async Generators and Async Iterators

Build streaming data pipelines with async for, async yield, __aiter__/__anext__, async comprehensions, and finalization protocols for production async iteration.

Async LLM Calls

Asynchronous LLM call patterns for high-throughput applications - concurrency control with semaphores, producer-consumer queues, token bucket rate limiting, circuit breakers, and async orchestration patterns.

Async Synchronization Patterns

Implement bounded concurrency, rate limiting, and circuit breakers with asyncio locks, semaphores, events, conditions, and barriers.

AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

Large language model (LLM)-based agents have shown strong capabilities in using external tools to solve complex tasks. However, existing evaluations oft...

AsySplat: Efficient Asymmetric 3D Gaussian Splatting for Long-Sequence Scene Modeling

Recent generalizable 3D Gaussian Splatting models have advanced long-sequence novel view synthesis (NVS), but at the cost of substantial redundant compu...

ATANT: An Evaluation Framework for AI Continuity

We present ATANT (Automated Test for Acceptance of Narrative Truth), an open evaluation framework for measuring continuity in AI systems: the ability to...

AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency

Large language models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex tasks. Yet ensuring that the reasoning trace both co...

Attending to Multimodal Generation One Token at a Time

Multimodal large language models (MLLMs) generate responses autoregressively, integrating visual and linguistic information in an evolving context. Prio...

Attention as Explanation - What Transformers Are (and Aren't) Looking At

When attention weights help explain transformer decisions, when they mislead, and the debate between attention-as-explanation and attention-is-not-explanation.

Attention Is All You Need

The 2017 Vaswani et al. paper that replaced recurrent networks with pure attention - and why it changed everything.

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

As the foundational architecture of modern machine learning, Transformers have driven remarkable progress across diverse AI domains. Despite their trans...

Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to adva...

Audio-Language Models

How modern AI systems process speech and audio - from Whisper's spectrogram-based ASR to end-to-end audio understanding in GPT-4o and Gemini.

Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typical...

Audio-Visual Flamingo: Open Audio-Visual Intelligence for Long and Complex Videos

We present Audio-Visual Flamingo (AV-Flamingo), a fully open state-of-the-art audio-visual large language model (AV-LLM) for joint understanding and rea...

Audio-Visual Intelligence in Large Foundation Models

Audio-Visual Intelligence (AVI) has emerged as a central frontier in artificial intelligence, bridging auditory and visual modalities to enable machines...

Augmented Lagrangian Multiplier Network for State-wise Safety in Reinforcement Learning

Safety is a primary challenge in real-world reinforcement learning (RL). Formulating safety requirements as state-wise constraints has become a prominen...

AURA: Always-On Understanding and Real-Time Assistance via Video Streams

Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and...

Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes

We study auto research as a closed empirical loop driven by external measurement. Each submitted trial carries a hypothesis, an executable code edit, an...

Autoencoders

Neural network autoencoders for unsupervised representation learning - undercomplete, denoising, sparse, contractive variants with PyTorch on MNIST, anomaly detection, and sparse autoencoders for LLM interpretability.

AutoGen Conversational Agents

Microsoft AutoGen v0.4 - event-driven multi-agent runtime, AgentChat teams, code execution, and production patterns for conversational AI systems.

AutoGen Deep Dive

Microsoft AutoGen v0.4: async conversational multi-agent systems, actor model architecture, group chat patterns, and MagenticOne.

Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM

This paper studies Automated Instruction Revision (AIR), a rule-induction-based method for adapting large language models (LLMs) to downstream tasks usi...

Automated Prediction of Postoperative Pancreatic Fistula Using Preoperative Computed Tomography

Postoperative pancreatic fistula (POPF) is a serious complication after pancreatic resection, increasing morbidity, hospital stay, and healthcare costs....

Automated Retraining Pipelines

Build fully automated trigger-based model retraining pipelines - from drift detection through training to production deployment, with human-in-the-loop approval.

Automatically Discovering How Misogyny is Framed on Social Media.

Automatically Discovering How Misogyny is Framed on... - published at NAACL 2025.

Automating Database-Native Function Code Synthesis with LLMs

Database systems incorporate an ever-growing number of functions in their kernels (a.k.a., database native functions) for scenarios like new application...

Automating the Design of Embodied Agent Architectures

Embodied agents are typically built as hand-designed compositions of perception, memory, planning, and action modules. This modularity exposes a large a...

Autoregressive Decoding

Understand how LLMs generate tokens one at a time, why decoding is memory-bandwidth bound, and how to reason about inference latency with the roofline model.

AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery

Scientific research is being reshaped by AI systems that move beyond isolated assistance toward longer-horizon workflows spanning literature grounding,...

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

Autonomous scientific research is significantly advanced thanks to the development of AI agents. One key step in this process is finding the right scien...

Autoscaling ML Workloads

Horizontal Pod Autoscaler, KEDA event-driven autoscaling for GPU metrics, zero-downtime rolling updates with readiness gates, and autoscaling patterns for production ML serving.

AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can automate parts...

AUTOSUMM: A Comprehensive Framework for LLM-Based Conversation Summarization.

AUTOSUMM: A Comprehensive Framework for LLM-Based Co... - published at ACL 2025.

AvatarPointillist: AutoRegressive 4D Gaussian Avatarization

We introduce AvatarPointillist, a novel framework for generating dynamic 4D Gaussian avatars from a single portrait image. At the core of our method is...

AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmark...

Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing....

AWQ In-Depth

How Activation-aware Weight Quantization protects salient weights to achieve near-lossless INT4 compression, and how to deploy AWQ models with AutoAWQ and vLLM.

AWQ: Activation-Aware Weight Quantization

AWQ protects the 1% of weights that matter most - how activation statistics reveal salient weights, how scaling preserves them without extra memory, why AWQ outperforms GPTQ at INT4 for production inference, and how to configure Marlin kernels for maximum throughput.

AWS Data Services

S3, Glue, Athena, EMR, and the AWS data engineering ecosystem.

AWS SageMaker for MLOps

Master the complete AWS SageMaker ecosystem for end-to-end ML workflows - training jobs, pipelines, model registry, feature store, and production inference at scale.

AWS Trainium and Inferentia

Deep dive into AWS custom AI chips - Trainium for training and Inferentia for inference, NeuronCore-v2 architecture, the Neuron SDK compilation pipeline, and real-world cost-performance tradeoffs versus GPU instances.

Axolotl and TRL Training Frameworks

Using Axolotl and HuggingFace TRL for LoRA and QLoRA fine-tuning - configuration files, SFTTrainer, DPO training, and distributed multi-GPU fine-tuning setups.

Azure ML for MLOps

Master the Azure Machine Learning platform for enterprise ML workflows - workspaces, component-based pipelines, managed endpoints, MLflow integration, and responsible AI.

Back to Repair: A Minimal Denoising Network for Time Series Anomaly Detection

We introduce JuRe (Just Repair), a minimal denoising network for time series anomaly detection that exposes a central finding: architectural complexity...

Back-of-the-Envelope Estimation for ML Systems

How to estimate storage, compute, memory, and infrastructure requirements for ML systems before writing a line of code - including the 6PD training compute rule and model sizing.

Backdoor Attacks on Decentralised Post-Training

Decentralised post-training of large language models utilises data and pipeline parallelism techniques to split the data and the model. Unfortunately, d...

Backpropagation From Scratch

Full chain rule derivation on computational graphs, Jacobian matrices and vector-Jacobian products, reverse-mode vs forward-mode autodiff, numpy 3-layer MLP implementation, PyTorch custom autograd Functions, and numerical gradient checking - every concept a senior engineer needs to debug, extend, and explain backprop under pressure.

BadWAM: When World-Action Models Dream Right but Act Wrong

World-action models (WAMs) are emerging as a promising foundation for embodied control: rather than predicting actions alone, they learn representations...

Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO

Reinforcement learning with verifiable rewards (RLVR) has become a central paradigm for improving reasoning and code generation in large language models...

Balancing Fidelity and Diversity in Diffusion Models via Symmetric Attention Decomposition: Hopfield Perspective

We characterize the pre-softmax attention matrix $QK^\top$ in transformers as an associative memory matrix encoding pairwise associations between input fea...

Balancing Fidelity, Utility, and Privacy in Synthetic Cardiac MRI Generation: A Comparative Study

Deep learning in cardiac MRI (CMR) is fundamentally constrained by both data scarcity and privacy regulations. This study systematically benchmarks thre...

Batch Inference Pipelines

Designing efficient batch inference pipelines for scoring millions of examples - architecture, GPU utilization, failure recovery, and production patterns.

Batch Normalization

Batch normalization mechanics, train vs eval mode pitfalls, loss landscape smoothing theory, Layer Norm, Group Norm, Instance Norm, RMS Norm, pre-norm vs post-norm in transformers, and production freeze patterns - with full PyTorch implementations.

Batch Normalization for Neural Networks on Complex Domains

Riemannian neural networks have proven effective in solving a variety of machine learning tasks. The key to their success lies in the development of pri...

Batch Orchestration Patterns for ML Pipelines

How to orchestrate complex batch ML pipelines with Airflow and modern alternatives, eliminating cron's silent failures, missing dependencies, and zero visibility.

Batch Processing with LLMs

Efficiently processing large document sets with LLM batch APIs - Anthropic Batch API, cost optimization, monitoring, checkpointing, and production patterns for overnight and large-scale LLM workloads.

Batch Processing with Spark for ML Pipelines

How Apache Spark processes terabyte-scale training data - architecture, DataFrames, partitioning, joins, and integration with Delta Lake for ML feature engineering.

Batched Kernelized Bandits: Refinements and Extensions

In this paper, we consider the problem of black-box optimization with noisy feedback revealed in batches, where the unknown function to optimize has a b...

Batching Strategies for Inference

How static, dynamic, and continuous batching work - and how to go from 20% GPU utilization to 85% without increasing latency.

Batching Strategies for LLM Serving

Static batching, dynamic batching, continuous batching, chunked prefill, and prefill-decode disaggregation for LLM inference throughput and latency optimization.

Bayesian Additive Distribution Regression

Distribution regression, where the goal is to predict a scalar response from a distribution-valued predictor, arises naturally in settings where observa...

Bayesian Linear Regression - Uncertainty Estimates for Every Prediction

How placing a prior on linear regression weights gives a full posterior distribution over predictions - with closed-form solutions, predictive uncertainty, and connections to ridge regression.

Bayesian Neural Networks - Uncertainty Quantification for Deep Learning

How to place priors on neural network weights and approximate the posterior with variational inference or Monte Carlo dropout - with production trade-offs.

Bayesian Optimisation - Efficient Hyperparameter Search and Black-Box Optimization

How Bayesian Optimisation uses Gaussian Processes and acquisition functions to find near-optimal hyperparameters in far fewer evaluations than grid or random search - with full Python implementation using BoTorch and Optuna.

Bayesian X-Learner: Calibrated Posterior Inference for Heterogeneous Treatment Effects under Heavy-Tailed Outcomes

Conditional Average Treatment Effect (CATE) estimation in practice demands three properties simultaneously: heterogeneous effects $τ(x)$, calibrated unc...

Behavior-dLDS: A decomposed linear dynamical systems model for neural activity partially constrained by behavior

Brain-wide recordings of large-scale networks of neurons now provide an unprecedented view into how the brain drives behavior. However, brain activity c...

Benchmarking and Building Zero-Shot Hindi Retrieval Model with Hindi-BEIR and NLLB-E5.

Benchmarking and Building Zero-Shot Hindi Retrieval... - published at NAACL 2025.

Benchmarking Composed Image Retrieval for Applied Earth Observation

Remote sensing composed image retrieval (RSCIR) enables search in large satellite image archives using composed queries that combine a reference image w...

Benchmarking Compressed Models

How to systematically evaluate accuracy-efficiency tradeoffs in quantized, pruned, and distilled models - perplexity, task-specific capabilities, latency, throughput, and automated regression detection.

Benchmarking Local Model Performance

Measuring local LLM inference speed - tokens per second, time to first token, memory usage, and systematic comparison across quantization levels, models, and hardware configurations.

Benchmarks: MMLU, HumanEval, and HELM

Navigate the LLM benchmark ecosystem - what each benchmark actually measures, saturation, contamination, and how to build benchmarks that can't be gamed.

Benchmarks: WebArena and OSWorld

Understanding computer use agent benchmarks - WebArena, OSWorld, ScreenSpot, Mind2Web. Current SOTA results, what the numbers mean, and how to evaluate your own agent.

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

Prior work shows that fine-tuning aligned models on benign data degrades safety in text and vision modalities, and that proximity to harmful content in...

Better Learning-Augmented Spanning Tree Algorithms via Metric Forest Completion

We present improved learning-augmented algorithms for finding an approximate minimum spanning tree (MST) for points in an arbitrary metric space. Our wo...

BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations

The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic understan...

Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback.

Beyond "Not Novel Enough": Enriching Schol... - published at EACL 2026.

Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning

In real-world Tool-Integrated Reasoning (TIR) scenarios, where LLMs interleave reasoning with external tool calls, a major source of inefficiency is tha...

Beyond Additive Decompositions: Interpretability Through Separability

Interpretable machine learning requires models that are accurate and structurally faithful to the data. Existing explainability methods rely heavily on a...

Beyond Augmented-Action Surrogates for Multi-Expert Learning-to-Defer

Learning-to-Defer routes each input to the expert that minimizes expected cost, but it assumes that the information available to every expert is fixed a...

Beyond Distribution Sharpening: The Importance of Task Rewards

Frontier models have demonstrated exceptional capabilities following the integration of task-reward-based reinforcement learning (RL) into their trainin...

Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows

Large Language Models (LLMs) are increasingly deployed as autonomous agents that reason, use tools, and act over multiple steps. Yet most hallucination...

Beyond Gaussian Bottlenecks: Topologically Aligned Encoding of Vision-Transformer Feature Spaces

Modern visual world modeling systems increasingly rely on high-capacity architectures and large-scale data to produce plausible motion, yet they often f...

Beyond Grid Search: Leveraging Bayesian Optimization for Accelerating RAG Pipeline Optimization.

Beyond Grid Search: Leveraging Bayesian Optimization... - published at EACL 2026.

Beyond Hard Negatives: The Importance of Score Distribution in Knowledge Distillation for Dense Retrieval

Transferring knowledge from a cross-encoder teacher via Knowledge Distillation (KD) has become a standard paradigm for training retrieval models. While...

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

LLM-based autonomous agents have demonstrated strong capabilities in reasoning, planning, and tool use, yet remain limited when tasks require sustained...

Beyond Mixtures and Products for Ensemble Aggregation: A Likelihood Perspective on Generalized Means

Density aggregation is a central problem in machine learning, for instance when combining predictions from a Deep Ensemble. The choice of aggregation re...

Beyond NNGP: Large Deviations and Feature Learning in Bayesian Neural Networks

We study wide Bayesian neural networks focusing on the rare but statistically dominant fluctuations that govern posterior concentration, beyond Gaussian...

Beyond Perception Errors: Semantic Fixation in Large Vision-Language Models

Large vision-language models (VLMs) often rely on familiar semantic priors, but existing evaluations do not cleanly separate perception failures from ru...

Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes

Text-driven inversion of generative models is a core paradigm for manipulating 2D or 3D content, unlocking numerous applications such as text-based edit...

Beyond Retrieval: A Multitask Benchmark and Model for Code Search

Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines with reranking and developer-s...

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k r...

Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforc...

Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD

It is currently difficult to distill discrete diffusion models. In contrast, continuous diffusion literature has many distillation methods th...

Beyond Stochastic Exploration: What Makes Training Data Valuable for Agentic Search

Reinforcement learning (RL) has become an effective approach for advancing the reasoning capabilities of large language models (LLMs) through the strate...

Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integrat...

Bibby AI: An Editor-Native Agentic Platform for Academic Research, Writing, and Publishing

Academic output is produced across a fragmented toolchain: literature discovery in one application, reference management in another, writing in a LaTeX...

BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs

Transforming causal generative language models into bidirectional encoders offers a powerful alternative to BERT-style architectures. However, current a...

BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models

Despite the success of large language models (LLMs) on general-purpose tasks, their performance in highly specialized domains such as biomedicine remain...

BlenderRAG: High-Fidelity 3D Object Generation via Retrieval-Augmented Code Synthesis

Automatic generation of executable Blender code from natural language remains challenging, with state-of-the-art LLMs producing frequent syntactic error...

BLEU, ROUGE, and Generation Metrics

Master reference-based generation metrics - BLEU, ROUGE, BERTScore, BLEURT - and know exactly when each one lies to you.

Blind-Spots-Bench: Evaluating Blind Spots in Multimodal Models

Modern AI models achieve strong performance on many established benchmarks, yet they still fail on tasks that humans find almost trivial, such as manipu...

BLISSNet: Deep Operator Learning for Fast and Accurate Flow Reconstruction from Sparse Sensor Measurements

Reconstructing fluid flows from sparse sensor measurements is a fundamental challenge in science and engineering. Widely separated measurements and comp...

BMdataset: A Musicologically Curated LilyPond Dataset

Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music...

Boogu-Image-0.1: Boosting Open-Source Unified Multimodal Understanding and Generation

We introduce Boogu-Image-0.1, an open-source unified multimodal understanding and generation model family, comprising Base, Turbo, Edit, and Edit-Turbo...

BOOKCOREF: Coreference Resolution at Book Scale.

BOOKCOREF: Coreference Resolution at Book Scale. - published at ACL 2025.

BOOKMARKS: Efficient Active Storyline Memory for Role-playing

Memory systems are critical for role-playing agents (RPAs) to maintain long-horizon consistency. However, existing RPA memory methods (e.g., profiling)...

Boosting deep Reinforcement Learning using pretraining with Logical Options

Deep reinforcement learning agents are often misaligned, as they over-exploit early reward signals. Recently, several symbolic approaches have addressed...

Boosting Visual Instruction Tuning with Self-Supervised Guidance

Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-gr...

BoSS: A Best-of-Strategies Selector as an Oracle for Deep Active Learning

Active learning (AL) aims to reduce annotation costs while maximizing model performance by iteratively selecting valuable instances. While foundation mo...

BracketRank: Large Language Model Document Ranking via Reasoning-based Competitive Elimination.

BracketRank: Large Language Model Document Ranking v... - published at ACL 2026.

Breaking the Chains of Probability: Neutrosophic Logic as a New Framework for Epistemic Uncertainty in Large Language Models

Large Language Models (LLMs) are predominantly governed by probabilistic frameworks in which the sum of outcome probabilities is constrained to unity. T...

Breaking the Tuning Barrier: Zero-Hyperparameters Yield Multi-Corner Analysis Via Learned Priors

Yield Multi-Corner Analysis validates circuits across 25+ Process-Voltage-Temperature corners, resulting in a combinatorial simulation cost of $O(K \times ...

Bridging Attribution and Open-Set Detection using Graph-Augmented Instance Learning in Synthetic Speech.

Bridging Attribution and Open-Set Detection using Gr... - published at EACL 2026.

Broadening Access to Transportation Safety Data with Generative AI: A Schema-Grounded Framework for Spatial Natural Language Queries

Transportation safety analysis requires integrating crash records, roadway attributes, and geospatial data through GIS-based workflows, but access remai...

Browser Agents

Building practical browser agents using Playwright and LLMs - DOM manipulation, visual navigation, session management, anti-bot handling, and complete Python implementation.

Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection

Scientific discovery increasingly relies on AI systems to select candidates for expensive experimental validation, yet no principled, budget-aware evalu...

Build Systems and CI/CD for ML

How build systems and CI/CD pipelines keep ML projects reproducible, tested, and safely deployable - covering Make, Bazel, DVC, MLflow, GitHub Actions, and canary deployments.

Build vs Buy Analysis

A rigorous financial and risk framework for deciding when to build ML infrastructure in-house vs use managed services - applied to feature stores, vector DBs, LLMs, and more.

Build vs. Buy Economics for ML Tools

Economic analysis for ML tooling decisions - TCO framework, self-hosted vs. managed analysis, hidden costs of self-hosting, and a full financial case for W&B vs. MLflow.

Building an Evaluation Harness

Building a production evaluation harness for LLMs - lm-evaluation-harness architecture, custom task integration, CI/CD evaluation gates, versioned evaluation datasets, and automated regression detection.

Building an MCP Server

Hands-on guide to building a production-quality MCP filesystem server in Python using the official MCP SDK - complete with 4 tools, resources, MCP Inspector testing, and Claude Desktop integration.

Building Embedding Pipelines

Design production embedding pipelines - model selection, batch ingestion, incremental indexing, zero-downtime model upgrades, embedding drift detection, normalization, and dimensionality reduction.

Building Golden Datasets

Learn how to construct, annotate, validate, and maintain golden datasets that serve as the ground truth foundation for all AI system evaluation - covering annotation guidelines, inter-annotator agreement, adversarial generation, dataset versioning, and drift detection.

Building Your Own Coding Agent

Build a complete, functional coding agent from scratch in Python. Architecture decisions, repo maps, context management, system prompts, safety, and the full 500-line agent.

Bytecode Inspection - Inside the code Object

Understand Python bytecode and the code object at engineering depth - all co_ attributes explained, how .pyc files work, reading bytecode with marshal, the line number table, closures in bytecode, and practical uses in debuggers and test frameworks.

C and C++ for ML Systems

Learn why C and C++ form the foundation of every major ML framework, and how to read, write, and debug C++ code as an ML systems engineer.

C Extensions and FFI - When Python Isn't Fast Enough

Master ctypes, cffi, Cython, and pybind11 for calling C/C++ from Python - loading shared libraries, writing CPython extensions, and accelerating hot paths with compiled code.

C-GenReg: Training-Free 3D Point Cloud Registration by Multi-View-Consistent Geometry-to-Image Generation with Probabilistic Modalities Fusion

We introduce C-GenReg, a training-free framework for 3D point cloud registration that leverages the complementary strengths of world-scale generative pr...

C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

Rubric-augmented verification guides reward models with explicit evaluation criteria, yielding more reliable judgments than single-model verification. H...

Caching for ML Serving

How to reduce ML serving cost and latency with result caching, semantic similarity caching, KV cache for transformers, prefix caching for LLMs, and feature caching with Redis.

Caching Strategies

Four caching layers for LLM applications - exact match, semantic similarity, provider prefix caching, and KV cache - with implementation patterns and production tradeoffs.

Caching Strategies - Trading Memory for Speed

Master functools.lru_cache, functools.cache, TTL caches, memoization patterns, cache invalidation, cachetools, Redis caching, and cache stampede prevention.

Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling

Speculative sampling (SpS) has been successful in accelerating the decoding throughput of auto-regressive large language models by leveraging smaller dr...

Can Dialects Be Steered Like Languages? Sparse Neurons and Distributed Directions in Arabic LLMs

A key challenge in Arabic NLP is the scarcity of dialectal data relative to Modern Standard Arabic (MSA), causing LLMs to overproduce MSA and struggle w...

Can LLMs Introspect? A Reality Check

Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is yes. We argue...

Can LLMs Learn to Reason Robustly under Noisy Supervision?

Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains reasoning models that rely on abundant perfect labels, but its vulnerability to...

Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?

Modeling long-range spatiotemporal dynamics in functional Magnetic Resonance Imaging (fMRI) remains a key challenge due to the high dimensionality of th...

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task...

Canary and Blue-Green Deployments for ML Models

Safe model rollout strategies - canary deployments for gradual traffic migration, blue-green for instant switch, and automated rollback triggers.

CanvasAgent: Enabling Complex Image Creation and Editing via Visual Tool Orchestration

Complex image creation and editing often require more than a single generation or editing model. A user request may involve synthesizing images, localiz...

Cards Against Contamination: TCG-Bench for Difficulty-Scalable Multilingual LLM Reasoning.

Cards Against Contamination: TCG-Bench for Difficult... - published at EACL 2026.

Cascade and Funnel Architecture

How multi-stage ranking systems reduce millions of candidates to a final ranked list within strict latency budgets - the architecture behind every major search and recommendation system.

CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment

Large language models (LLMs) have become a central foundation of modern artificial intelligence, yet their lifecycle remains constrained by a rigid sepa...

Case Studies: Production LLM Systems

Five detailed production LLM architectures - GitHub Copilot, Notion AI, customer support bots, enterprise RAG, and code review agents - with real architecture decisions, scale numbers, and lessons learned.

Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision

Evidence-grounded reasoning requires more than attaching retrieved text to a prediction: a model should make decisions that depend on whether the provid...

Category-Level 3D Correspondence in Camera Space via Morphable Object Priors

Understanding 3D objects from images is fundamental to robotics and AR/VR applications. While recent work has made progress in category-level pose estim...

Causal Cellular Context Transfer Learning (C3TL): An Efficient Architecture for Prediction of Unseen Perturbation Effects

Predicting the effects of chemical and genetic perturbations on quantitative cell states is a central challenge in computational biology, molecular medi...

Causal Interpretation of Neural Network Computations with Contribution Decomposition

Understanding how neural networks transform inputs into outputs is crucial for interpreting and manipulating their behavior. Most existing approaches an...

Causal Language Modeling and GPT

Learn how GPT-style autoregressive models work, the evolution from GPT-1 to GPT-4, sampling strategies, and why causal LM became the dominant paradigm for LLMs.

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates bo...

CausalDS: Benchmarking Causal Reasoning in Data-Science Agents

Large language models (LLMs) increasingly act as integrated data-science agents, combining abstract reasoning with advanced tool use. Yet the relevant b...

Causality Elicitation from Large Language Models

Large language models (LLMs) are trained on enormous amounts of data and encode knowledge in their parameters. We propose a pipeline to elicit causal re...

Cerebras Wafer Scale Engine

How Cerebras builds the world's largest chip by using the entire silicon wafer as one device, eliminating inter-chip communication overhead for large model training and delivering linear scaling without distributed training frameworks.

Certified and accurate computation of function space norms of deep neural networks

Neural network methods for PDEs require reliable error control in function space norms. However, trained neural networks can typically only be probed at...

CFSP: An Efficient Structured Pruning Framework for LLMs with Coarse-to-Fine Activation Information.

CFSP: An Efficient Structured Pruning Framework for... - published at COLING 2025.

CGGS: Consistency-Augmented Geometric Gaussian Splatting for Ego-centric 3D Scene Generation

Challenges remain in ego-centric 3D scene generation due to limited view overlap and the dominant influence of individual perspectives on scene interpre...

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving...

Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. Howeve...

Chain-of-Thought Prompting

Learn how to unlock multi-step reasoning in LLMs by making them think out loud - and why this simple technique dramatically improves accuracy on complex tasks.

Chain-of-Thought Reasoning at Inference Time

How chain-of-thought prompting transforms model reasoning - from the Wei et al. 2022 breakthrough to self-consistency, process supervision, and the faithfulness problem.

Challenges of Evaluating Agents

Why evaluating agentic systems is fundamentally harder than evaluating static models - the multi-path problem, compound errors, latent failures, and how to build an evaluation mindset.

Channel-wise Vector Quantization

We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise tokens. Unlike...

Characterization of Gaussian Universality Breakdown in High-Dimensional Empirical Risk Minimization

We study high-dimensional convex empirical risk minimization (ERM) under general non-Gaussian data designs. By heuristically extending the Convex Gaussi...

Chasing the Public Score: User Pressure and Evaluation Exploitation in Coding Agent Workflows

Frontier coding agents are increasingly used in workflows where users supervise progress primarily through repeated improvement of a public score, namel...

Chat2Scenic: An Iterative RAG-Based Framework for Scenario Generation in Autonomous Driving

Validating autonomous driving systems requires diverse, regulation-compliant test scenarios. In simulation-based testing, scenarios are defined as execu...

Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and cont...

Chem-PerturBridge: a harmonized compendium of small molecule perturbation transcriptomic effects

Large perturbation models require training data encompassing chemical, cellular, and assay diversity. Current transcriptomic resources for small-molecul...

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follow...

Choosing an Orchestrator

A decision framework for selecting the right ML pipeline orchestrator - comparing Airflow, Prefect, Kubeflow Pipelines, Metaflow, ZenML, and Dagster across team size, maturity, and infrastructure requirements.

Choosing an Orchestrator for Your AI Data Stack

What Airflow, Prefect, Dagster, and Temporal each do for AI systems, when your ML pipeline complexity and team maturity dictate which orchestrator fits best, and how to apply a structured decision framework to select the right tool for production AI data pipelines.

Choosing Custom Silicon vs GPUs

A complete decision framework for AI accelerator selection - how to evaluate NVIDIA GPUs, TPUs, Trainium, Gaudi, Groq, and custom ASICs across workload fit, TCO, ecosystem maturity, and team capability.

Chunk-wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text

We propose Chunk-wise Attention Transducer (CHAT), a novel extension to RNN-T models that processes audio in fixed-size chunks while employing cross-att...

CI/CD for ML

Build automated CI/CD pipelines for machine learning - from unit tests on transforms to canary deployments - so model degradation gets caught before it reaches users.

CI/CD for ML vs Software

Understand why standard software CI/CD is insufficient for ML and what additional stages you need to catch real failures.

CineMobile: On-Device Image-to-Video Diffusion for Cinematic Camera Motion Generation

The growing demand for image-to-video creation on mobile devices has increasingly focused on cinematic motion effects like bullet time, dolly zoom, slow...

CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video...

Classes and Objects - Python's Object Model at Engineering Depth

Understand Python classes and objects at the engineering level - class vs instance namespace, attribute resolution, type as metaclass, class body execution, and the shared mutable attribute trap.

Classifier-Free Guidance - Steering Diffusion with Text

Complete derivation of CFG from classifier guidance through the Ho-Salimans implicit classifier insight - the guidance scale trade-off, negative prompting mechanics, dynamic thresholding, CFG++ variants, and production sampling implementations.

Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Y...

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existi...

ClawArena: Benchmarking AI Agents in Evolving Information Environments

AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered a...

ClawBench: Can AI Agents Complete Everyday Online Tasks?

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unso...

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what...

ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and ke...

ClawNet: Human-Symbiotic Agent Network for Cross-User Autonomous Cooperation

Current AI agent frameworks have made remarkable progress in automating individual tasks, yet all existing systems serve a single user. Human productivi...

ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluatin...

Clean Architecture - Dependencies Point Inward

Implement Uncle Bob's Clean Architecture in Python with proper layering, the dependency rule, domain models, service layers, repositories, and framework boundaries.

CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. Unified mult...

Clinical NLP and EHR Systems

Building NLP pipelines on Electronic Health Records - named entity recognition for clinical text, negation detection, de-identification for HIPAA compliance, and fine-tuning BERT variants on medical corpora.

CLIP and Contrastive Learning

How CLIP uses contrastive learning on 400M image-text pairs to build a shared semantic embedding space - enabling zero-shot classification without labeled data.

CLoPA: Continual Low Parameter Adaptation of Interactive Segmentation for Medical Image Annotation

Interactive segmentation enables clinicians to guide annotation, but existing zero-shot models like nnInteractive fail to consistently reach expert-leve...

Closures Deep Dive - Free Variables, Cell Objects, and nonlocal

Master Python closures at CPython depth - free variables, cell objects, __closure__, co_freevars, the UnboundLocalError trap, the nonlocal keyword, late binding, factory functions, memoization, and when to use a closure vs a class.

Cloud Cost Management

Implement full FinOps practice for ML teams - from commitment-based discounts and tagging strategies to budget alerts and spot instance automation.

Cloud FinOps for ML

Financial operations for ML cloud spend - FinOps maturity model, reserved instances, spot strategy, multi-account cost attribution, and ML budget forecasting.

Cloud ML Cost Optimization

Master cloud cost management for ML workloads - spot instance strategies, storage optimization, inference cost reduction, FinOps tooling, and real-world cost reduction from $80K to $31K/month.

Cloud vs On-Prem GPU Infrastructure

Total cost of ownership analysis for cloud GPU instances vs on-premises clusters, break-even analysis, spot instance economics, Kubernetes GPU scheduling, and FinOps strategies for GPU compute at scale.

CNN Architectures - AlexNet to ResNet, EfficientNet, and ConvNeXt

The full evolution of CNN architectures from handcrafted features to AlexNet, VGG, GoogLeNet, ResNet, EfficientNet, and ConvNeXt - with the engineering story behind every breakthrough.

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

Long-horizon interactive environments are a testbed for evaluating agents' skill-usage abilities. These environments demand multi-step reasoning, the cha...

CocoaBench: Evaluating Unified Digital Agents in the Wild

LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and...

Code and Math Specialized Models

How domain-specific pre-training and fine-tuning on code and math data produces models that outperform general LLMs on programming and reasoning tasks - and when to use them in production.

Code Coverage - Measuring What You Test (and What You Miss)

Master code coverage at engineering depth - line vs branch vs condition coverage, coverage.py internals with sys.settrace, pytest-cov, .coveragerc configuration, pragma no cover, coverage in CI, and mutation testing with mutmut to find tests that pass but don't catch bugs.

Code Generation Evaluation

Evaluating LLMs on code generation tasks - HumanEval, MBPP, LiveCodeBench, SWE-bench, pass@k metric, EvalPlus, execution-based evaluation, security testing, and building sandboxed evaluation environments.

Code World Model Preparedness Report

This report documents the preparedness assessment of Code World Model (CWM), a model for code generation and reasoning about code from Meta. We conducte...

Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers

Code-switching is a pervasive linguistic phenomenon in global communication, yet modern information retrieval systems remain predominantly designed for,...

CodeGenWrangler: Data Wrangling task automation using Code-Generating Models.

CodeGenWrangler: Data Wrangling task automation usin... - published at NAACL 2025.

CodeTaxo: Enhancing Taxonomy Expansion with Limited Examples via Code Language Prompts.

CodeTaxo: Enhancing Taxonomy Expansion with Limited... - published at ACL 2025.

CodeTracer: Towards Traceable Agent States

Code agents are advancing rapidly, but debugging them is becoming increasingly difficult. As frameworks orchestrate parallel tool calls and multi-stage...

Coevolving Representations in Joint Image-Feature Diffusion

Joint image-feature generative modeling has recently emerged as an effective strategy for improving diffusion training by coupling low-level VAE latents...

Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems

Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of LLMs, yet a fundamental limitation remains: models cannot...

Cognitive Kernel: An Open-source Agent System towards Generalist Autopilots.

Cognitive Kernel: An Open-source Agent System toward... - published at NAACL 2025.

CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval

Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user queries arrive in colloquial, often underspecified language, while the...

CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

Synthesizing human--object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, curren...

COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics

Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches face a funda...

Collaborative Filtering - How Netflix Knows You Better Than You Know Yourself

Learn how user-based and item-based collaborative filtering work from first principles - the math behind cosine similarity and Pearson correlation, how Amazon's item-to-item CF changed the industry, and how to build production-grade recommendation engines.

COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation

LLM agents are increasingly expected not only to complete isolated tasks, but also to carry bounded representations of human expertise, judgment, and in...

CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation

Customized image editing aims to equip pre-trained diffusion models with specific visual effects using limited paired data, typically via Low-Rank Adapt...

Collective Kernel EFT for Pre-activation ResNets

In finite-width deep neural networks, the empirical kernel $G$ evolves stochastically across layers. We develop a collective kernel effective field theo...

Colored Noise Diffusion Sampling

Diffusion models achieve state-of-the-art image synthesis, with their generative trajectories fundamentally exhibiting a spectral bias, resolving low-fr...

Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution

Generative priors in Image Super-Resolution (SR) often compromise faithful restoration; we attribute this limitation to a fundamental spectral misalignm...

Combee: Scaling Prompt Learning for Self-Improving Language Model Agents

Recent advances in prompt learning allow large language model agents to acquire task-relevant knowledge from inference-time context without parameter ch...

ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models

In this paper, we study an under-explored but important factor of diffusion generative models, i.e., the combinatorial complexity. Data samples are gene...

Comparing and Selecting Models

Systematic model comparison and selection - metric design, statistical significance testing, champion-challenger frameworks, and making defensible production promotion decisions.

Comparing Classical and Quantum Variational Classifiers on the XOR Problem

Quantum machine learning applies principles such as superposition and entanglement to data processing and optimization. Variational quantum models opera...

COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling

Large language models (LLMs) often exhibit performance disparities across languages, with naive multilingual fine-tuning frequently degrading performanc...

Competition-Aware CPC Forecasting with Near-Market Coverage

Cost-per-click (CPC) in paid search is a volatile auction outcome generated by a competitive landscape that is only partially observable from any single...

Complexity Analysis for ML Engineers

Learn how Big-O notation, time and space complexity, and amortized analysis apply directly to ML systems - from understanding why O(n^2) attention broke transformers to profiling GPU kernels.

Compliance Monitoring Systems

Regulatory change detection, gap analysis automation, policy compliance checking, and building AI systems that track regulatory requirements across jurisdictions.

Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elici...

Composition vs Inheritance - When to Use Each at Engineering Depth

Master the is-a vs has-a distinction, understand why "favour composition over inheritance" exists, implement the delegation pattern, use mixins, refactor inheritance to composition, and apply dependency injection with typing.Protocol for structural typing.

Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models

Compositional generalization, the ability to recognize familiar parts in novel contexts, is a defining property of intelligent systems. Although modern...

Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

Recent advances in speech generation have enabled high-fidelity synthesis, yet systematic evaluation of models under long-context conditions remains lar...

Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations.

Compress to Impress: Unleashing the Potential of Com... - published at COLING 2025.

Computer Use Architecture

How Anthropic's Computer Use API works - the screenshot-action loop, the three tools, coordinate systems, and building a working computer use agent with Docker.

Computer Vision for Quality Control

Learn how AI-powered visual inspection systems detect manufacturing defects using anomaly detection, semantic segmentation, and real-time inline inspection pipelines.

Computer Vision Systems

Production computer vision at scale - autonomous vehicle perception with 30 cameras at 100 Hz, real-time object detection, model compression for edge, active learning, and quality metrics.

Computing Equilibrium beyond Unilateral Deviation

Most familiar equilibrium concepts, such as Nash and correlated equilibrium, guarantee only that no single player can improve their utility by deviating...

Concentration and Calibration in Predictive Bayesian Inference

Predictive Bayesian inference (PBI) represents a model- and prior-agnostic approach to standard Bayesian inference which allows users to quantify uncerta...

Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding

Vision-Language Models demonstrate remarkable capabilities but often struggle with compositional reasoning, exhibiting vulnerabilities regarding word or...

Concurrency Primitives

Master mutexes, condition variables, atomics, lock-free programming, and thread pools - the concurrency building blocks behind every high-throughput ML data pipeline and inference server.

Concurrent Image Understanding and Generation: Self-Correcting Coupled Markov Jump Processes

Human cognition does not separate understanding and generation. A teacher at a whiteboard speaks and draws together; each modality reshapes the other. I...

Conditioning Protein Generation via Hopfield Pattern Multiplicity

Protein sequence generation via stochastic attention produces plausible family members from small alignments without training, but treats all stored seq...

CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM

Long-horizon LLM inference turns the key-value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly expensive. M...

Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

Activation oracles aim to make the activations of other models legible to humans and yield promising results compared to white-box interpretability tech...

Configuration Management - Environment-Driven Apps

Externalize and validate application configuration with python-dotenv, pydantic-settings, secrets management, multi-environment configs, and the 12-factor config principle.

Conformal Prediction - Distribution-Free Uncertainty with Guaranteed Coverage

Conformal prediction constructs prediction sets with provable finite-sample coverage guarantees under only the exchangeability assumption - no distributional assumptions required. Complete Python implementation for classification and regression.

Conformalized Neural Networks for Federated Uncertainty Quantification under Dual Heterogeneity

Federated learning (FL) faces challenges in uncertainty quantification (UQ). Without reliable UQ, FL systems risk deploying overconfident models at unde...

CONSCIENTIA: Can LLM Agents Learn to Strategize? Emergent Deception and Trust in a Multi-Agent NYC Simulation

As large language models (LLMs) are increasingly deployed as autonomous agents, understanding how strategic behavior emerges in multi-agent environments...

Consistency and Availability in ML Systems

How CAP theorem, eventual consistency, and training-serving skew apply to ML systems - feature stores, model versioning, multi-region serving, and when consistency actually matters.

Consolidating Rewarded Perturbations for LLM Post-Training

Post-training of language models is commonly framed as a sample-score-update loop implemented by gradient descent. A recent line of work, exemplified by...

Constitutional AI

How Anthropic replaced human feedback with AI feedback guided by explicit principles - the Constitutional AI technique, RLAIF, and how it enables scalable alignment.

Constrained Decoding - How It Works

The mathematics of constrained decoding - finite-state machines, token masking, context-free grammars, and how the Outlines library achieves guaranteed JSON schema conformance at generation time.

Container Registry and CI

Manage ML container images in CI/CD pipelines - registry choices, image tagging, multi-architecture builds, Trivy scanning, and environment promotion workflows.

Containers and Namespaces

How Linux namespaces, cgroups, and overlay filesystems power container isolation for multi-tenant ML serving, GPU workloads, and reproducible training environments.

Content Generation for Education

Learn how LLMs generate educational content - questions, explanations, worked examples, and quizzes - with quality control, Bloom's taxonomy alignment, and hallucination mitigation.

Content-Based Filtering - Recommending by What Items Are Made Of

Learn how content-based filtering builds item feature vectors, constructs user profiles, and scores unseen items using TF-IDF and cosine similarity - no user overlap required.

Context Compression Techniques

How LLMLingua, AutoCompressors, GIST tokens, and selective compression reduce long contexts to fewer tokens while preserving the information needed to answer queries.

Context Management at Scale

Managing context windows, conversation history, and state across sessions - sliding window, summarization compression, hierarchical context, KV cache management, and context budget allocation for production LLM systems.

Context Unrolling in Omni Models

We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representati...

Context Window Extension - YaRN, LongRoPE, LongLoRA

How position interpolation, NTK-aware scaling, YaRN, and LongLoRA extend pretrained models to context windows far beyond their original training length.

Context Window Management

Engineering strategies for managing context windows in production LLM applications - history truncation, compression, RAG ordering, and prompt caching design.

Context-Value-Action Architecture for Value-Driven Large Language Model Agents

Large Language Models (LLMs) have shown promise in simulating human behavior, yet existing agents often exhibit behavioral rigidity, a flaw frequently m...

Continual Learning and Domain Adaptation

Learn how to adapt open-source language models to specialized domains through continual pre-training, manage catastrophic forgetting with EWC and data mixing, and evaluate domain knowledge gain versus general capability loss.

Continuous Adversarial Flow Models

We propose continuous adversarial flow models, a type of continuous-time flow model trained with an adversarial objective. Unlike flow matching, which u...

Continuous Batching

Learn how continuous batching eliminates GPU idle time by replacing finished sequences immediately rather than waiting for the longest request in a batch to complete.

Continuous Eval in CI/CD

Design and implement a full CI/CD pipeline for AI systems - covering PR-level linting, merge-level regression, pre-deployment evaluation gates, production monitoring with statistical process control, anomaly detection, automated rollback, and observability tracing from query to feedback.

Continuous Latent Diffusion Language Model

Large language models have achieved remarkable success under the autoregressive paradigm, yet high-quality text generation need not be tied to a fixed l...

Continuous Orthogonal Mode Decomposition: Haptic Signal Prediction in Tactile Internet

The Tactile Internet demands sub-millisecond latency and ultra-high reliability, as high latency or packet loss could lead to haptic control instability...

Continuous Training

Design continuous training systems that safely update models every few hours - covering CT maturity levels, warm-starting, failure modes, and monitoring.

Continuous-Time Distribution Matching for Few-Step Diffusion Distillation

Step distillation has become a leading technique for accelerating diffusion models, among which Distribution Matching Distillation (DMD) and Consistency...

Contract Analysis and NLP

Clause extraction, obligation detection, risk identification, and building NLP systems for commercial contract analysis at law firm and enterprise scale.

Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

Interpretability tools are increasingly used to analyze failures of Large Language Models (LLMs), yet prior work largely focuses on short prompts or toy...

Controllable Style Arithmetic with Language Models.

Controllable Style Arithmetic with Language Models. - published at ACL 2025.

ControlLight: Towards Controllable, Consistent, and Generalizable Low-Light Enhancement

Existing deep learning-based low-light enhancement methods are typically trained on limited datasets with single enhancement targets, which restricts th...

Convergence of Two-Timescale Markovian Stochastic Approximations with Applications in Reinforcement Learning

This work studies the convergence of two-timescale stochastic approximations (SA), a class of iterative algorithms that update two sets of parameters in...

Convergent Evolution: How Different Language Models Learn Similar Number Representations

Language models trained on natural text learn to represent numbers using periodic features with dominant periods at T=2, 5, 10. In this paper, we identi...

Convex Low-resource Accent-Robust Language Detection in Speech Recognition

Globalization and multiculturalism continue to produce increasingly diverse speech varieties. Yet current spoken dialogue systems frequently fail on und...

Convolutional Neural Networks

From first principles - why CNNs exist, how the convolution operation works, weight sharing, hierarchical feature learning, receptive fields, 1x1 convolutions, and depthwise separable convolutions with PyTorch.

Correcting Split Selection in Online Decision Trees via Anytime-Valid Inference

Bagging-based ensembles, most notably Adaptive Random Forests, are among the strongest performers for learning from data streams. A common denominator a...

Cortex 2.0: Grounding World Models in Real-World Industrial Deployment

Industrial robotic manipulation demands reliable long-horizon execution across embodiments, tasks, and changing object distributions. While Vision-Langu...

CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test

Recently, Reinforcement Learning with Verifiable Rewards (RLVR) and Test-Time Scaling (TTS) have advanced LLM code generation through executable verific...

Cost and Performance Trade-offs in Data Infrastructure

How to reason about the latency-throughput-cost triangle, diagnose expensive Spark jobs, optimize cloud data costs with partitioning and caching, and fix data skew that silently kills pipeline performance.

Cost Attribution and Accountability

Making ML teams own their costs - tagging strategy, per-model cost dashboards, chargeback model design, cost anomaly detection, and engineering incentives for cost efficiency.

Cost Management and Budget Alerts

Track LLM spend per user, team, and feature in real time. Enforce hard budget limits and trigger alerts before costs spiral - because the invoice arrives 30 days too late.

Cost Optimization Patterns

Practical LLM cost reduction - semantic caching, model routing, prompt compression, Anthropic prompt caching, output length control, cost attribution, and monitoring for production AI systems.

Count Anything

Object counting remains fragmented across domain-specific datasets and task formulations, despite rapid progress in generalist vision models. Existing c...

Counterfactual Evaluation

Evaluate new ML policies using logged data from an old policy - inverse propensity scoring, doubly robust estimators, and offline policy evaluation for when A/B tests are too expensive.

Counterfactual Explanations - What Would Have to Change for a Different Decision?

Counterfactual explanations answer 'what would need to change?' - the most actionable form of ML explanation, and the basis for GDPR compliance in automated decision-making.

Counting as a minimal probe of language model reliability

Large language models perform strongly on benchmarks in mathematical reasoning, coding and document analysis, suggesting a broad ability to follow instr...

Counting to Four is still a Chore for VLMs

Vision-language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail on simple grounding skill...

Coverage-Aware Web Crawling for Domain-Specific Supplier Discovery via a Web-Knowledge-Web Pipeline

Identifying the full landscape of small and medium-sized enterprises (SMEs) in specialized industry sectors is critical for supply-chain resilience, yet...

CPCANet: Deep Unfolding Common Principal Component Analysis for Domain Generalization

Domain Generalization (DG) aims to learn representations that remain robust under out-of-distribution (OOD) shifts and generalize effectively to unseen...

cProfile and pstats - Function-Level Profiling

Master deterministic profiling with cProfile and pstats - reading profile output, sorting and filtering results, snakeviz visualization, profiling overhead, and real-world endpoint profiling.

CPU Memory Architecture for ML

How CPU memory hierarchy - L1/L2/L3 caches, DRAM, and NUMA topology - shapes ML data pipelines, DataLoader performance, and large model loading strategies on multi-socket servers.

CPU Pipeline and Instruction Execution

Learn how modern CPUs execute billions of instructions per second through pipelining, out-of-order execution, branch prediction, and superscalar design - and why these details matter for every ML engineer.

CPython Architecture - The Interpreter at Engineering Depth

Understand CPython's architecture at engineering depth - the execution pipeline, the eval loop, PyObject memory layout, integer caching, string interning, the small object allocator, and alternative Python implementations.

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on kno...

Craw4LLM: Efficient Web Crawling for LLM Pretraining.

Craw4LLM: Efficient Web Crawling for LLM Pretraining. - published at ACL 2025.

CreativeGame: Toward Mechanic-Aware Creative Game Generation

Large language models can generate plausible game code, but turning this capability into iterative creative improvement remains difficult. In practice,...

CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing

Recent advances in large language models have led to strong performance on reasoning and environment-interaction tasks, yet their ability for creative p...

CrewAI

CrewAI v0.80+: role-based multi-agent systems with Crew, Agent, Task, Process, and Flow - the most production-friendly multi-agent framework.

CrewAI Multi-Agent Systems

CrewAI in production - agents, tasks, crews, memory systems, Flows, and deep-dive patterns for role-based multi-agent pipelines.

CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

Prior work establishes that controlled contrastiveness between self-generated responses from large language models, set via reward scores, improves down...

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

Video prediction is increasingly viewed as a path toward generalizable world models, yet it remains unclear whether these systems learn underlying causa...

Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video

Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthe...

Cross-scale Aligned Supervision for Training GANs

Modern GANs often introduce adversarial supervision on intermediate generator outputs and interpret the resulting multi-stage synthesis as coarse-to-fin...

Cross-Session Persistence

How to build agents whose memory survives restarts - architecture, storage backends, session restoration, and privacy-aware memory pruning for production systems.

Cross-Space Distillation: Teaching One-Step Students with Modern Diffusion Teachers

Modern one-step diffusion models achieve impressive quality through distribution-based timestep distillation. Yet, they rely on a critical assumption: T...

Cross-Tokenizer LLM Distillation through a Byte-Level Interface

Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains...

Crowded in B-Space: Calibrating Shared Directions for LoRA Merging

Merging separately trained LoRA adapters is a practical alternative to joint multi-task training, but it often hurts performance. Existing methods usual...

Cryptographic Hashing

Master data hashing vs password hashing - hashlib, bcrypt, argon2, salting, timing attacks, constant-time comparison, and why MD5/SHA1 are broken for passwords.

CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either...

CtrlVTON: Controllable Virtual Try-On via Visual-Instance-Prompt Segmentation

Virtual try-on (VTO) has made significant progress in realistically transferring garments onto a target person. Yet most systems give the user little co...

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool-use, and software engineering, yet its exte...

CubePart: An Open-Vocabulary Part-Controllable 3D Generator

Interactive 3D assets used in games and simulation are typically decomposed into specific semantic parts to support animation, physics, and scripted beh...

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong p...

CUDA Programming Model

Learn the CUDA programming model from first principles - host vs device execution, kernel launch syntax, the NVCC compilation pipeline, and how to write and compile your first GPU kernel from Python using torch.utils.cpp_extension.

CUDA Streams and Async Execution

Learn how CUDA streams enable concurrent GPU execution, how to overlap data transfers with computation using double buffering, how CUDA events work for synchronization and timing, and how PyTorch streams integrate with training pipelines for maximum throughput.

CUE-R: Beyond the Final Answer in Retrieval-Augmented Generation

As language models shift from single-shot answer generation toward multi-step reasoning that retrieves and consumes evidence mid-inference, evaluating t...

CUFE@NLU of Devanagari Script Languages 2025: Language Identification using fastText.

CUFE@NLU of Devanagari Script Languages 2025: Langua... - published at COLING 2025.

CUFE@VarDial 2025 NorSID: Multilingual BERT for Norwegian Dialect Identification and Intent Detection.

CUFE@VarDial 2025 NorSID: Multilingual BERT for Norw... - published at COLING 2025.

Cura 1T: Specialized Model for Agentic Healthcare

Healthcare spans high-stakes communication, expert reasoning, and workflow execution, yet specialized LLMs that cover these use cases together remain li...

CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

We introduce CurveBench, a benchmark for hierarchical topological reasoning from visual input. CurveBench consists of 756 images of pairwise non-interse...

Custom Awaitables

Build awaitable objects with the __await__ protocol, understand how coroutines and Futures work under the hood, and create custom async primitives.

Custom Data Monitoring

Building custom monitoring with Great Expectations and statistical tests.

Customer Lifetime Value

CLV prediction with BG/NBD probabilistic models, Gamma-Gamma monetary value, deep learning on purchase sequences, RFM segmentation, and the ML systems that drive acquisition and retention budget decisions.

Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning

Parallel reasoning enhances Large Reasoning Models (LRMs) but incurs prohibitive costs due to futile paths caused by early errors. To mitigate this, pat...

CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation

Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360° field of view from multiple minimally overlapping image...

Cython and C Extensions

Learn how Cython bridges Python and C to deliver C-level performance in Python projects, covering type declarations, typed memoryviews, OpenMP parallelism, and raw C extension modules.

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

The landscape of high-performance image generation models is currently shifting from the inefficient multi-step ones to the efficient few-step counterpa...

D^2-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing

Despite the emergence of diffusion large language models (D-LLMs) as an alternative to autoregressive large language models (AR-LLMs), safety monitoring...

Dagster for Data Assets

Asset-based orchestration, software-defined assets, and Dagster's lineage model.

DARE - Delta Weight Sparsification

How DARE randomly drops delta weights and rescales the remainder to dramatically reduce interference when merging multiple fine-tuned models.

DARE: Diffusion Large Language Models Alignment and Reinforcement Executor

Diffusion large language models (dLLMs) are emerging as a compelling alternative to dominant autoregressive models, replacing strictly sequential token...

DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs

Multi-agent LLM systems improve reasoning by combining outputs from multiple agents, but interaction-heavy methods can introduce error propagation and h...

DASR: Distributed Adaptive Scene Recognition - A Multi-Agent Cloud-Edge Framework for Language-Guided Scene Detection.

DASR: Distributed Adaptive Scene Recognition - A Mul... - published at EMNLP 2025.

Data Catalog and Discovery

Apache Atlas, DataHub, Amundsen - cataloguing data for ML teams.

Data Collection Strategy - Building the Moat Before Training the Model

Learn how to design data collection and labeling strategies that determine a model's fate before a line of training code is written - the most underestimated skill in ML engineering.

Data Contracts

Enforcing data quality agreements between producers and consumers - schema contracts with Pandera and Great Expectations, statistical contracts, SLA contracts, CI integration, and violation alerting.

Data Drift Detection

Detecting when input data distributions change in production - KS test, PSI, chi-squared, Wasserstein distance, MMD, univariate vs. multivariate drift, reference window selection, and EvidentlyAI.

Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving

Large Language Model (LLM) adapters enable low-cost model specialization, but introduce complex caching and scheduling challenges in distributed serving...

Data Engineering with Python

The complete Python toolkit for data engineering - pandas memory optimization, PyArrow columnar processing, DuckDB analytical SQL, Polars lazy evaluation, and pipeline testing with pandera.

Data Governance for AI Training Datasets

What column-level security, data lineage, and cataloguing do for AI systems, when regulated AI training data requires auditability and access controls across the lakehouse, and how to implement governance with Apache Atlas and Unity Catalog in production AI data pipelines.

Data Incident Management

Runbooks, on-call rotations, and root cause analysis for data incidents.

Data Lake and Data Warehouse for ML

The evolution from database to data lake to lakehouse - when to use each storage architecture for ML training data, feature engineering, and model serving.

Data Lake vs Warehouse vs Lakehouse for AI Workloads

What each storage architecture does for AI systems, when ML teams need both raw unstructured data and structured query access on the same platform, and how to choose and implement the right architecture in production AI data pipelines.

Data Lineage

Column-level lineage, impact analysis, and tools like OpenLineage and DataHub.

Data Modelling for ML

How to design data models for machine learning - point-in-time correctness, entity-centric tables, SCD Type 2, label leakage prevention, and the training-serving skew problem.

Data Pipeline Patterns for AI/ML Workflows

ETL vs ELT, Lambda vs Kappa architecture, idempotency, exactly-once semantics, backfill strategies, watermarking for late data, and how to design pipelines that reliably serve both model training and real-time inference.

Data Platform Cost Optimisation for AI Teams

What query optimisation, storage tiering, and cloud cost controls do for AI systems, when large-scale model training and feature computation drive unpredictable cloud spend, and how to implement cost reduction strategies in production AI data pipelines.

Data Poisoning

Attacks that corrupt training or fine-tuning data to embed backdoors, trigger unexpected behaviors, or degrade model performance in production.

Data Quality and Filtering

Systematic approaches to filtering synthetic data for quality, diversity, safety, and alignment - the layered pipeline that separates fine-tuned models that work from models that regress.

Data Quality and Validation for ML

Why data quality is the number-one cause of ML failures in production - Great Expectations, data contracts, PSI distribution monitoring, and pipeline quality gates.

Data Serialization and Schemas

Why serialization format is an architectural decision - JSON vs Protocol Buffers vs Avro, schema evolution strategies, and how Confluent Schema Registry prevents breaking production pipelines.

Data Structures for ML Systems

Data structures for ML infrastructure - trie for tokenizers, HNSW for vector search, inverted index for retrieval, LSM trees for feature stores, and product quantization for memory-efficient vector storage.

Data Systems for ML - The Foundation Layer

The complete ML data stack - from raw storage through feature engineering to model training and serving, including data lakes, warehouses, lakehouses, and temporal joins.

Data Versioning with Delta Lake

ACID transactions, time travel, schema evolution, and training data versioning with Delta Lake - building reproducible ML pipelines on object storage.

Data-Efficient Non-Gaussian Semi-Nonparametric Density Estimation for Nonlinear Dynamical Systems

Accurate representation of non-Gaussian distributions of quantities of interest in nonlinear dynamical systems is critical for estimation, control, and...

Databricks

Databricks Lakehouse, Unity Catalog, MLflow integration, and AutoML.

Databricks for MLOps

Master the Databricks Lakehouse platform for ML - Delta Lake, Unity Catalog, Feature Store, MLflow Model Registry, Model Serving, and Spark-scale feature pipelines for production ML.

Dataclasses - Code Generation, Immutability, and Production Patterns

Master Python's @dataclass decorator at engineering depth - what it generates, field() and default_factory, frozen=True for immutability, __post_init__ for validation, ClassVar vs InitVar, inheritance with dataclasses, ordering, and production patterns in FastAPI and config systems.

Dataset Curation for Fine-Tuning

How to build high-quality fine-tuning datasets - sourcing, deduplication, quality filtering, LLM-as-judge scoring, and a complete curation pipeline. Why 5K curated examples beat 500K raw ones.

Dataset Lineage and Management

Tracking dataset provenance, preventing train/val/test leakage, stratified splitting, dataset registries, and discovering the CV team's 12% accuracy inflation from augmentation leakage.

DBSCAN and Density-Based Clustering

Master DBSCAN, OPTICS, HDBSCAN, and Mean Shift - density-based clustering algorithms that discover arbitrarily shaped clusters, handle varying densities, and identify anomalies without specifying the number of clusters.

dbt Advanced Patterns for ML Teams

Advanced dbt techniques for large-scale ML pipelines - snapshots for SCD2, point-in-time correct features, slim CI, dbt-utils macros, and production deployment patterns.

dbt for ML Feature Preparation

How dbt brings lineage, testing, documentation, and version control to SQL-based ML data pipelines, replacing fragile cron-driven script chains.

DDIM and Accelerated Diffusion Sampling

How DDIM reduces 1000-step DDPM sampling to 10-50 steps via a non-Markovian process, the eta parameter, DDIM inversion for image editing, and DPM-Solver as the current production standard.

DDPMs - The Mathematical Foundation of Diffusion Models

The complete mathematical derivation of Denoising Diffusion Probabilistic Models - forward process, reverse process, ELBO objective, noise schedule comparison, U-Net architecture, and why predicting noise works better than predicting clean images.

Debate and Critique Patterns

How LLMs critiquing each other improves quality: verifier/critic patterns, multi-agent debate, ensemble approaches, and convergence detection.

Decentralized Proximal Stochastic Gradient Langevin Dynamics

We propose Decentralized Proximal Stochastic Gradient Langevin Dynamics (DE-PSGLD), a decentralized Markov chain Monte Carlo (MCMC) algorithm for sampli...

Decentralized Ranking Aggregation: Gossip Algorithms for Borda and Copeland Consensus

The concept of ranking aggregation plays a central role in preference analysis, and numerous algorithms for calculating median rankings, often originati...

DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory

Recent advances in video generative models have promoted rapid progress in controllable world models. However, maintaining fine-grained spatio-temporal...

Decorators - Wrapping Callables at Engineering Depth

Master Python decorators at full engineering depth - functools.wraps, decorator factories with three-level nesting, class-based decorators, stacking order, production patterns (timing, retry, caching, rate limiting), and how FastAPI/Flask route decorators work under the hood.

Decoupled Descent: Exact Test Error Tracking Via Approximate Message Passing

In modern parametric model training, full-batch gradient descent (and its variants) suffers due to progressively stronger biasing towards the exact real...

Decoupling Communication from Policy: Robust MARL under Bandwidth Constraints

Communication enables coordination in multi-agent reinforcement learning (MARL), but many real-world applications, e.g., search-and-rescue with drone sw...

Deep Autocorrelation Modeling for Time-Series Forecasting: Progress and Prospects

Autocorrelation is a defining characteristic of time-series data, where each observation is statistically dependent on its predecessors. In the context...

Deep ensemble graph neural networks for probabilistic cosmic-ray direction and energy reconstruction in autonomous radio arrays

Using advanced machine learning techniques, we developed a method for reconstructing precisely the arrival direction and energy of ultra-high-energy cos...

Deep Q-Networks (DQN)

Scale Q-learning to high-dimensional inputs with neural networks. Learn the DQN architecture, experience replay, target networks, Double DQN, Dueling DQN, Prioritized Experience Replay, and Rainbow. Full PyTorch implementation included.

DeepLoop: Depth Scaling for Looped Transformers

Looped Transformers scale sequential computation by applying a compact stack of physical blocks for multiple rounds, increasing unrolled depth without i...

DeepSeek MoE Architecture

DeepSeek's innovations in mixture of experts - fine-grained experts, shared experts, DeepSeek-V2 and V3, multi-token prediction, and training for $6M.

DeepSeek-R1 - Open Source Reasoning

How DeepSeek built an open-weights reasoning model using pure RL with GRPO, the R1-Zero experiment, distillation into smaller models, and what open-source reasoning means for the research community.

DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures

Transformer models are widely deployed in critical AI applications, yet faults in their attention mechanisms, projections, and other internal components...

Defending Quantum Classifiers against Adversarial Perturbations through Quantum Autoencoders

Machine learning models can learn from data samples to carry out various tasks efficiently. When data samples are adversarially manipulated, such as by...

Delta Lake

Delta Lake on Databricks, merge operations, and Change Data Capture.

Delta Lake and Iceberg for ML

Delta Lake as ML data infrastructure - ACID transactions, time travel, schema evolution, Delta + MLflow integration, OPTIMIZE/Z-ordering, and handling schema changes without breaking pipelines.

Demand Forecasting Systems

Hierarchical time series forecasting at retail scale - classical methods, gradient boosting, deep learning with TFT, and the engineering behind forecasting millions of SKUs in real time.

DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling.

DEMO: Reframing Dialogue Interaction with Fine-grain... - published at ACL 2025.

DEMON: Diffusion Engine for Musical Orchestrated Noise

We present DEMON, a real-time diffusion engine that makes the denoising process playable as a live musical instrument: a control surface both broad (man...

Demystifying On-Policy Distillation: Roles, Pathologies, and Regulations

On-policy distillation (OPD) has become a key paradigm in LLM post-training, yet its training dynamics remain poorly understood. We present a systematic...

Demystifying When Pruning Works via Representation Hierarchies

Network pruning, which removes less important parameters or architectures, is often expected to improve efficiency while preserving performance. However...

DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

Reinforcement learning has become a central paradigm for advancing reasoning in large language models, yet most existing methods still depend on stronge...

DeonticBench: A Benchmark for Reasoning over Rules

Reasoning with complex, context-specific rules remains challenging for large language models (LLMs). In legal and policy settings, this manifests as deo...

Dependency Injection - Decoupling Components

Master dependency injection in Python from manual constructor injection to DI containers and FastAPI Depends, with testing strategies and architectural trade-offs.

Dependency Management and Packaging

Master Python packaging from pyproject.toml and uv to Docker layer caching, private registries, and the CUDA version compatibility matrix that determines whether your ML environment actually works.

Deploying Quantized Models in Production

End-to-end guide for production deployment of quantized LLMs - format selection, serving stack configuration, latency SLAs, A/B testing, quality monitoring, and rollback strategy.

Descriptors - The Protocol That Powers Python's Object Model

Master the descriptor protocol - __get__, __set__, __delete__, data vs non-data descriptors, the complete attribute lookup algorithm, and how property, classmethod, staticmethod, and bound methods work under the hood.

Design Experiments to Compare Multi-armed Bandit Algorithms

Online platforms routinely compare multi-armed bandit algorithms, such as UCB and Thompson Sampling, to select the best-performing policy. Unlike standa...

Design Patterns in Python - Idiomatic Implementations for Production Code

Master the most important GoF design patterns in idiomatic Python - Singleton, Factory, Abstract Factory, Strategy, Observer, Decorator, Registry, and Builder. Each pattern covers the GoF intent, a Pythonic implementation, and real framework usage.

Designing a Content Moderation System

End-to-end design of a large-scale content moderation system - covering multi-modal ML pipelines, human review integration, active learning, adversarial robustness, and platform-scale architecture.

Designing a Fraud Detection System at Scale

End-to-end design of a real-time fraud detection system - covering feature engineering, imbalanced learning, streaming scoring, delayed labels, and graph-based fraud ring detection.

Designing a Recommendation System at Scale

End-to-end design of a recommendation system serving billions of items to millions of users - covering two-stage architecture, candidate generation, ranking, cold start, and serving at scale.

Designing a Search Ranking System

End-to-end design of a production search ranking system - covering query understanding, BM25 + dense retrieval, Learning to Rank, semantic reranking, and A/B testing metrics.

Detecting and Suppressing Reward Hacking with Gradient Fingerprints

Reinforcement learning with verifiable rewards (RLVR) typically optimizes for outcome rewards without imposing constraints on intermediate reasoning. Th...

DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

Recent advances in video generative models enable the synthesis of realistic human-object interaction videos across a wide range of scenarios and object...

DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo

Achieving human-level manipulation requires dexterous robotic hands capable of complex object interactions. Advancing such capabilities further demands...

DGX and HGX System Design

NVIDIA DGX H100 and HGX reference designs - 8-GPU NVLink mesh, NVSwitch fabric, PCIe host bridge, ConnectX InfiniBand, power and cooling requirements, DGX SuperPOD scale-out, and topology-aware NCCL configuration for maximum distributed training throughput.

Different Time, Different Language: Revisiting the Bias Against Non-Native Speakers in GPT Detectors.

Different Time, Different Language: Revisiting the B... - published at EACL 2026.

Differentiable Zero-One Loss via Hypersimplex Projections

Recent advances in machine learning have emphasized the integration of structured optimization components into end-to-end differentiable models, enablin...

DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction

Neural representations (NRs), such as neural fields and 3D Gaussians, effectively model volumetric data in computed tomography (CT) but suffer from seve...

Diffusion Model as a Generalist Segmentation Learner

Diffusion models are primarily trained for image synthesis, yet their denoising trajectories encode rich, spatially aligned visual priors. In this paper...

Diffusion Models

How denoising diffusion models learn to reverse a Gaussian noise process to generate high-quality images from text prompts.

Diffusion Models Beyond Images - Audio, Video, 3D, Molecules, Text

How the diffusion framework generalizes across modalities - from waveform audio synthesis to protein structure prediction, video generation, 3D scene creation, time series, and text - with the architectural changes each domain requires.

Digital Twins and Simulation

Learn how digital twins combine physics-based simulation with machine learning to create virtual replicas of manufacturing systems for prediction, optimization, and what-if analysis.

DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain

Recent advancements in Vision-Language Models (VLMs) have revolutionized general visual understanding. However, their application in the food domain rem...

Direct Bayesian Additive Regression Trees for Conditional Average Treatment Effects in Regression Discontinuity Designs

Regression discontinuity designs (RDD) are widely used for causal inference. In many empirical applications, treatment effects vary substantially with c...

Direct Preference Optimisation - RLHF Without the RL

DPO: how Rafailov et al. (2023) showed that the KL-constrained RLHF objective has a closed-form optimal policy - no reward model, no PPO, just supervised training on preference pairs.

Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode thr...

Disassembly with dis - Reading CPython Bytecode

Master Python bytecode disassembly with the dis module at engineering depth - reading disassembly output, key opcodes explained, value stack evolution, comparing equivalent Python patterns at the instruction level, and practical performance insights.

Discovering Thermodynamically Admissible Dissipation Potentials via Grammar-Based Symbolic Regression

Constitutive laws for inelastic materials must satisfy strict thermodynamic admissibility requirements, yet current data-driven approaches sacrifice int...

Discrete Diffusion Models: A Unified Framework from Tokenization to Generation

Discrete denoising diffusion models (DDMs) have recently emerged as a compelling alternative to autoregressive (AR) modeling for discrete data, offering...

Dissecting Quantization Error: A Concentration-Alignment Perspective

Quantization can drastically increase the efficiency of large language and vision models, but typically incurs an accuracy drop. Recently, function-pres...

Distillation Datasets

Building distillation datasets: capturing frontier model knowledge, reasoning traces, and calibration into training data for smaller, efficient models - from Orca to Phi.

Distributed Training Strategies

Master data parallelism (DDP, FSDP), tensor parallelism, pipeline parallelism, 3D parallelism, gradient accumulation, all-reduce communication, and bandwidth requirements for training large models.

Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems

Claude Code is an agentic coding tool that can run shell commands, edit files, and call external services on behalf of the user. This study describes it...

Diverse Dictionary Learning

Given only observational data X = g(Z), where both the latent variables Z and the generating process g are unknown, recovering Z is ill-posed without ad...

DIVINE : Coordinating Multimodal Disentangled Representations for Oro-Facial Neurological Disorder Assessment.

DIVINE : Coordinating Multimodal Disentangled Repres... - published at EACL 2026.

DMax: Aggressive Parallel Decoding for dLLMs

We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressi...

dMoE: dLLMs with Learnable Block Experts

Diffusion Large Language Models (dLLMs) have recently emerged as a promising alternative to autoregressive models, offering competitive performance whil...

DNS, Service Discovery, and Consul

Master DNS and service discovery for distributed ML systems - DNS resolution chains, Kubernetes CoreDNS, Consul service mesh, etcd coordination, and how ML serving clusters register and find model endpoints dynamically.

Do AI Coding Agents Log Like Humans? An Empirical Study

Software logging is essential for maintaining and debugging complex systems, yet it remains unclear how AI coding agents handle this non-functional requ...

Do Audio-Visual Large Language Models Really See and Hear?

Audio-Visual Large Language Models (AVLLMs) are emerging as unified interfaces to multimodal perception. We present the first mechanistic interpretabili...

Do Image-Text Metrics Respect Semantic Invariances?

Do Image-Text Metrics Respect Semantic Invariances? - published at ACL 2026.

Do Sparse Autoencoders Capture Concept Manifolds?

Sparse autoencoders (SAEs) are widely used to extract interpretable features from neural network representations, often under the implicit assumption th...

Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding

We benchmark how internal reasoning traces, which we call thought streams, affect video scene understanding in vision-language models. Using four config...

Docker and Containerized Local Inference

Running LLMs in Docker containers for reproducibility and deployment portability. NVIDIA Container Toolkit, Ollama and vLLM Docker images, multi-stage builds, and Docker Compose for a full local AI stack.

Docker Compose for ML Development

Build a complete local ML development environment with Docker Compose - training, serving, feature store, and monitoring all running with a single command.

Docker for ML

Learn Docker fundamentals from an ML perspective - why containers matter, how to write effective Dockerfiles, and how to manage ML model files in containers.

Document Chunking Strategies

Master the art and science of splitting documents into chunks that maximize retrieval precision - the most underestimated decision in RAG system design.

Document Ingestion and Chunking

Master every chunking strategy from fixed-size to semantic and structure-aware splitting. Learn how to parse PDFs, DOCX, and HTML, enrich metadata, evaluate chunk quality, and build a production-grade ingestion pipeline.

Document Review at Scale

e-Discovery, technology-assisted review (TAR), predictive coding, and building ML systems that process millions of documents for legal discovery in weeks instead of years.

Does Generative AI speak Nigerian-Pidgin?: Issues about Representativeness and Bias for Multilingualism in LLMs.

Does Generative AI speak Nigerian-Pidgin?: Issues ab... - published at NAACL 2025.

Does RAG Introduce Unfairness in LLMs? Evaluating Fairness in Retrieval-Augmented Generation Systems.

Does RAG Introduce Unfairness in LLMs? Evaluating Fa... - published at COLING 2025.

Does Seeing More Mean Knowing More? Mono-Anchored Advantage Normalization for Multi-Source Visual Reasoning

Visual reasoning through reinforcement learning with verifiable rewards (RLVR) has achieved remarkable progress. However, when dealing with multi-source...

Does Synthetic Layered Design Data Benefit Layered Design Decomposition?

Recent advances in image generation have made it easy to produce high-quality images. However, these outputs are inherently flattened, entangling foregr...

Domain-Specific Latent Representations Improve the Fidelity of Diffusion-Based Medical Image Super-Resolution

Latent diffusion models for medical image super-resolution universally inherit variational autoencoders designed for natural photographs. We show that t...

Don't Retrieve, Navigate: Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG

Retrieval-Augmented Generation (RAG) grounds LLM responses in external evidence but treats the model as a passive consumer of search results: it never s...

DPO and Modern Alignment Techniques

Direct Preference Optimization and its successors - how DPO eliminates the need for a separate reward model and RL training, plus IPO, KTO, SimPO, and ORPO.

DPO: Direct Preference Optimization

Master DPO - the elegant insight that you can optimize LLMs for human preferences without training a reward model or running RL, derived directly from the optimal RLHF policy.

DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

Edge-scale deep research agents based on small language models are attractive for real-world deployment due to their advantages in cost, latency, and pr...

DR³-Eval: Towards Realistic and Reproducible Deep Research Evaluation

Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report genera...

DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

Large language models are increasingly deployed in multi-turn interactive settings where users or environments can iteratively provide lightweight feedb...

Driving Chinese Spelling Correction from a Fine-Grained Perspective.

Driving Chinese Spelling Correction from a Fine-Grai... - published at COLING 2025.

Dropout and Regularization

Complete guide to dropout mechanics and inverted scaling, L1 vs L2 regularization and weight decay math, Monte Carlo Dropout for uncertainty, Batch Normalization as an implicit regularizer, label smoothing cross-entropy derivation, DropConnect and DropPath variants, and a production-quality regularized training loop in PyTorch.

Drug Discovery with AI

How AI accelerates pharmaceutical research - AlphaFold protein structure prediction, graph neural networks for molecular property prediction, generative chemistry, and virtual screening for drug candidates.

DrugGen 2: A disease-aware language model for enhancing drug discovery

Current computational approaches for drug design typically focus on generating molecules conditioned on specific targets or general molecular properties...

DSBD: Dual-Aligned Structural Basis Distillation for Graph Domain Adaptation

Graph domain adaptation (GDA) aims to transfer knowledge from a labeled source graph to an unlabeled target graph under distribution shifts. However, ex...

DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation

Speculative decoding accelerates Large Language Model (LLM) inference by decoupling draft generation from target verification. While recent parallel dra...

DSWorld: A Data Science World Model for Efficient Autonomous Agents

Despite strong capabilities in data understanding and decision-making, autonomous data science agents still heavily rely on trial-and-error workflows th...

Dual Debiasing for Noisy In-Context Learning for Text Generation.

Dual Debiasing for Noisy In-Context Learning for Tex... - published at ACL 2025.

Dual Latent Memory in Vision-Language-Action Models for Robotic Manipulation

Mainstream Vision-Language-Action (VLA) models predict actions primarily from the current observation under a Markovian assumption, thus struggling with...

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-st...

Dual-View Training for Instruction-Following Information Retrieval

Instruction-following information retrieval (IF-IR) studies retrieval systems that must not only find documents relevant to a query, but also obey expli...

Dunder Methods - Python's Protocol System at Engineering Depth

Master Python's dunder (double-underscore) method system - comparison protocols, arithmetic operators, container protocols, context managers, callable objects, and attribute access hooks. Learn how Python's syntax maps to method calls.

DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios

Real-world data visualization (DV) requires native environmental grounding, cross-platform evolution, and proactive intent alignment. Yet, existing benc...

DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning

Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative P...

DVC: Data Version Control

DVC in production - pointer files, remote storage, pipeline definitions (dvc.yaml), caching, dvc repro, CI/CD integration, and versioning 500GB datasets without bloating git.

dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

Evaluating robotics policies across thousands of environments and thousands of tasks is infeasible with existing approaches. This motivates the need for...

DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation

Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built u...

Dynamic Class Creation - Building Classes at Runtime

Master the type() three-argument form, the full class creation pipeline, code generation with exec and compile, namedtuple internals, __prepare__, and building DSLs that generate Python classes at runtime.

Dynamic Pricing Models

Price elasticity estimation, competitor-aware pricing, markdown optimization for seasonal goods, causal inference for pricing decisions, and the ML systems behind Amazon's real-time repricing engine.

Dynamic Programming for ML

Dynamic programming patterns in ML - edit distance for NLP evaluation, Viterbi decoding for sequence labeling, CTC for speech recognition, dynamic time warping, beam search, Bellman equations in reinforcement learning, and DP in autoregressive generation.

Dynamic Programming for RL

Policy evaluation, policy iteration, and value iteration - solving MDPs exactly when you know the environment model. Master the theoretical foundation that all model-free RL approximates.

EarlyTom: Early Token Compression Completes Fast Video Understanding

Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is stil...

EASE: Federated Multimodal Unlearning via Entanglement-Aware Anchor Closure

Federated Multimodal Learning (FML) trains multimodal models across decentralized clients while keeping their image-text pairs private. However, joint e...

EasyDistill: A Comprehensive Toolkit for Effective Knowledge Distillation of Large Language Models.

EasyDistill: A Comprehensive Toolkit for Effective K... - published at EMNLP 2025.

EasyVideoR1: Easier RL for Video Understanding

Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large languag...

EB-RANSAC: Random Sample Consensus based on Energy-Based Model

Random sample consensus (RANSAC), which is based on a repetitive sampling from a given dataset, is one of the most popular robust estimation methods. In...

ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-...

ECHO: Terminal Agents Learn World Models for Free

CLI agents are the closest thing language models have to an embodied setting: the model emits commands, the terminal executes them, and the returned str...

Edge AI in Manufacturing

Learn how to deploy AI models on industrial edge hardware using TensorRT quantization, ONNX Runtime, OpenVINO, MQTT-based edge-cloud architectures, and fleet management for hundreds of edge devices.

Edge and Mobile Inference

Running neural networks on devices with 5-15W power budgets - mobile NPUs, Apple Neural Engine, Qualcomm Hexagon, deployment frameworks, and LLMs on-device with llama.cpp and MLX.

Edge ML Deployment

Deploying ML models to smartphones, IoT devices, and embedded systems - model compression, edge runtimes, OTA updates, federated learning, and real-world examples.

EditCrafter: Tuning-free High-Resolution Image Editing via Pretrained Diffusion Model

We propose EditCrafter, a high-resolution image editing method that operates without tuning, leveraging pretrained text-to-image (T2I) diffusion models...

EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

Multimodal Large Language Models (MLLMs) hold significant promise for revolutionizing traditional education and reducing teachers' workload. However, ac...

Effective Biological Representation Learning by Masking Gene Expression

RNA sequencing produces rich and diverse datasets of gene expression, offering compelling insights into cellular state and function that have many appli...

Effective sample size approximations as entropy measures

In this work, we analyze alternative effective sample size (ESS) metrics for importance sampling algorithms, and discuss a possible extended range of ap...

Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers.

Efficiency-Effectiveness Reranking FLOPs for LLM-bas... - published at EMNLP 2025.

Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement

Agentic reinforcement learning (RL) has proven effective for training LLM-based agents with external tool-use capabilities. However, we identify that ag...

Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples...

Efficient Discovery of Approximate Causal Abstractions via Neural Mechanism Sparsification

Neural networks are hypothesized to implement interpretable causal mechanisms, yet verifying this requires finding a causal abstraction -- a simpler, hi...

Efficient Multivector Retrieval with Token-Aware Clustering and Hierarchical Indexing

Multivector retrieval models achieve state-of-the-art effectiveness through fine-grained token-level representations, but their deployment incurs substa...

Efficient Refusal Ablation in LLM through Optimal Transport

Safety-aligned language models refuse harmful requests through learned refusal behaviors encoded in their internal representations. Recent activation-ba...

Efficient RL Training for LLMs with Experience Replay

While Experience Replay - the practice of storing rollouts and reusing them multiple times during training - is a foundational technique in general RL,...

Efficient Targeted Maximum Likelihood Estimators for Two-Phase Design Problems

In a typical two-phase design, a random sample is drawn from the target population in phase 1, during which only a subset of variables is collected. In...

Efficient Training on Multiple Consumer GPUs with RoundPipe

Fine-tuning Large Language Models (LLMs) on consumer-grade GPUs is highly cost-effective, yet constrained by limited GPU memory and slow PCIe interconne...

EgoSteer: A Full-Stack System Towards Steerable Dexterous Manipulation from Egocentric Videos

Steerability is a defining capability of generalist robot policies, yet remains largely absent in dexterous-hand systems for lack of large-scale, langua...

Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach

While large language models hold promise for complex medical applications, their development is hindered by the scarcity of high-quality reasoning data....

ELT: Elastic Looped Transformers for Visual Generation

We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architec...

Embedding Models - The Landscape

A comprehensive survey of the embedding model ecosystem - SBERT, contrastive learning, SimCSE, E5, BGE, GTE, OpenAI, Voyage AI, Cohere, and the MTEB leaderboard.

Embedding Models Deep Dive

Master embedding model selection for retrieval - MTEB benchmarks, model families, Matryoshka embeddings, bi-encoders vs cross-encoders, and fine-tuning strategies.

Embedding Models in Production

How to choose, deploy, and manage embedding models at scale - including versioning, caching, batching, and migration strategies for production RAG systems.

Embedding Quantization

Reducing embedding storage and search costs - float32 to float16, int8, and binary quantization, Hamming distance search, the rescoring trick, and implementation with FAISS and Qdrant.

Embedding Spaces

How token embeddings form a dense vector space that captures semantic meaning - geometry, anisotropy, weight tying, and visualization.

Embedding Stores

Storing and serving dense embeddings at scale for real-time recommendation and search.

Embeddings in Production

Build, deploy, and operate production-grade embedding pipelines - caching, incremental indexing, staleness management, vector DB selection, and cost optimization at scale.

Emergent Compositional Communication for Latent World Properties

Can multi-agent communication pressure extract discrete, compositional representations of invisible physical properties from frozen video features? We s...

Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion

Monitoring autonomous language model agents currently relies mostly on surface behavior. But what happens when agent populations invent new languages wi...

Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs.

Emergent Misalignment via In-Context Learning: Narro... - published at ACL 2026.

EMO: Pretraining Mixture of Experts for Emergent Modularity

Large language models are typically deployed as monolithic systems, requiring the full model even when applications need only a narrow subset of capabil...

Empathy Prediction from Diverse Perspectives.

Empathy Prediction from Diverse Perspectives. - published at ACL 2025.

Encapsulation and Data Hiding - Properties, Name Mangling, and Descriptors

Master Python's encapsulation model - single vs double underscore conventions, name mangling mechanics, @property for controlled access, validation in setters, __slots__, and the descriptor protocol that powers @property, @classmethod, and @staticmethod internally.

Encoder vs Decoder vs Encoder-Decoder

Comparing encoder-only, decoder-only, and encoder-decoder transformer architectures - when to use each and why decoder-only won.

Encoder-Free Human Motion Understanding via Structured Motion Descriptions

The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion...

Enhanced Privacy and Communication Efficiency in Non-IID Federated Learning with Adaptive Quantization and Differential Privacy

Federated learning (FL) is a distributed machine learning method where multiple devices collaboratively train a model under the management of a central...

Enhancing AI and Dynamical Subseasonal Forecasts with Probabilistic Bias Correction

Decision-makers rely on weather forecasts to plant crops, manage wildfires, allocate water and energy, and prepare for weather extremes. Today, such for...

Enhancing Authorship Attribution with Synthetic Paintings

Attributing authorship to paintings is a historically complex task, and one of its main challenges is the limited availability of real artworks for trai...

Enhancing Hyperspace Analogue to Language (HAL) Representations via Attention-Based Pooling for Text Classification

The Hyperspace Analogue to Language (HAL) model relies on global word co-occurrence matrices to construct distributional semantic representations. While...

Enhancing In-context Panoramic Generation via Geometric-aware Pretraining

In this work, we present Canvas360, a two-stage framework for in-context panoramic generation that combines geometry-aware pretraining with downstream t...

Enhancing Open-Domain Task-Solving Capability of LLMs via Autonomous Tool Integration from GitHub.

Enhancing Open-Domain Task-Solving Capability of LLM... - published at ACL 2025.

Enhancing Reliability in Community Question Answering with an Expert-Oriented RAG System.

Enhancing Reliability in Community Question Answerin... - published at EACL 2026.

Enhancing Robustness of Federated Learning via Server Learning

This paper explores the use of server learning for enhancing the robustness of federated learning against malicious attacks even when clients' training...

EnsemW2S: Enhancing Weak-to-Strong Generalization with Large Language Model Ensembles.

EnsemW2S: Enhancing Weak-to-Strong Generalization wi... - published at ACL 2026.

Entropic Projection Alignment: Estimating, Explaining, and Improving Model Performance Under Distribution Shift

We propose a unified framework for addressing three key challenges of distribution shift: (1) estimating a model's performance on an unlabeled target do...

Environment Parity

Solve the dev/staging/prod parity problem for ML - feature skew, infrastructure differences, data drift, and environment promotion pipelines that prevent production surprises.

Envisioning the Future, One Step at a Time

Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains,...

Episodic Memory with Vector Store

Implement agent episodic memory using vector databases: storing, retrieving, consolidating, and forgetting past experiences at scale.

EquiformerV3: Scaling Efficient, Expressive, and General SE(3)-Equivariant Graph Attention Transformers

As SE(3)-equivariant graph neural networks mature as a core tool for 3D atomistic modeling, improving their efficiency, expressivity, and physical consi...

ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue

The rapid advancement of Multimodal Large Language Models (MLLMs) has empowered Unmanned Aerial Vehicles (UAVs) with exceptional capabilities in spatial r...

ESC-Skills: Discovering and Self-Evolving Skills for Emotional Support Conversations

Existing emotional support conversation (ESC) systems mainly rely on end-to-end response generation or coarse strategy supervision, offering limited int...

Escalation and Handoff Patterns

Designing AI systems that know when to stop and hand off to humans - confidence thresholds, sentiment detection, topic-based routing, context transfer, and escalation orchestration.

Escape dynamics and implicit bias of one-pass SGD in overparameterized quadratic networks

We analyze the one-pass stochastic gradient descent dynamics of a two-layer neural network with quadratic activations in a teacher-student framework. I...

Ethics and AI in Education

Learn FERPA compliance, algorithmic bias in educational AI, surveillance concerns, data minimization, transparency requirements, and responsible deployment of AI in learning environments.

EU AI Act and Global AI Regulation

The EU AI Act, US executive orders, UK AI policy, China AI regulations, and practical compliance implications for AI engineers building and deploying language models.

Evaluating Embedding Models

MTEB benchmark deep dive, nDCG@10, Recall@K, MRR, MAP, building domain-specific evaluation sets, running MTEB locally, and avoiding the contamination problem.

Evaluating Fine-Tuned Models

Evaluation strategies for fine-tuned LLMs - held-out test sets, LLM-as-judge evaluation, perplexity measurement, task-specific benchmarks, and avoiding evaluation pitfalls.

Evaluating Generative Models - FID, IS, Precision/Recall, Human Evaluation

A complete guide to evaluating generative models - from the mathematics of FID and Inception Score to Precision/Recall manifolds, CLIP-based metrics, DINO similarity, human preference studies, metric gaming, and building production evaluation pipelines.

Evaluating Reasoning Models

The benchmark landscape for reasoning models - AIME, MATH-500, Codeforces, ARC-AGI, GPQA Diamond, process vs. outcome evaluation, and contamination concerns.

Evaluating the Progression of Large Language Model Capabilities for Small-Molecule Drug Design

Large Language Models (LLMs) have the potential to accelerate small molecule drug design due to their ability to reason about information from diverse s...

Evaluating the Quality of ML Explanations - Faithfulness, Robustness, and Human Studies

How to measure whether an ML explanation is actually good - faithfulness metrics, the ROAR benchmark, sanity checks, human evaluation studies, and a complete quantitative evaluation pipeline.

Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction

Leader-follower interaction is an important paradigm in human-robot interaction (HRI). Yet, assigning roles in real time remains challenging for resourc...

Evaluation of Deontic Conditional Reasoning in Large Language Models: The Case of Wason's Selection Task.

Evaluation of Deontic Conditional Reasoning in Large... - published at EACL 2026.

Evaluation-Driven Development

Building AI systems test-first - write evals before writing prompts. The EDD loop, eval strategies, golden dataset construction, LLM-as-judge calibration, and a full EvalSuite implementation ready for CI integration.

Evaluation-driven Scaling for Scientific Discovery

Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively re...

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To achieve such demandi...

Event Sourcing for ML Systems

Learn how event sourcing enables auditable, reproducible ML systems - covering the event log, Kafka as an event store, temporal queries, and the projection pattern.

Event-Driven Architecture for ML

Event sourcing and CQRS patterns for ML systems - event-driven state management, Kafka Streams for ML pipelines, event schema design, dead letter queues, and event replay for debugging.

Event-Driven ML Architecture

Designing ML systems around events - event sourcing, CQRS for feature stores, the outbox pattern, and how LinkedIn's unified messaging platform drives ML at scale.

Event-Driven Temporal Graph Networks for Asynchronous Multi-Agent Cyber Defense in NetForge_RL

The transition of Multi-Agent Reinforcement Learning (MARL) policies from simulated cyber wargames to operational Security Operations Centers (SOCs) is...

EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration

We propose EverAnimate, an efficient post-training method for long-horizon animated video generation that preserves visual quality and character identit...

Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution

Creating images from noise is image generation; reconstructing fine details from coarse inputs is super-resolution. Despite their practical differences,...

Evidence-Backed Video Question Answering

Current Video Large Language Models (Video LLMs) excel in question answering (QA) but largely operate as black boxes, providing textual answers without...

Evol-Instruct

Evol-Instruct: systematically evolving instruction datasets to create complex, diverse training data that produces stronger instruction-following models - the technique behind WizardLM and WizardCoder.

EvoMaster: A Foundational Agent Framework for Building Evolving Autonomous Scientific Agents at Scale

The convergence of large language models and agents is catalyzing a new era of scientific discovery: Agentic Science. While the scientific method is inh...

EXAONE 4.5 Technical Report

This technical report introduces EXAONE 4.5, the first open-weight vision language model released by LG AI Research. EXAONE 4.5 is architected by integr...

Experience Transfer for Multimodal LLM Agents in Minecraft Game

Multimodal LLM agents operating in complex game environments must continually reuse past experience to solve new tasks efficiently. In this work, we pro...

Experiment Tracking

Design and govern ML experiment tracking at scale - from MLflow architecture to organizing 50 data scientists' experiments without chaos.

Experimentation and A/B Testing for ML Systems

How to design statistically rigorous experiments for ML systems - Bayesian vs frequentist A/B tests, network interference, interleaving, switchback experiments, and guardrail metrics.

Experimentation Platforms

Build and operate ML experimentation infrastructure - assignment services, metric computation pipelines, analysis tools, and the engineering required to scale from 3 to 30 experiments per month.

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

Mixture-of-Experts (MoE) has become the dominant architecture for scaling large language models: frontier models routinely decouple total parameters fro...

Expert-Choice Routing Enables Adaptive Computation in Diffusion Language Models

Diffusion language models (DLMs) enable parallel, non-autoregressive text generation, yet existing DLM mixture-of-experts (MoE) models inherit token-cho...

Explainability in Production

Serving model explanations alongside predictions - SHAP for production, Anchors for rule-based explanations, explanation as a service, debugging production failures with explanations, and regulatory compliance.

Explainability in Production ML Systems - Monitoring, Latency, and Compliance

How to operationalize ML explainability at scale - latency budgets, caching strategies, drift monitoring, compliance audit trails, and production architecture patterns for regulated industries.

Explainable cluster analysis: a bagging approach

A major limitation of clustering approaches is their lack of explainability: methods rarely provide insight into which features drive the grouping of si...

Explainable Disentangled Representation Learning for Generalizable Authorship Attribution in the Era of Generative AI

Learning robust representations of authorial style is crucial for authorship attribution and AI-generated text detection. However, existing methods ofte...

Explainable Load Forecasting with Covariate-Informed Time Series Foundation Models

Time Series Foundation Models (TSFMs) have recently emerged as general-purpose forecasting models and show considerable potential for applications in en...

Explicit Trait Inference for Multi-Agent Coordination.

Explicit Trait Inference for Multi-Agent Coordination. - published at ACL 2026.

Exploiting Subgradient Sparsity in Max-Plus Neural Networks

Deep Neural Networks are powerful tools for solving machine learning problems, but their training often involves dense and costly parameter updates. In...

Exploration and Exploitation Errors Are Measurable for Language Model Agents

Language Model (LM) agents are increasingly used in complex open-ended decision-making tasks, from AI coding to physical AI. A core requirement in these...

Exploration Hacking: Can LLMs Learn to Resist RL Training?

Reinforcement learning (RL) has become essential to the post-training of large language models (LLMs) for reasoning, agentic capabilities and alignment....

Exploring Autonomous Agentic Data Engineering for Model Specialization

Large Language Models (LLMs) have demonstrated strong performance on general tasks, while often struggling to adapt to specialized domains without high-...

Exploring Spatial Intelligence from a Generative Perspective

Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective....

Exploring Two-Phase Continual Instruction Fine-tuning for Multilingual Adaptation in Large Language Models.

Exploring Two-Phase Continual Instruction Fine-tunin... - published at ACL 2026.

Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results. Existin...

Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

Large language model (LLM) agents are increasingly built less by changing model weights than by reorganizing the runtime around them. Capabilities that...

FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models.

FACT-AUDIT: An Adaptive Multi-Agent Framework for Dy... - published at ACL 2025.

FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification

Peer review in machine learning is under growing pressure from rising submission volume and limited reviewer time. Most LLM-based reviewing systems read...

Factuality and Hallucination Evaluation

Measuring hallucination rates in open-source LLMs - TruthfulQA, FActScore, RAGAs factuality, entity verification, and building automated hallucination detection pipelines for production RAG systems.

Fairness under Graph Uncertainty: Achieving Interventional Fairness with Partially Known Causal Graphs over Clusters of Variables

Algorithmic decisions about individuals require predictions that are not only accurate but also fair with respect to sensitive attributes such as gender...

Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchma...

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

Chains of thought (CoTs) have become central in interpreting and auditing behaviors of large language models. Yet growing evidence suggests that these t...

False Friends or Cognates? A Cross-lingual Semantic Ambiguity Evaluation for Galician, Portuguese and Spanish.

False Friends or Cognates? A Cross-lingual Semantic... - published at ACL 2026.

FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

Human-centric video customization, particularly at the garment level, has shown significant commercial value. However, existing approaches cannot suppor...

Fast Byte Latent Transformer

Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limite...

Fast Spatial Memory with Elastic Test-Time Training

Large Chunk Test-Time Training (LaCT) has shown strong performance on long-context 3D reconstruction, but its fully plastic inference-time updates remai...

Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

End-to-end autonomous driving via Vision-Language-Action (VLA) models demands a precarious balance between high-fidelity trajectory planning and efficie...

FastAPI - Type-Driven APIs with Automatic Validation and Docs

Master FastAPI at engineering depth - ASGI foundations, Pydantic validation, dependency injection, middleware, response models, background tasks, testing, and router organisation for production APIs.

FastKernels: Benchmarking GPU Kernel Generation in Production

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize agains...

Fault Tolerance in Large Cluster Training

Why fault tolerance is critical at scale, how to design checkpointing strategies, detect stragglers, handle spot preemptions, and recover from failures without restarting multi-week training runs.

FaultXformer: A Transformer-Encoder Based Fault Classification and Location Identification model in PMU-Integrated Active Electrical Distribution System

Accurate fault detection and localization in electrical distribution systems is crucial, especially with the increasing integration of distributed energ...

Feature Consistency

Ensuring identical features between training (offline) and serving (online).

Feature Engineering at Scale

How to redesign feature engineering pipelines for distributed compute when a 10 GB solution fails at 500 GB.

Feature Engineering at Scale - The 80% of ML Work That Determines 80% of Results

How to build feature pipelines that work identically in training and serving - feature stores, point-in-time joins, crossing, embedding lookup, and avoiding training-serving skew.

Feature Importance and SHAP

Master all three feature importance types, TreeSHAP for exact Shapley values, SHAP interaction values, feature selection with SHAP, data leakage detection, fairness analysis, and production importance drift monitoring.

Feature Importance Methods - Beyond SHAP

Permutation importance, impurity-based importance, partial dependence plots, ALE, H-statistics, Sobol indices, and production monitoring - the complete toolkit for understanding which features drive your model's decisions, and when each method lies to you.

Feature Monitoring

Detecting feature drift, staleness, and coverage gaps in production.

Feature Monitoring in Production

Monitoring features after deployment - PSI, KS tests, freshness monitoring, completeness tracking, and proving to a regulator that no feature drifted more than 10% PSI.

Feature Platform

Build a shared feature platform that eliminates cross-team feature duplication, ensures training-serving consistency, and serves fresh features at millisecond latency.

Feature Selection and Importance

Reducing 500 features to 50 without losing model performance - filter, wrapper, and embedded methods, SHAP-based selection, and leakage detection.

Feature Store Architecture

How feature stores solve training-serving skew with a dual-store architecture - offline store for training, online store for serving, and point-in-time correct retrieval.

Feature Stores in Production

Architecture and operations of feature stores - offline and online layers, point-in-time joins, and avoiding the training-serving skew that costs you accuracy.

Feature Validation and Testing

Ensuring feature quality through schema validation, unit tests, integration tests, and monitoring - catching the NaN bug before it degrades your model for 3 weeks.

Federated Learning in Healthcare

Training ML models across hospital systems without sharing patient data - FedAvg algorithm, differential privacy, non-IID data challenges, NVIDIA FLARE, and practical multi-hospital federated learning with Flower.

FedMental: Evaluating Federated Learning for Mental Health Detection from Social Media Data.

FedMental: Evaluating Federated Learning for Mental... - published at ACL 2026.

Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

Reconstructing 3D representations from 2D inputs is a fundamental task in computer vision and graphics, serving as a cornerstone for understanding and i...

Feed-Forward Layers

The role of position-wise feed-forward networks in transformers - from the basic FFN to SwiGLU and Mixture of Experts.

Feedback Collection for LLM Systems

Build production-grade feedback collection systems for AI products - explicit signals, implicit behavioral signals, data schemas, bias mitigation, and closed-loop improvement pipelines.

Feedback Loops and Data Flywheels

How recommendation systems create self-reinforcing feedback loops, how to detect them, and how inverse propensity weighting and exploration strategies break them to enable unbiased learning.

Feedback Loops and the Data Flywheel - How ML Systems Compound Over Time

A deep dive into feedback loop design, concept drift detection, retraining strategies, and building data flywheels that make ML systems continuously improve in production.

Few-Shot Learning and Chain-of-Thought Prompting

Master few-shot example selection, chain-of-thought reasoning, self-consistency decoding, and when to use each technique for reliable LLM outputs.

Few-Shot Prompting

Master in-context learning by providing carefully selected examples that demonstrate the exact behavior you want - without any model fine-tuning.

File Systems and IO Patterns

Master Linux file systems for ML workloads - VFS, ext4/XFS, page cache, direct I/O, mmap, io_uring, and how to tune I/O for maximum training data throughput and checkpoint speed.

FileGram: Grounding Agent Personalization in File-System Behavioral Traces

Coworking AI agents operating within local file systems are rapidly emerging as a paradigm in human-AI interaction; however, effective personalization r...

Fine-Tuning Cost and ROI Analysis

Making the business case for LLM fine-tuning - calculating GPU compute costs, estimating break-even against API pricing, and deciding when fine-tuning beats prompt engineering on ROI.

Fine-Tuning Diffusion Models - DreamBooth, LoRA, Textual Inversion, ControlNet

How to teach Stable Diffusion new concepts with as few as 5-20 images - covering Textual Inversion, DreamBooth, LoRA, ControlNet, and IP-Adapter with full training code, hyperparameter guidance, and evaluation strategies.

Fine-Tuning Embedding Models for Your Domain

Contrastive fine-tuning with triplet loss, hard negative mining, in-batch negatives, synthetic data generation, TSDAE, GPL, and a full worked example on domain adaptation.

Fine-Tuning Hyperparameter Search

Systematic hyperparameter optimization for LLM fine-tuning - learning rate, batch size, epochs, LoRA rank, warmup schedules, and efficient search strategies with Optuna and WandB sweeps.

Fine-Tuning Ops

Operationalize LLM fine-tuning at scale - data pipelines, LoRA adapter management, adapter registries, and serving 50 customer-specific adapters efficiently.

Fine-Tuning Pipelines

End-to-end fine-tuning pipeline engineering - from data collection and curation to training, evaluation, and deployment. When to fine-tune vs RAG vs prompt engineering, and how to build the pipeline that makes it repeatable and production-safe.

Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations....

Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models

Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward...

FIT: A Large-Scale Dataset for Fit-Aware Virtual Try-On

Given a person and a garment image, virtual try-on (VTO) aims to synthesize a realistic image of the person wearing the garment, while preserving their...

Five Pillars of Data Observability for ML Systems

What freshness, distribution, volume, schema, and lineage tracking do for AI systems, how undetected data drift and pipeline failures silently corrupt model inputs and degrade predictions, and how to instrument these five pillars in production AI data pipelines.

Fixed-Budget Constrained Best Arm Identification in Grouped Bandits

We study fixed budget constrained best-arm identification in grouped bandits, where each arm consists of multiple independent attributes with stochastic...

FL-MHSM: Spatially-adaptive Fusion and Ensemble Learning for Flood-Landslide Multi-Hazard Susceptibility Mapping at Regional Scale

Existing multi-hazard susceptibility mapping (MHSM) studies often rely on spatially uniform models, treat hazards independently, and provide limited rep...

Flash Attention Kernel Deep Dive

How FlashAttention rewrites the attention mechanism to never materialize the N x N matrix in HBM, the online softmax tiling algorithm, IO complexity analysis, and FlashAttention 2 and 3 improvements.

Flash-BoN: Instant Drafts for Inference-Time Scaling in Diffusion Models

Inference-time scaling for text-to-image generation has progressed from simple Best-of-N (BoN) sampling to guided search methods that verify and steer c...

Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization

Group Relative Policy Optimization has emerged as essential for aligning video diffusion models with human preferences, but faces a critical computation...

FlashOptim: Optimizers for Memory Efficient Training

Standard mixed-precision training of neural networks requires many bytes of accelerator memory for each model parameter. These bytes reflect not just th...

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

Long-context large language models (LLMs), for example Gemini-3.1-Pro and Qwen-3.5, are widely used to empower many real-world applications, such as retr...

Flask - Building REST APIs the Right Way

Master Flask at engineering depth - application factory pattern, request context proxies, routing, Blueprints, error handlers, testing with test_client, configuration management, and the extension ecosystem for building production-grade REST APIs.

Flex-Forcing: Towards a Unified Autoregressive and Bidirectional Video Diffusion Model

Recent progress in large-scale generative models has substantially advanced video generation, yet existing methods remain constrained by a rigid inferen...

FlexiTac: A Low-Cost, Open-Source, Scalable Tactile Sensing Solution for Robotic Systems

We present FlexiTac, a low-cost, open-source, and scalable piezoresistive tactile sensing solution designed for robotic end-effectors. FlexiTac is a pra...

Flow Matching is Adaptive to Manifold Structures

Flow matching has emerged as a simulation-free alternative to diffusion-based generative modeling, producing samples by solving an ODE whose time-depend...

Flow-ERD: Agent-type Aware Flow Matching with Entropy-Regularized Distillation for Diverse Traffic Simulation

Realistic and diverse traffic simulation is essential to autonomous driving development. Yet prevailing benchmarks predominantly reward realism, and rec...

Flow-OPD: On-Policy Distillation for Flow Matching Models

Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-...

FlowAnchor: Stabilizing the Editing Signal for Inversion-Free Video Editing

We propose FlowAnchor, a training-free framework for stable and efficient inversion-free, flow-based video editing. Inversion-free editing methods have...

FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching

Multimodal generation has long been dominated by text-driven pipelines where language dictates vision but cannot reason or create within it. We challeng...

Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

The quadratic computational complexity of standard attention mechanisms presents a severe scalability bottleneck for LLMs in long-context scenarios. Whi...

Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models

Autoregressive (AR) video diffusion models adopt a streaming generation framework, enabling long-horizon video generation with real-time responsiveness,...

FoReco and FoRecoML: A Unified Toolbox for Forecast Reconciliation in R

Forecast reconciliation has become key to improving the accuracy and coherence of forecasts for linearly constrained multiple time series, such as hiera...

Forge-UGC: FX optimization and register-graph engine for universal graph compiler

We present Forge-UGC (FX Optimization and Register-Graph Engine for Universal Graph Compilation), a four-phase compiler for transformer deployment on he...

FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios

The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution,...

Foundation Protocol: A Coordination Layer for Agentic Society

Autonomous agents are moving from tools into a layer of social infrastructure: they browse, purchase, deploy software, manage systems, and increasingly...

Foundational CS for ML Engineers

The computer science foundations that make ML engineers dangerous - CPU and GPU architecture, operating systems, compilers, memory management, networking, algorithms, and systems programming.

Four Types of Agent Memory

Cognitive science meets AI engineering: working, episodic, semantic, and procedural memory implemented in production agent systems.

FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

Reinforcement-Learning-based post-training has recently emerged as a promising paradigm for aligning text-to-image diffusion models with human preferenc...

FPGAs for AI Inference

How FPGAs enable sub-microsecond AI inference - reconfigurable logic, HLS programming, Xilinx Vitis AI, quantization strategies, and when FPGAs beat GPUs for latency-critical deployments.

Fractals made Practical: Denoising Diffusion as Partitioned Iterated Function Systems

What is a diffusion model actually doing when it turns noise into a photograph? We show that the deterministic DDIM reverse chain operates as a Partitio...

Frame Theoretical Derivation of Three Factor Learning Rule for Oja's Subspace Rule

We show that the error-gated Hebbian rule for PCA (EGHR-PCA), a three-factor learning rule equivalent to Oja's subspace rule under Gaussian inputs, can...

Framework Comparison

Comprehensive comparison of LangGraph, CrewAI, AutoGen, LlamaIndex, and raw API across 12 production dimensions - with decision flowchart and real case studies.

Framing ML Problems - Turning Business Goals into Training Objectives

Learn how to translate ambiguous business goals into precise ML objectives - the most critical and most overlooked skill in ML system design.

Frankenmodels and Limitations of Model Merging

Layer grafting, depth upscaling, Solar 10.7B, and the fundamental limits of what model merging can and cannot achieve.

FRAPPE: Full Input, Residual Output Autoencoding with Projection Pursuit Encoder

Media compression standards have reached a plateau in terms of the rate-distortion-complexity trade-off, limiting the ability to offload expensive AI pe...

Fraud Detection Systems

Real-time payment fraud detection at Stripe scale - rule-based baselines, graph fraud detection, session-level features, adversarial robustness, and false positive cost analysis.

Fraud Type Decomposition and the Observation-Mechanism Taxonomy: Class-Specific Detection Limits in Payment Networks

Fraud detection in payment networks relies on labels generated through heterogeneous and imperfect observation processes, yet existing approaches treat...

Free Geometry: Refining 3D Reconstruction from Longer Versions of Itself

Feed-forward 3D reconstruction models are efficient but rigid: once trained, they perform inference in a zero-shot manner and cannot adapt to the test s...

From Context to Skills: Can Language Models Learn from Context Skillfully?

Many real-world tasks require language models (LMs) to reason over complex contexts that exceed their parametric knowledge. This calls for context learn...

From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will perfor...

From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes.

From Feedback to Checklists: Grounded Evaluation of... - published at EMNLP 2025.

From Foundation to Application: Improving VLA Models in Practice

Despite recent progress of VLA foundation models, the disparity between laboratory conditions and real-world applications continues to impede their prac...

From Human-Centric to Agentic Code Review: The Impact of Different Generations of Generative AI Technology on Review Quality

Code review helps maintain software quality before code integration, but it also imposes a substantial workload on human reviewers. As generative artifi...

From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding.

From Long Videos to Engaging Clips: A Human-Inspired... - published at EMNLP 2025.

From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are u...

From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

This paper studies the next major bottleneck in agentic AI as system scaling, not only model scaling: the design of auditable, persistent, modular, and...

From Noisy Traces to Root Causes: Structural Trajectory Analysis and Causal Extraction for Agent Optimization

The optimization of long-horizon agents increasingly relies on reflection-based mechanisms, where a large language model (LLM) acts as an optimizer to d...

From P(y|x) to P(y): Investigating Reinforcement Learning in Pre-train Space

While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its...

From Paper to Structured JSON: An Agentic AI Workflow for Compliant BMR Digital Transformation.

From Paper to Structured JSON: An Agentic AI Workflo... - published at EACL 2026.

From Pixels to States: Rethinking Interactive World Models as Game Engines

Building interactive worlds that respond coherently to player actions has long been a shared goal of computer graphics, games, and artificial intelligen...

From Pixels to Words -- Towards Native One-Vision Models at Scale

Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment, a modular frame...

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can read and writ...

From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

Reinforcement learning (RL) for large language models (LLMs) increasingly relies on sparse, outcome-level rewards -- yet determining which actions withi...

From RGB Generation to Dense Field Readout: Pixel-Space Dense Prediction with Text-to-Image Models

Large-scale text-to-image models are attractive backbones for dense prediction because RGB generation pretraining learns rich semantic, structural, and...

From Shallow Bayesian Neural Networks to Gaussian Processes: General Convergence, Identifiability and Scalable Inference

In this work, we study scaling limits of shallow Bayesian neural networks (BNNs) via their connection to Gaussian processes (GPs), with an emphasis on s...

From Storage to Experience: A Survey on the Evolution of LLM Agent Memory Mechanisms

Large Language Model (LLM)-based agents have fundamentally reshaped artificial intelligence by integrating external tools and planning capabilities. Whi...

FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

Many real-world coding challenges are open-ended and admit no known optimal solution. Yet, recent progress in LLM coding has focused on well-defined tas...

Full Fine-Tuning vs PEFT

Decision framework for choosing between full fine-tuning and parameter-efficient methods like LoRA and QLoRA - covering compute requirements, quality ceilings, catastrophic forgetting, and when each approach wins.

Full Fine-Tuning vs PEFT: Decision Framework

A practical decision framework for choosing between full fine-tuning, LoRA, QLoRA, prompt tuning, and other PEFT methods based on your model size, data, and quality requirements.

Function-Aware Fill-in-the-Middle as Mid-Training for Coding Agent Foundation Models

Coding agents must integrate external tool returns into ongoing reasoning - a capability that standard left-to-right pretraining on code exposes only in...

Function2Scene: 3D Indoor Scene Layout from Functional Specifications

Most text-driven 3D indoor scene synthesis methods generate rooms from object-centric prompts, asking what furniture should be placed rather than how th...

Functional Attention: From Pairwise Affinities to Functional Correspondences

Learning mappings between infinite-dimensional function spaces, or operator learning, is essential for many machine learning applications. Although tran...

GADFA: Generator-Assisted Decision-Focused Approach for Opinion Expressing Timing Identification.

GADFA: Generator-Assisted Decision-Focused Approach... - published at COLING 2025.

GAIA Benchmark

GAIA tests general-purpose agents on real-world tasks requiring web search, file reading, code execution, and multi-step reasoning. Learn the task structure, scoring, SOTA analysis, and how to build GAIA-style evaluations.

GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

Towards an embodied generalist for real-world interaction, Multimodal Large Language Model (MLLM) agents still suffer from challenging latency, sparse f...

Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

World models for interactive video generation have largely focused on single-agent settings, where future observations are generated from a single contr...

Garbage Collection - Generational GC, Cycle Detection, and Memory Leak Diagnosis

Master CPython's cyclic garbage collector at engineering depth - generational collection, three generations, cycle detection algorithm, gc module API, __del__ and PEP 442, gc.freeze() for fork, gc.get_referrers() for leak diagnosis, and common memory leak patterns.

Garbage Collection Algorithms

How Python's reference counting and generational garbage collector work, why GC pauses hurt ML serving latency, and how to tune or disable GC for performance-critical workloads.

Gaussian Processes - Non-Parametric Bayesian Regression with Calibrated Uncertainty

Gaussian processes provide a full distribution over functions with principled uncertainty estimates - how they work, kernel engineering, and when to use them over neural networks.

GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

The autonomous discovery of bugs remains a significant challenge in modern software development. Compared to code generation, the complexity of dynamic...

GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

Reinforcement learning (RL) can be used to improve the policy (denoiser) of diffusion large language models (dLLMs), while being hindered by the intract...

GE-Sim 2.0: A Roadmap Towards Comprehensive Closed-loop Video World Simulators for Robotic Manipulation

We introduce GE-Sim 2.0 (Genie Envisioner World Simulator 2.0), a closed-loop video world simulator for robotic manipulation. Building on the action-con...

GEM: Generative Supervision Helps Embodied Intelligence

Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Acti...

Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

We introduce Gemini Embedding 2, a native multimodal embedding model that allows embedding video, audio, image, and text modalities in a unified represe...

Gemma 4 Technical Report

We introduce Gemma 4, a new generation of open-weight, natively multimodal language models in the Gemma model family. Designed to advance compute effici...

GenClaw: Code-Driven Agentic Image Generation

Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocatio...

General Bayesian Policy Learning

This study proposes the General Bayes framework for policy learning. We consider decision problems in which a decision-maker chooses an action from an a...

General Multimodal Protein Design Enables DNA-Encoding of Chemistry

Evolution is an extraordinary engine for enzymatic diversity, yet the chemistry it has explored remains a narrow slice of what DNA can encode. Deep gene...

General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and...

Generalization and Scaling Laws for Mixture-of-Experts Transformers

We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates active per-input capacity from...

Generalization Properties of Score-matching Diffusion Models for Intrinsically Low-dimensional Data

Despite the remarkable empirical success of score-based diffusion models, their statistical guarantees remain underdeveloped. Existing analyses often pr...

Generalized Linear Models

Understand the GLM framework - link functions, exponential family distributions, Poisson regression for count data, Gamma regression for positive continuous targets, IRLS algorithm, overdispersion, and deviance-based model comparison.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

Reinforcement learning (RL) has become a central post-training tool for improving the reasoning abilities of large language models (LLMs). In these syst...

Generating DDPM-based Samples from Tilted Distributions

Given $n$ independent samples from a $d$-dimensional probability distribution, our aim is to generate diffusion-based samples from a distribution obtain...

Generating Multi-Aspect Queries for Conversational Search.

Generating Multi-Aspect Queries for Conversational S... - published at EACL 2026.

Generating Statistical Charts with Validation-Driven LLM Workflows

Generating diverse, readable statistical charts from tabular data remains challenging for LLMs, as many failures become apparent after rendering and are...

Generative Adversarial Networks - From the Original GAN to StyleGAN

The complete story of GANs - from Goodfellow's 2014 minimax formulation to DCGAN, Wasserstein GAN, Progressive GAN, and StyleGAN2 - including training instabilities, theoretical foundations, and why diffusion models eventually surpassed them.

Generative Compilation: On-the-Fly Compiler Feedback as AI Generates Code

Languages with rich static semantics, such as Rust, provide stronger guarantees for AI-generated code, but their strictness makes generation more diffic...

Generative Modeling with Orbit-Space Particle Flow Matching

We present Orbit-Space Geometric Probability Paths (OGPP), a particle-native flow-matching framework for generative modeling of particle systems. OGPP i...

Generative Models Overview - VAEs, GANs, Flow Models, and Diffusion

A unified view of generative modeling approaches - how VAEs, GANs, normalizing flows, energy-based models, and diffusion models each define a different way to learn a distribution, with trade-offs in quality, diversity, training stability, and likelihood.

Generative Quantum-inspired Kolmogorov-Arnold Eigensolver

High-performance computing (HPC) is increasingly important for scalable quantum chemistry workflows that couple classical generative models, quantum cir...

Generative Refinement Networks for Visual Synthesis

While diffusion models dominate the field of visual generation, they are computationally inefficient, applying a uniform computational effort regardless...

Generators and yield - Suspended Execution at Engineering Depth

Understand Python generators and yield at engineering depth - frame suspension, the generator state machine, send() and the coroutine protocol, yield from, throw() and close(), memory-efficient pipelines, and the foundation of async/await.

GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)

Long-horizon large language model (LLM) agents are fundamentally limited by context. As interactions become longer, tool descriptions, retrieved memorie...

Generics and TypeVar

Master generic programming in Python with TypeVar, Generic base class, bound and constrained type variables, covariance vs contravariance vs invariance, and real-world patterns from FastAPI and SQLAlchemy.

GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos

We present GenLCA, a diffusion-based generative model for generating and editing photorealistic full-body avatars from text and image inputs. The genera...

Genomics and Protein Folding

AI for genomics and protein science - AlphaFold 2 architecture, variant calling, polygenic risk scores, DNA language models, and practical protein structure prediction with ESMFold.

GeoChemAD: Benchmarking Unsupervised Geochemical Anomaly Detection for Mineral Exploration

Geochemical anomaly detection plays a critical role in mineral exploration as deviations from regional geochemical baselines may indicate mineralization...

Geometric coherence of single-cell CRISPR perturbations reveals regulatory architecture and predicts cellular stress

Genome engineering has achieved remarkable sequence-level precision, yet predicting the transcriptomic state that a cell will occupy after perturbation...

Geometric Context Transformer for Streaming 3D Reconstruction

Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which necessitates geometric acc...

Geometric regularization of autoencoders via observed stochastic dynamics

Stochastic dynamical systems with slow or metastable behavior evolve, on long time scales, on an unknown low-dimensional manifold in high-dimensional am...

Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence

Foundation features from self-supervised vision models and text-to-image diffusion models have proven effective for semantic correspondence estimation....

Geometry-Aware Image Flow Matching

Recent advances in generative models highlight the power of geometry-aware modeling in manifold-constrained settings. Yet, for natural images, the field...

Geometry-Aware Representation Denoising for Robust Multi-view 3D Reconstruction

Multi-view 3D reconstruction has achieved remarkable progress with the advent of feed-forward 3D reconstruction models. However, these models are typica...

GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs

We address the challenge of knowledge composition in Vision-Language Models (VLMs), where accumulating expertise across multiple domains or tasks typica...

GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient...

GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

Real-world image restoration (IR) is bottlenecked by the scarcity of high-quality paired training data. Synthetic datasets are abundant but often fail t...

GigaWorld-Policy-0.5: A Faster and Stronger WAM Empowered by AutoResearch

World Action Models (WAMs) improve robot policy learning by jointly modeling actions and future visual observations, using future scene evolution as den...

GitHub Actions for ML

Build a complete ML CI pipeline in GitHub Actions that triggers training only when training data or model code changes - not on every commit.

GitLab CI for ML

Build an enterprise-grade ML CI/CD pipeline in GitLab CI - from data commit to production deployment with DAG pipelines, GPU runners, and environments.

GitOps for ML

Apply GitOps principles to ML infrastructure - Flux CD, ArgoCD, image update automation, secrets management, and PR-gated model deployments with Argo Rollouts.

Giving Sensors a Voice: Multimodal JEPA for Semantic Time-Series Embeddings

Transformer-based architectures have advanced sequence modeling in language and vision, yet general-purpose representation learning for heterogeneous mu...

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environmen...

Global Interpretability via Automated Preprocessing: A Framework Inspired by Psychiatric Questionnaires

Psychiatric questionnaires are highly context sensitive and often only weakly predict subsequent symptom severity, which makes the prognostic relationsh...

Global Optimality for Constrained Exploration via Penalty Regularization

Efficient exploration is a central problem in reinforcement learning and is often formalized as maximizing the entropy of the state-action occupancy mea...

Global, Shared, and Register Memory

Master the five CUDA memory spaces - registers, shared memory, L1/L2 cache, and global memory - with real latency numbers, tiled matrix multiply, and the patterns that separate 8% bandwidth utilization from 85%.

GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens

The efficient spatial allocation of primitives serves as the foundation of 3D Gaussian Splatting, as it directly dictates the synergy between representa...

GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts

Optical character recognition (OCR) has advanced rapidly with the rise of vision-language models, yet evaluation has remained concentrated on a small cl...

GNNs for Recommender Systems

How LightGCN, PinSage, and NGCF use graph neural networks on user-item interaction graphs to capture multi-hop collaborative filtering signals at billion-scale.

GO-GenZip: Goal-Oriented Generative Sampling and Hybrid Compression

Current network data telemetry pipelines consist of massive streams of fine-grained Key Performance Indicators (KPIs) from multiple distributed sources...

Goal-Driven Data Story, Narrations and Explanations.

Goal-Driven Data Story, Narrations and Explanations. - published at NAACL 2025.

Google BigQuery

BigQuery architecture, built-in ML functions, and BigQuery ML.

Google TPU Architecture

Deep dive into Google's Tensor Processing Units - systolic array design, XLA compilation, TPU pod topology, and how to write high-performance JAX programs that avoid recompilation traps.

Google Vertex AI for MLOps

Master the complete Google Vertex AI platform for end-to-end ML workflows - Pipelines, Training, Prediction, Feature Store, Model Registry, Experiments, and production deployment on GCP.

GPTQ In Depth

A deep technical walkthrough of the GPTQ algorithm - Optimal Brain Surgeon derivation, layer-by-layer quantization, group quantization, actorder, and practical deployment with AutoGPTQ and vLLM.

GPTQ: Post-Training Quantization

GPTQ explained from first principles - how Hessian-based error compensation quantizes 175B models to 4-bit in hours, the role of calibration data, group size, activation reordering, and how to deploy GPTQ models in production with vLLM and AutoGPTQ.

GPU Architecture for ML Engineers

Understand CUDA cores vs Tensor Cores, GPU memory hierarchy, FLOPS vs memory bandwidth, the roofline model, warp execution, and NVLink - the hardware knowledge that drives ML optimization.

GPU Cluster Networking

InfiniBand vs RoCE vs Ethernet for GPU cluster communication, fat-tree and rail-optimized topologies, GPUDirect RDMA, SHARP in-network aggregation, and diagnosing collective communication bottlenecks in production ML clusters.

GPU Containers

Build and run GPU-enabled containers for ML - covering NVIDIA Container Toolkit, CUDA compatibility, Kubernetes GPU scheduling, and debugging GPU access.

GPU Cost Optimization

Systematically reduce GPU infrastructure costs with spot instances, GPU sharing via MPS and MIG, right-sizing, reserved instances, efficient batching, utilization monitoring, and GPU marketplace strategies.

GPU Inference vs Training Requirements

Why inference and training have fundamentally different GPU hardware requirements, covering compute vs memory-bandwidth bottlenecks, the prefill/decode split, and how to select the right GPU for serving.

GPU Memory Hierarchy Deep Dive

Complete GPU memory hierarchy - registers, L1/shared memory, L2 cache, and HBM - capacity, bandwidth, latency at each level, and how data flows through the hierarchy during kernel execution.

GPU Memory Management

Master VRAM capacity planning, activation checkpointing, mixed precision training, ZeRO optimizer stages, CPU offloading, and OOM debugging for production ML workloads.

GPU Scheduling in Kubernetes

GPU resource management in Kubernetes - NVIDIA device plugin, MIG, time-slicing, node affinity, GPU quotas per namespace, and DCGM monitoring for ML clusters.

GPU vs CPU Architecture

Why GPUs dominate deep learning - SIMT execution model, throughput vs latency optimization, the fundamental design tradeoffs between CPU and GPU silicon.

Gradient Boosting From Scratch

Understand gradient boosting from first principles - additive models, functional gradient descent, pseudo-residuals for any loss function, shrinkage, stochastic boosting, and bias-variance tradeoffs versus Random Forest.

Gradient Boosting within a Single Attention Layer

Transformer attention computes a single softmax-weighted average over values -- a one-pass estimate that cannot correct its own errors. We introduce \em...

Gradient Checkpointing and Rematerialization

Activation checkpointing to reduce training memory usage, sublinear memory algorithm, selective checkpointing strategies, and implementation in PyTorch and JAX.

Gradient Descent From Scratch

Implement gradient descent for linear regression from first principles - derive the gradient, analyze the loss landscape, understand learning rate via Lipschitz constants, implement momentum, gradient clipping, and convergence analysis via condition number.

Gradient Flow Polarizes Softmax Outputs towards Low-Entropy Solutions

Understanding the intricate non-convex training dynamics of softmax-based models is crucial for explaining the empirical success of transformers. In thi...

Gradient Regularized Newton Boosting Trees with Global Convergence

Gradient Boosting Decision Trees (GBDTs) dominate tabular machine learning, with modern implementations like XGBoost, LightGBM, and CatBoost being based...

GradSentry: Gradient Spectral Entropy for Backdoor Sample Filtering in Large Language Model Fine-Tuning

Fine-tuning Large Language Models with untrusted data exposes models to backdoor attacks, where poisoned samples cause targeted misbehavior. Existing sa...

GRAM: Generative Recommendation via Semantic-aware Multi-granular Late Fusion.

GRAM: Generative Recommendation via Semantic-aware M... - published at ACL 2025.

Graph Algorithms and GNNs

Master graph representations, classical graph algorithms, and graph neural networks - from BFS/DFS and PageRank to GCN, GraphSAGE, and GAT with PyTorch Geometric.

Graph Attention Networks

GAT - learning which neighbors matter via attention over graph edges. Multi-head attention, GATv2's dynamic attention, heterophilic graphs, and training on Cora with PyTorch Geometric.

Graph Convolutional Networks

GCN derivation from spectral graph theory to efficient spatial message passing. Symmetric normalization, renormalization trick, over-smoothing, and training on Cora with PyG.

Graph of Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills

Skill usage has become a core component of modern agent systems and can substantially improve agents' ability to complete complex tasks. In real-world s...

Graph RAG

Master Microsoft's Graph RAG - build knowledge graphs from documents, use community detection for global queries, and understand when graph structure beats flat vector search.

Graph Representation for ML

Node embeddings from shallow methods to GNNs - DeepWalk, Node2Vec, LINE, spectral embeddings, manual features, and their fundamental limitations. How to featurize nodes, edges, and graphs.

Graph-Based Chain-of-Thought Pruning for Reducing Redundant Reflections in Reasoning LLMs

Extending CoT through RL has been widely used to enhance the reasoning capabilities of LLMs. However, due to the sparsity of reward signals, it can also...

Graph-Informed Adversarial Modeling: Infimal Subadditivity of Interpolative Divergences

We study adversarial learning when the target distribution factorizes according to a known Bayesian network. For interpolative divergences, including $(...

GraphSAGE and Inductive Learning

GraphSAGE - sample and aggregate for inductive GNNs that generalize to unseen nodes. Neighbor sampling, mini-batch training, unsupervised learning, and PinSage for billion-scale recommendations.

GRASP: GRanularity-Aware Search Policy for Agentic RAG

Agentic retrieval-augmented generation (RAG) extends static RAG by allowing language models to iteratively reason, generate search queries, retrieve evi...

GrepSeek: Training Search Agents for Direct Corpus Interaction

Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reasoning and infor...

Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models

Vision-Language Models (VLMs) excel on many multimodal reasoning benchmarks, but these evaluations often do not require an exhaustive readout of the ima...

Groq LPU Architecture

How Groq's Language Processing Unit eliminates the memory bottleneck for LLM inference by keeping model weights in on-chip SRAM and using deterministic compiler-scheduled execution.

gRPC and Protocol Buffers

Learn gRPC and Protocol Buffers for high-performance ML inference APIs - from protobuf wire format to bidirectional streaming, interceptors, health checks, and production deployment patterns.

GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

The development of general-purpose agents requires a shift from executing simple instructions to completing complex, real-world productivity workflows....

Guardrails and Safety Systems

Build layered defense-in-depth safety systems for LLM applications - input filtering, toxicity detection, PII redaction, prompt injection defense, output validation, and human review escalation.

GUI Automation with Vision

Vision-based GUI automation for desktop applications - coordinate grounding, UI element detection, OCR integration, state tracking, and building a desktop automation agent.

GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection

Despite the rapid progress of multimodal large language models in building Graphical User Interface (GUI) agents, their real-world task completion is fu...

Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

Group-advantage-based reinforcement learning methods, such as GRPO and DAPO, have demonstrated strong performance across diverse domains, including math...

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering larg...

H-STAR: LLM-driven Hybrid SQL-Text Adaptive Reasoning on Tables.

H-STAR: LLM-driven Hybrid SQL-Text Adaptive Reasonin... - published at NAACL 2025.

Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting

Training embodied AI agents depends critically on the visual fidelity of simulation environments and the ability to model dynamic humans. Current simula...

Hallo4D: Multi-Modal Hallucination Mitigation for Consistent Spatio-Temporal Generation

While recent advances in 3D generation have enabled impressive visual synthesis, existing methods often rely on 2D diffusion supervision without explici...

Hallucination Risk in Legal AI

Why LLM hallucination is malpractice in legal contexts, grounding strategies, citation verification pipelines, and architecture patterns for trustworthy legal AI.

Hallucinations Undermine Trust; Metacognition is a Way Forward

Despite significant strides in factual reliability, errors -- often termed hallucinations -- remain a major concern for generative AI, especially as LLM...

Handling LLM Latency

Perceived latency, progressive rendering, streaming, prompt caching, and UX patterns for making slow AI responses feel fast.

Hardware Acceleration Beyond GPU

FPGA, ASIC, TPU systolic arrays, neuromorphic chips, photonic computing, and processing-in-memory for ML - when to use each, economic analysis, and the emerging hardware landscape beyond NVIDIA GPUs.

Hardware and Silicon for AI

GPU architecture, CUDA programming, custom silicon, kernel optimization, memory systems, and distributed training hardware - the layer below the framework that determines what is actually possible.

Hardware Performance Counters

Master hardware performance counters, the PMU, and Linux perf to diagnose CPU bottlenecks, optimize cache behavior, and profile ML workloads with surgical precision.

Hardware Requirements and Selection

How to select hardware for running LLMs locally - VRAM and RAM requirements by model size, GPU tier comparison, Apple Silicon analysis, CPU-only inference feasibility, and a practical hardware selection matrix.

Harness Handbook: Making Evolving Agent Harnesses Readable, Navigable, and Editable

The capability of a modern AI agent depends not only on its foundation model but also on its harness, which constructs prompts, manages state, invokes t...

Hash Tables and Bloom Filters

Deep dive into hash table internals, consistent hashing for distributed ML, Bloom filters for training data deduplication, MinHash LSH for near-duplicate detection, and fingerprinting for dataset versioning.

HBM and GDDR Memory Technologies

High Bandwidth Memory vs GDDR6X - how 3D stacking with Through-Silicon Vias enables HBM3 to deliver 3.35 TB/s on H100, why GDDR6X tops out at about 1 TB/s, the economics of each, and how memory bandwidth constrains LLM inference throughput.

HDP: A Lightweight Cryptographic Protocol for Human Delegation Provenance in Agentic AI Systems

Agentic AI systems increasingly execute consequential actions on behalf of human principals, delegating tasks through multi-step chains of autonomous ag...

HDR Video Generation via Latent Alignment with Logarithmic Encoding

High dynamic range (HDR) imagery offers a rich and faithful representation of scene radiance, but remains challenging for generative models due to its m...

Healthcare AI GYM for Medical Agents

Clinical reasoning demands multi-step interactions -- gathering patient history, ordering tests, interpreting results, and making safe treatment decisio...

Heap and Stack Memory

Learn how stack frames, heap allocation, and Python's memory model work under the hood - from C struct padding to pymalloc arenas, with production debugging techniques.

Heavy-Tailed and Long-Range Dependent Noise in Stochastic Approximation: A Finite-Time Analysis

Stochastic approximation (SA) is a fundamental iterative framework with broad applications in reinforcement learning and optimization. Classical analyse...

HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness

Recent advances in agentic harness with orchestration frameworks that coordinate multiple agents with memory, skills, and tool use have achieved remarka...

Helix4D: Complex 4D Mesh Generation

Current video-to-4D methods struggle with complex topology changes, transparent materials, thin structures, and inner surfaces. We present Helix4D, a dy...

Helm for ML Deployments

Helm charts for ML applications - chart anatomy, parameterizing ML deployments, environment values files, lifecycle hooks for model validation, and umbrella charts for multi-component stacks.

Heterogeneous Scientific Foundation Model Collaboration

Agentic large language model systems have demonstrated strong capabilities. However, their reliance on language as the universal interface fundamentally...

Hexagonal Architecture (Ports and Adapters)

Implement Hexagonal Architecture in Python using Protocol-based ports, swappable adapters, and clear boundaries between application logic and external systems.

Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring

Vision-Language-Action (VLA) models enable robots to follow natural language instructions and generalize across diverse tasks, but they remain vulnerabl...

Hierarchical Abstract Tree for Cross-Document Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) enhances large language models with external knowledge, and tree-based RAG organizes documents into hierarchical in...

Hierarchical Clustering

Agglomerative and divisive hierarchical clustering - linkage criteria, dendrograms, cophenetic correlation, and production-scale strategies for discovering multi-scale data structure.

Hierarchical Denoising For Multi-Step Visual Reasoning

Video models are evolving into vision foundation models, yet they still lack human-like multi-step reasoning. Streaming autoregressive diffusion models...

Hierarchical Industrial Demand Forecasting with Temporal and Uncertainty Explanations

Hierarchical time-series forecasting is essential for demand prediction across various industries. While machine learning models have obtained significa...

Hierarchical Inference and Closure Learning via Adaptive Surrogates for ODEs and PDEs

Inverse problems are the task of calibrating models to match data. They play a pivotal role in diverse engineering applications by allowing practitioner...

Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis

The Hierarchical Kernel Transformer (HKT) is a multi-scale attention mechanism that processes sequences at L resolution levels via trainable causal down...

Hierarchical Planning with Latent World Models

Model predictive control (MPC) with learned world models has emerged as a promising paradigm for embodied control, particularly for its ability to gener...

Hierarchical Sparse Attention Done Right: Toward Infinite Context Modeling

Scaling modern large language models (LLMs) to long contexts is limited by the quadratic computation cost, and poor length extrapolation of dense attent...

Hierarchical SVG Tokenization: Learning Compact Visual Programs for Scalable Vector Graphics Modeling

Recent large language models have shifted SVG generation from differentiable rendering optimization to autoregressive program synthesis. However, existi...

High-dimensional Adaptive MCMC with Reduced Computational Complexity

We propose an adaptive MCMC method that learns a linear preconditioner which is dense in its off-diagonal elements but sparse in its parametrisation. Du...

High-dimensional Many-to-many-to-many Mediation Analysis

We study high-dimensional mediation analysis in which exposures, mediators, and outcomes are all multivariate, and both exposures and mediators may be h...

HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is n...

Histopathology Image Normalization via Latent Manifold Compaction

Batch effects arising from technical variations in histopathology staining protocols, scanners, and acquisition pipelines pose a persistent challenge fo...

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often...

HMS-BERT: Hybrid Multi-Task Self-Training for Multilingual and Multi-Label Cyberbullying Detection

Cyberbullying on social media is inherently multilingual and multi-faceted, where abusive behaviors often overlap across multiple categories. Existing m...

HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

Online 3D reconstruction requires estimating camera pose and scene geometry under strict causal and bounded-memory constraints. Existing methods often s...

How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

This paper localizes the policy routing mechanism in alignment-trained language models. An intermediate-layer attention gate reads detected content and...

How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning

Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry nee...

How can embedding models bind concepts?

Humans easily determine which color belongs to which shape in multi-object scenes, an ability known as concept binding. Vision-language embedding models...

How Coding Agents Work

Deep dive into coding agent architecture: how agents navigate codebases, plan edits, execute changes, and iterate using test feedback.

How Context Shapes Truth: Geometric Transformations of Statement-level Truth Representations in LLMs.

How Context Shapes Truth: Geometric Transformations... - published at ACL 2026.

How Credible Is an Answer From Retrieval-Augmented LLMs? Investigation and Evaluation With Multi-Hop QA.

How Credible Is an Answer From Retrieval-Augmented L... - published at COLING 2025.

How Far Will They Go? Red-Teaming Online Influence with Large Language Models

As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campa...

How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewar...

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptat...

How Python Works Internally

A deep dive into CPython's architecture - from source code to bytecode execution, the GIL, memory management, and the Python object model that every serious Python engineer should understand.

How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised fine-tuning (SFT). However, for em...

How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

Agent skills, which are reusable, domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet formally benc...

HP-Edit: A Human-Preference Post-Training Framework for Image Editing

Common image editing tasks typically adopt powerful generative diffusion models as the leading paradigm for real-world content editing. Meanwhile, altho...

HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

Hybrid-reasoning large language models (LLMs) expose explicit controls over reasoning effort, allowing users or systems to trade off answer quality agai...

HSG: Hyperbolic Scene Graph

Scene graph representations enable structured visual understanding by modeling objects and their relationships, and have been widely used for multiview...

HTTP Deep Dive - What Actually Travels Over the Wire

Master HTTP/1.1 at the byte level - request/response wire format, method semantics, status code families, critical headers, connection pooling, the requests and httpx libraries, HTTP/2 multiplexing, and why every production client needs explicit timeouts.

HTTP/3 and QUIC

Understand HTTP/3 and QUIC - how QUIC solves TCP head-of-line blocking with UDP-based multiplexing, 0-RTT connection establishment, TLS 1.3 integration, and what it means for ML inference serving latency.

HuggingFace Ecosystem

Use the HuggingFace ecosystem end-to-end - transformers, datasets, Trainer API, PEFT/LoRA for efficient fine-tuning, the Hub for sharing models, and tokenizer internals.

HuggingFace Hub and Model Cards

Master the HuggingFace Hub as your primary interface for finding, evaluating, and deploying open-source models. Learn to read model cards, use the Hub API, and navigate 800k+ models efficiently.

Human Evaluation

Design rigorous human evaluation studies for LLMs - from annotation protocols to inter-annotator agreement to Chatbot Arena methodology.

Human Evaluation for Agents

When and how to run human evaluation for agentic systems - annotator selection, rubric design, inter-annotator agreement, crowdsourcing quality control, and closing the feedback loop.

Human Feedback Collection

Collecting preference data, thumbs ratings, and corrections for RLHF pipelines - preference interface design, feedback quality controls, DPO data formats, and Elo-based model ranking.

Human Oversight Mechanisms

Design human oversight that is meaningful, not performative - risk-based interruption, async approval queues, audit trails, and graduated autonomy.

HumanNet: Scaling Human-centric Video Learning to One Million Hours

Progress in embodied intelligence increasingly depends on scalable data infrastructure. While vision and language have scaled with internet corpora, lea...

HunyuanOCR-1.5: Making Lightweight OCR VLMs Faster and Better

We present HunyuanOCR-1.5, a lightweight end-to-end OCR-specialized vision-language model. HunyuanOCR unifies document parsing, text spotting, informati...

HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

We introduce HY-Embodied-0.5, a family of foundation models specifically designed for real-world embodied agents. To bridge the gap between general Visi...

HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

We introduce HY-World 2.0, a multi-modal world model framework that advances our prior project HY-World 1.0. HY-World 2.0 accommodates diverse input mod...

Hybrid Architectures - Jamba and Beyond

How combining attention and Mamba layers creates models that outperform pure architectures - Jamba's design, the attention-to-Mamba ratio, MoE integration, and the emerging hybrid landscape.

Hybrid Graphs for Table-and-Text based Question Answering using LLMs.

Hybrid Graphs for Table-and-Text based Question Answ... - published at NAACL 2025.

Hybrid Policy Distillation for LLMs

Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of d...

Hybrid Search - Dense and Sparse Retrieval

Combine BM25 keyword search with dense vector search using SPLADE, Reciprocal Rank Fusion, and learned sparse models to build retrieval systems that beat pure semantic search.

Hybrid Search and Reranking

How to combine BM25 sparse retrieval with dense vector search using Reciprocal Rank Fusion, and how to apply cross-encoder reranking for precision that neither method achieves alone.

Hybrid Search: Dense and Sparse

Combine BM25 sparse retrieval with dense vector search for best-of-both-worlds performance - understand SPLADE, fusion methods, and when hybrid beats pure dense.

HyCOP: Hybrid Composition Operators for Interpretable Learning of PDEs

We introduce HyCOP, a modular framework that learns parametric PDE solution operators by composing simple modules (advection, diffusion, learned closure...

Hyper Input Convex Neural Networks for Shape Constrained Learning and Optimal Transport

We introduce Hyper Input Convex Neural Networks (HyCNNs), a novel neural network architecture designed for learning convex functions. HyCNNs combine the...

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds w...

HyperFitS -- Hypernetwork Fitting Spectra for metabolic quantification of ${}^1$H MR spectroscopic imaging

Purpose: Proton magnetic resonance spectroscopic imaging ($^1$H MRSI) enables the mapping of whole-brain metabolite concentrations in-vivo. However, a...

Hyperparameter Optimization

Systematic HPO - grid search, random search, Bayesian optimization with Optuna, Hyperband/ASHA pruning, and multi-objective optimization for production ML.

Hypothesis Testing over Observable Regimes in Singular Models

Hypothesis testing in singular statistical models is often regarded as inherently problematic due to non-identifiability and degeneracy of the Fisher in...

I know you are different! Towards Persona Driven Knowledge-infused Dialogue Assistant.

I know you are different! Towards Persona Driven Kno... - published at EACL 2026.

IaC for ML Teams

Why ML teams need Infrastructure as Code - reproducible environments, audit trails, cost control, and eliminating the manual infrastructure chaos that breaks ML at scale.

IaC Patterns for ML Platforms

Production IaC patterns for ML platform engineering - golden paths, blue-green infrastructure, self-destructing experiment environments, OPA policies, GPU quota management, and the internal developer platform model.

IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

Key-Value (KV) cache plays a crucial role in accelerating inference in large language models (LLMs) by storing intermediate attention states and avoidin...

Ideas Have Genomes: Benchmarking Scientific Lineage Reasoning and Lineage-Grounded Idea Generation

Scientific ideas rarely start from a blank page. They inherit mechanisms, repair known limitations, and recombine pieces of earlier work, much like biol...

Idempotency and Retries

Making LLM-powered workflows robust with idempotency keys, smart retries, distributed deduplication, workflow state persistence, and failure-tolerant pipeline design for production AI systems.

Identifying Causal Effects Using a Single Proxy Variable

Unobserved confounding is a key challenge when estimating causal effects from a treatment on an outcome in scientific applications. In this work, we ass...

Image Generators are Generalist Vision Learners

Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent ca...

Image2Sim: Scaling Embodied Navigation via Generative Neural Simulator

Embodied navigation aims to build agents that interpret multimodal goals, reason in 3D space, and reach target destinations reliably in the real world....

Imagined Rollouts are Kinematic, Not Dynamic: A Diagnosis of Long-Horizon World-Model Failure

Long-horizon failure in world models is conventionally attributed to compounding error, a generic framing that does not distinguish what kind of error c...

Immutability Strategies - Tuples, Frozen Dataclasses, and Value Objects

Master Python's immutability toolkit at engineering depth - mutable vs immutable types, shallow vs deep immutability, namedtuple, frozen dataclasses, frozenset, MappingProxyType, and the replace/copy pattern for functional state updates. Covers DDD value objects and Redux-style state in Python.

ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models

Existing memory benchmarks for LLM agents evaluate explicit recall of facts, yet overlook implicit memory where experience becomes automated behavior wi...

Import Hooks and the Import System - Intercepting Module Loading

Master Python's import machinery - sys.meta_path finders, loaders, ModuleSpec, lazy imports, AST transformation on import, circular imports, and importlib.metadata for plugin discovery.

Improved Scaling Laws via Weak-to-Strong Generalization in Random Feature Ridge Regression

It is increasingly common in machine learning to use learned models to label data and then employ such data to train more capable models. The phenomenon...

Improving Semantic Proximity in Information Retrieval through Cross-Lingual Alignment

With the increasing accessibility and utilization of multilingual documents, Cross-Lingual Information Retrieval (CLIR) has emerged as an important rese...

In-Context Working Memory

Managing the context window as working memory: token budgeting, sliding windows, summarization, and the lost-in-the-middle problem.

In-Place Test-Time Training

The static "train then deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to contin...

Industrial IoT and ML

Learn how to build IIoT data pipelines connecting industrial protocols (OPC-UA, MQTT, Modbus) to time-series databases, Kafka, and ML inference systems for manufacturing intelligence.

Inference Cost Optimization

Reduce LLM inference costs by 60–80% through quantization, intelligent batching, right-sizing, and autoscaling - turning an $80K/month bill into $20K.

Inference Cost Optimization

The economics of LLM inference serving - cost per million tokens, GPU utilization, continuous batching, speculative decoding, KV cache management, and building production systems under $1 per million tokens.

Inference Cost Optimization

Learn how to systematically reduce LLM inference costs using model selection, quantization, caching, request routing, prompt compression, and infrastructure strategies.

Inference Cost Optimization

Reducing ML serving costs at scale - quantization ROI, batching economics, instance right-sizing, caching strategies, and LLM cost-per-token analysis.

Inference Optimization for MoE Models

Production techniques for serving MoE models efficiently - expert caching, CPU offloading, vLLM support, tensor vs. expert parallelism, batch size sensitivity, and quantization strategies.

Inference Scaling

Horizontal and vertical scaling for ML inference - autoscaling policies, KEDA with custom GPU metrics, spot instances, global load balancing, and handling traffic spikes.

Inferential Mechanics Part 1: Causal Mechanistic Theories of Machine Learning in Chemical Biology with Implications

Machine learning techniques are now routinely encountered in research laboratories across the globe. Impressive progress has been made through ML and AI...

Infinite Worlds with Versatile Interactions

We present LingBot-World 2.0 (also known as LingBot-World-Infinity), an advanced iteration of LingBot-World featuring four distinct upgrades. (1) Our mo...

InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

Large language models are emerging as scientific assistants, but evaluating their ability to reason from empirical data remains challenging. Benchmarks...

Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics

Understanding the theoretical foundations of attention mechanisms remains challenging due to their complex, non-linear dynamics. This work reveals a fun...

Information Gain, Gini Impurity, and Entropy

A deep dive into how decision trees choose splits - Shannon entropy, information gain, Gini impurity, gain ratio, regression variance reduction, and the multi-valued feature bias every practitioner must understand.

Information Router for Mitigating Modality Dominance in Vision-Language Models

Vision-language models (VLMs) have demonstrated strong performance across a wide range of benchmarks, yet they often suffer from modality dominance, whe...

Information-geometric adaptive sampling for graph diffusion

Standard diffusion models for graph generation typically rely on uniform time-stepping, an approach that overlooks the non-homogeneous dynamics of distr...

Infrastructure as Code for ML

IaC for ML infrastructure - Terraform GPU clusters on AWS/GCP/Azure, Helm charts for model serving, Pulumi Python IaC, Ansible for GPU node setup, GitOps with ArgoCD, spot instance handling, and infrastructure cost optimization.

Infrastructure Monitoring for ML Systems

Monitoring the infrastructure layer of ML systems - CPU, GPU, memory, latency, the four monitoring layers, custom ML metrics with Prometheus, and building the observability foundation for model quality monitoring.

Inheritance - Single, Multiple, and Cooperative at Engineering Depth

Master Python inheritance at the engineering level - what inheritance actually does to namespaces, single and multiple inheritance, the MRO algorithm, cooperative super(), the fragile base class problem, isinstance/issubclass, and when inheritance is correct.

Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization

Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit prec...

Injecting Image Guidance into Text-Conditioned Diffusion Models at Inference

Text-to-image diffusion models like Stable Diffusion generate high-quality images from text, but lack a way to inject visual guidance (e.g. sketches, st...

InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

Reducing the hardware footprint of large language models (LLMs) during decoding is critical for efficient long-sequence generation. A key bottleneck is...

Input Validation and Sanitization

Use Pydantic validators as security boundaries - prevent SQL injection, XSS, path traversal, SSRF, and file upload attacks through structural input validation in FastAPI.

InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

Text and faces are among the most perceptually salient and practically important patterns in visual generation, yet they remain challenging for autoregr...

INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation...

Instruction Tuning

How instruction tuning transforms base LLMs into general-purpose assistants that can follow diverse instructions, reason step by step, and generalize to new tasks.

Instruction Tuning at Scale

How to instruction-tune open-source models at production scale - covering the FLAN insight, dataset construction principles, scaling laws for instruction data, multi-node training setup, and a complete pipeline for fine-tuning Llama 3 8B on a 2-node A100 cluster.

Instruction-Guided Poetry Generation in Arabic and Its Dialects

Poetry has long been a central art form for Arabic speakers, serving as a powerful medium of expression and cultural identity. While modern Arabic speak...

Instruction-Level Optimization

Master ILP, vectorized loads, loop unrolling, and instruction scheduling to extract maximum throughput from CUDA kernels - the techniques separating 31% from 78% peak utilization.

Instructor - Structured Outputs with Pydantic

A complete guide to Jason Liu's Instructor library - Pydantic-based structured extraction, automatic retry on validation failure, multi-provider support, streaming, and production extraction patterns.

InstructSAM: Segment Any Instance with Any Instructions

In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We f...

Integrated electro-optic attention nonlinearities for transformers

Transformers have emerged as the dominant neural-network architecture, achieving state-of-the-art performance in language processing and computer vision...

Intel Gaudi and Habana Labs

Intel Gaudi AI accelerator architecture - Tensor Processor Cores, built-in RoCE scale-out networking, SynapseAI SDK, and price-performance positioning against NVIDIA H100 for LLM training.

Intellectual Property and AI

Patent analysis, prior art search, trademark similarity detection, and the ML systems that support patent prosecution, portfolio management, and IP litigation.

IntentGrasp: A Comprehensive Benchmark for Intent Understanding

Accurately understanding the intent behind speech, conversation, and writing is crucial to the development of helpful Large Language Model (LLM) assista...

IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators a...

InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

With the advancement of multimodal large language models (MLLMs) and coding agents, website development has shifted from manual programming to agent...

Interleaving Experiments

Use interleaving to compare ranking models with 10-25x better sensitivity than A/B tests - the technique behind fast iteration at search and recommendation companies.

InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search

Existing benchmarks for multimodal agentic search evaluate multimodal search and visual browsing, but visual evidence is either confined to the input or...

Interpretability vs Explainability - Clearing Up the Confusion

The difference between understanding how a model works (interpretability) and explaining a specific prediction (explainability) - and why that distinction shapes regulation, trust, and system design.

InTriage: Intelligent Telephone Triage in Pre-Hospital Emergency Care.

InTriage: Intelligent Telephone Triage in Pre-Hospit... - published at EMNLP 2025.

Introspective Diffusion Language Models

Diffusion language models promise parallel generation, yet still lag behind autoregressive (AR) models in quality. We trace this gap to a failure of intr...

Invariance-Based Dynamic Regret Minimization

We consider stochastic non-stationary linear bandits where the linear parameter connecting contexts to the reward changes over time. Existing algorithms...

Inventory Optimization

Newsvendor problem, safety stock optimization, reorder point prediction, multi-echelon inventory, and ML-driven policies that balance stockouts against carrying costs at retail scale.

Inverse Contextual Bandits without Rewards: Learning from a Non-Stationary Learner via Suffix Imitation

We study the Inverse Contextual Bandit (ICB) problem, in which a learner seeks to optimize a policy while an observer, who cannot access the learner's r...

Inversion-Free Natural Gradient Descent on Riemannian Manifolds

The natural gradient method is widely used in statistical optimization, but its standard formulation assumes a Euclidean parameter space. This paper pro...

IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models.

IrokoBench: A New Benchmark for African Languages in... - published at NAACL 2025.

Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder

Training Transformer language models is expensive, as performance typically improves with increasing dataset size and computational budget. Although sca...

Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training

Reinforcement learning (RL) has become a central component of post-training large language models (LLMs), yet little is understood about how RL adaptati...

Is Position Bias in Dense Retrievers Built In - or Learned from Data?

Dense retrievers exhibit positional bias, favoring documents whose query-relevant information appears near the beginning and degrading retrieval perform...

Iterative Identification Closure: Amplifying Causal Identifiability in Linear SEMs

The Half-Trek Criterion (HTC) is the primary graphical tool for determining generic identifiability of causal effect coefficients in linear structural e...

iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language model...

Jailbreaks and Adversarial Prompts

How safety training gets bypassed - jailbreak taxonomy, GCG attacks, many-shot jailbreaking, prompt injection, defenses, and why the arms race is hard to win.

Jailbreaks and Bypasses

Taxonomy of jailbreak techniques, why they work, evaluation frameworks, and layered defense strategies for production LLM systems.

JD Oxygen AI Item Center (Oxygen AIIC) V1: An Industrial-Scale LLM/VLM-Centric Solution for Item Understanding, Management, and Applications

JD.com, one of the world's largest e-commerce platforms, serves over 700 million active users and millions of merchants, with a catalog of tens of billi...

Jet-Long: Efficient Long-Context Extension with Dynamic Bifocal RoPE

Modern LLMs are increasingly deployed in long-context applications such as retrieval-augmented generation, repository-level coding, and agentic workflow...

JIT Compilation and numba

Just-in-time compilation from first principles, numba's LLVM backend and type inference system, GPU kernels with numba CUDA, and when JIT compilation delivers real performance gains.

JLT: Clean-Latent Prediction in Latent Diffusion Transformers

Flow matching with clean-data prediction has shown that regressing the clean point can exploit low-dimensional structure more effectively than predictin...

Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving reasoning capability of large language models,...

Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization

We propose HILBERT (HIerarchical Long-sequence Balanced Embedding with Reciprocal contrastive Training), a cross-attentive multimodal framework for lear...

JSON Mode and Tool/Function Schemas

A complete guide to native JSON mode, OpenAI Structured Outputs, tool calling for structured data, Anthropic tool use, parallel tool calls, and schema design best practices.

JSON Serialization - Production-Grade Encoding and Decoding

Master JSON serialization in Python at engineering depth - custom encoders, datetime/Decimal/UUID handling, orjson and msgspec for high-throughput APIs, NDJSON streaming, content negotiation, and why float precision silently destroys financial data.

JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models

Adapter-based methods have become a cost-effective approach to continual learning (CL) for Large Language Models (LLMs), by sequentially learning a low-...

JWT Authentication

Master stateless JWT authentication - token structure, signing algorithms, refresh token rotation, common pitfalls, and building production-grade FastAPI JWT middleware.

K-Means Clustering

Master K-means clustering - Lloyd's algorithm convergence proof, K-means++ initialization with D² weighting, silhouette analysis, elbow method, Mini-batch K-means for large datasets, and customer segmentation pipelines.

Kafka for ML Systems

Using Apache Kafka as the backbone of production ML systems - schema registry, CDC, exactly-once semantics, and dead letter queues.

Kafka Streams vs Apache Flink - The ML Pipeline Decision Guide

A comprehensive comparison of Kafka Streams, Faust, and Apache Flink for building real-time ML feature pipelines, with a production decision framework and working code examples.

Kernel Bypass and DPDK

Kernel bypass networking for ML clusters - DPDK architecture, RDMA and InfiniBand for GPU-to-GPU communication, NCCL's bypass path, io_uring, eBPF, and when these techniques matter for AllReduce latency.

Kernel Fusion Strategies

How kernel fusion eliminates HBM round-trips between chained GPU operations, how torch.compile and TorchInductor identify fusible patterns, and how to write manual fused kernels with Triton for maximum throughput.

Kernel Integrated $R^2$: A Measure of Dependence

We introduce kernel integrated $R^2$, a new measure of statistical dependence that combines the local normalization principle of the recently introduced...

KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capabili...

Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

Recent advances in large language models (LLMs) have increasingly relied on reinforcement learning (RL) to improve their reasoning capabilities. Three a...

KeyFrame-Compass: Towards Comprehensive Evaluation of Keyframe-Conditioned Video Generation

Video generation increasingly relies on keyframe-based workflows, where creators specify a sequence of reference images to guide generation. Although re...

KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

Robotic systems that interact with the physical world must reason about kinematic and dynamic constraints imposed by their own embodiment, their environ...

KLIP: localized distribution shift detection via KL-divergence with diffusion priors in Inverse Problems

Diffusion models have shown promising performance as data-driven priors for computational imaging, as well as some capacity to detect out-of-distributio...

Know Before Fix: QA-Driven Repository Knowledge Acquisition for Software Issue Resolution

LLM-based coding agents have significantly advanced automated software issue resolution, yet they remain highly prone to factual errors caused by insuff...

KnowAct-GUIClaw: Know Deeply, Act Perfectly, Personal GUI Assistant with Self-Evolving Memory and Skill

OpenClaw has emerged as a leading agent framework for complex task automation, yet it faces insufficient cross-platform GUI interaction support and a we...

Knowledge Distillation for LLMs

Training smaller student models to match larger teacher models - soft labels, temperature scaling, intermediate representation matching, API-based distillation, and a complete production pipeline for task-specific deployment.

Knowledge Graph Embeddings

TransE, RotatE, CompGCN - embedding entities and relations in vector spaces to predict missing facts in knowledge graphs, enabling AI systems to reason about structured world knowledge.

Knowledge Tracing Models

Learn Bayesian Knowledge Tracing (BKT), Deep Knowledge Tracing (DKT), SAKT, and AKT - models that estimate student knowledge state over time from interaction sequences.

KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

RLVR improves reasoning in large language models, but its effectiveness is often limited by severe reward sparsity on hard problems. Recent hint-based R...

KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

Personalized mobile agents that infer user preferences and calibrate proactive assistance hold great promise as everyday digital assistants, yet existin...

Kolmogorov-Arnold causal generative models

Causal generative models provide a principled framework for answering observational, interventional, and counterfactual queries from observational data....

KronQ: LLM Quantization via Kronecker-Factored Hessian

Post-training quantization (PTQ) is a widely adopted technique for compressing large language models (LLMs) without retraining. Existing second-order PT...

KServe and Kubernetes ML Operators

Custom Kubernetes operators for ML workflows - what operators enable, KServe for standardized model serving, Seldon Core, the Kubeflow Training Operator, Argo Workflows, and when to build vs. use existing operators.

Kubeflow Pipelines

Building, compiling, and running production ML pipelines on Kubernetes using Kubeflow Pipelines v2 with MLMD metadata tracking and automatic retraining triggers.

Kubernetes and Auto-Scaling for LLMs

Deploy LLMs on Kubernetes with GPU scheduling, HPA and KEDA for autoscaling, MIG partitioning on A100/H100, and Karpenter for on-demand GPU node provisioning.

Kubernetes for ML

Use Kubernetes as ML infrastructure - from GPU scheduling and device plugins to Kubeflow Pipelines and autoscaling - migrating ML workloads from VMs to K8s without disruption.

Kubernetes Fundamentals for ML Engineers

The minimum Kubernetes knowledge every ML engineer needs to be productive - pods, deployments, services, resource requests, GPU allocation, probes, and persistent volumes.

KV Cache

Learn how the key-value cache eliminates redundant attention computation in LLM inference, and how PagedAttention solves the memory fragmentation problem.

KV Cache Management and PagedAttention

How the KV cache works in transformer inference, why naive memory allocation wastes 60-70% of GPU memory, and how PagedAttention from vLLM solved fragmentation using virtual memory techniques from operating systems.

KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs

Large Language Models (LLMs) rely heavily on Key-Value (KV) caching to minimize inference latency. However, standard KV caches are context-dependent: re...

KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM id...

L2GTX: From Local to Global Time Series Explanations

Deep learning models achieve high accuracy in time series classification, yet understanding their class-level decision behaviour remains challenging. Ex...

LACUNA: Safe Agents as Recursive Program Holes

LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes. The runtime o...

Lakehouse Architecture for ML

Lakehouse architecture for ML systems - Delta Lake, Apache Iceberg, Apache Hudi, medallion architecture, query engines, and ML pipelines on the lakehouse.

Lakehouse for ML Workflows

Storing training datasets, experiment artifacts, and model outputs in a lakehouse.

Lakehouse Query Engines

Trino, DuckDB, Spark SQL - querying open table formats at scale.

Lambda and Kappa Architecture for ML Systems

Master Lambda and Kappa architecture - the two dominant patterns for building ML systems that handle both historical and real-time data at scale.

Lambda Expressions - Anonymous Functions at Engineering Depth

Understand Python lambda expressions at engineering depth - anonymous function objects, compile-time vs call-time evaluation, the loop-closure trap, late binding, the default-argument fix, and when lambda is and is not appropriate.

LangChain Deep Dive

A thorough guide to LangChain's core abstractions, LCEL composable pipelines, LangGraph stateful workflows, LangSmith observability, and when to use LangChain vs direct API calls.

LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

Continuous diffusion has been the foundation of high-fidelity, controllable, and few-step generation of many data modalities such as images. However, in...

Langfuse - Open-Source LLM Observability

Master Langfuse for production LLM observability - self-hosted tracing, evaluation datasets, prompt management, cost attribution by feature, and full data sovereignty for regulated industries.

LangGraph

LangGraph: stateful graph-based multi-agent systems with checkpointing, human-in-the-loop, streaming, and the supervisor pattern - the most powerful and flexible agent framework.

LangGraph for Stateful Agents

Graph-based stateful agent orchestration with LangGraph - StateGraph, typed state, nodes, conditional edges, checkpointing, and human-in-the-loop.

LangSmith Deep Dive

Master LangSmith for LLM observability - production tracing, dataset curation, evaluation pipelines, prompt versioning, annotation queues, and deployment gating for AI systems.

Language Modeling Objectives

Learn the training objectives that teach LLMs to understand language - causal language modeling, masked language modeling, cross-entropy loss, and perplexity.

Language Models Need Sleep

Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context leng...

LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

Reinforcement learning (RL) post-training has been shown to improve reasoning in large language models (LLMs). However, there has been little exploration on...

Large deviation principles for convolutional Bayesian neural networks

While suitably scaled CNNs with Gaussian initialization are known to converge to Gaussian processes as the number of channels diverges, little is known...

Large Language Model Systems

Deploying Llama-3-70B for a 100K DAU application - vLLM serving, tensor parallelism, KV cache management, speculative decoding, LoRA serving, cost management, and RAG integration.

Large Language Models Align with the Human Brain during Creative Thinking

Creative thinking is a fundamental aspect of human cognition, and divergent thinking - the capacity to generate novel and varied ideas - is widely regarded...

Large Language Models Explore by Latent Distilling

Generating diverse responses is crucial for test-time scaling of large language models (LLMs), yet standard stochastic sampling mostly yields surface-le...

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely by...

Large-Scale Memory Optimization

Master the memory math behind training and serving large language models - from mixed precision and gradient checkpointing to ZeRO optimizer stages, KV cache management, and PagedAttention.

LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment

While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A...

LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety

Large language models (LLMs) often demonstrate strong safety performance in high-resource languages, yet exhibit severe vulnerabilities when queried in...

Latency and Cost Tradeoffs

How to decompose LLM latency and cost, choose the right optimization strategies, and define SLOs that balance quality, speed, and budget.

Latency vs Throughput Trade-offs in ML Systems

Understanding the fundamental tension between latency and throughput in ML serving - Little's Law, tail latency, batching strategies, and caching for production ML systems.

Latent Diffusion Models - The Architecture Behind Stable Diffusion

How Rombach et al. moved diffusion from pixel space to a compressed latent space via KL-VAE with perceptual and adversarial losses, cross-attention conditioning, and the complete Stable Diffusion pipeline - enabling high-resolution generation on consumer GPUs.

Latent Preference Modeling for Cross-Session Personalized Tool Calling

Users often omit essential details in their requests to LLM-based agents, resulting in under-specified inputs for tool use. This poses a fundamental cha...

Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning

Latent reasoning offers a more efficient alternative to explicit reasoning by compressing intermediate reasoning into continuous representations and sub...

Latent-Identity Tuning in Text-to-Image Personalization Models

Generating and editing a person's face demands high precision, as even minor modifications can significantly alter a subject's perceived identity. Curre...

LATO.2: Factorized 3D Mesh Generation with Vertex and Topology Flow

Flow matching over carefully designed latent representations has recently emerged as a powerful paradigm for topology-aware mesh generation. Existing ap...

Layer Normalization and Residual Connections

How layer normalization and residual connections solve gradient flow in deep transformers and enable training of 100+ layer networks.

Layer-wise Cross-Lingual Depression Detection from Speech: Analysis with Contrastive Alignment

Significant disparities exist in the diagnosis and clinical presentation of depression across different linguistic populations. Speech-based depression...

LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

This paper focuses on the alignment of flow matching models with human preferences. A promising way is fine-tuning by directly backpropagating reward gr...

Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Sm...

Learnability and Privacy Vulnerability are Entangled in a Few Critical Weights

Prior approaches for membership privacy preservation usually update or retrain all weights in neural networks, which is costly and can lead to unnecessa...

Learning A Unified Risk Map for Autonomous Driving in Partially Observable Environments

Occlusion-aware prediction remains a critical challenge in autonomous driving due to the inherent uncertainty of unobserved regions. Existing approaches...

Learning Adaptive Reasoning Paths for Efficient Visual Reasoning

Visual reasoning models (VRMs) have recently shown strong cross-modal reasoning capabilities by integrating visual perception with language reasoning. H...

Learning Evidence Highlighting for Frozen LLMs

Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evide...

Learning High-Frequency Continuous Action Chunks in Latent Space

Modern robotic policies increasingly rely on action chunking to execute complex tasks in the physical world. While action chunking improves temporal con...

Learning interacting particle systems from unlabeled data

Learning the potentials of interacting particle systems is a fundamental task across various scientific disciplines. A major challenge is that unlabeled...

Learning Long-term Motion Embeddings for Efficient Kinematics Generation

Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scen...

Learning Rate Scheduling

Every major learning rate schedule - step decay, cosine annealing, SGDR warm restarts, linear warmup, 1cycle policy, LR finder - with full PyTorch implementations, the warmup mechanics for Adam, polynomial decay, and a complete selection guide.

Learning Rate Transfer in Normalized Transformers

The Normalized Transformer, or nGPT (arXiv:2410.01131), achieves impressive training speedups and does not require weight decay or learning rate warmup....

Learning the Helmholtz equation operator with DeepONet for non-parametric 2D geometries

This paper deals with solving the 2D Helmholtz equation on non-parametric domains, leveraging a physics-informed neural operator network based on the De...

Learning the Signature of Memorization in Autoregressive Language Models

All prior membership inference attacks for fine-tuned language models use hand-crafted heuristics (e.g., loss thresholding, Min-K%, reference calibrati...

Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments

Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning,...

Learning to Hint for Reinforcement Learning

Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collaps...

Learning to Learn-at-Test-Time: Language Agents with Learnable Adaptation Policies

Test-Time Learning (TTL) enables language agents to iteratively refine their performance through repeated interactions with the environment at inference...

Learning to Rank - Teaching Models to Sort, Not Just Score

How pointwise, pairwise, and listwise ranking approaches train models to produce the optimal ordering of items for search and recommendation.

Learning to Reason with Insight for Informal Theorem Proving

Although most of the automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with large language...

Learning to Retrieve from Agent Trajectories

Information retrieval (IR) systems have traditionally been designed and trained for human users, with learning-to-rank methods relying heavily on large-...

Learning Versatile Humanoid Manipulation with Touch Dreaming

Humanoid robots promise general-purpose assistance, yet real-world humanoid loco-manipulation remains challenging because it requires whole-body stabili...

Legal LLM Fine-Tuning

Domain adaptation of LLMs for legal tasks - LegalBench evaluation, instruction tuning on legal data, and building legal AI models that outperform general-purpose LLMs on specific tasks.

Legal Research Automation

Dense retrieval over case law, citation graph analysis, precedent finding, and building legal research AI that surfaces relevant authorities without hallucinating fake cases.

LEMUR: Robust Fine-Tuning for Multilingual Embedding Models for Retrieval.

LEMUR: Robust Fine-Tuning for Multilingual Embedding... - published at EACL 2026.

Length Penalties Make Chain-of-Thought Less Monitorable

Length-penalized reinforcement learning can shorten chain-of-thought reasoning while hiding an influence that drives the model's answer. In our experime...

Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling

Token serves as the fundamental unit of computation in modern autoregressive models, and generation length directly influences both inference cost and r...

Less Detail, Better Answers: Degradation-Driven Prompting for VQA

Recent advancements in Vision-Language Models (VLMs) have significantly pushed the boundaries of Visual Question Answering (VQA). However, high-resolution...

Less is More: Early Stopping Rollout for On-Policy Distillation

On-policy distillation has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring its own rollo...

Let RGB Be the Language of Vision

This work introduces a unified formulation for vision models, where diverse forms of visual information beyond natural images, such as masks, depth maps...

Leveraging Language-based Representations for Better Solving Symbol-related Problems with Large Language Models.

Leveraging Language-based Representations for Better... - published at COLING 2025.

Leveraging LLM-GNN Integration for Open-World Question Answering over Knowledge Graphs.

Leveraging LLM-GNN Integration for Open-World Questi... - published at EACL 2026.

LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models

Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in...

Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time ga...

Light-Omni: Reflex over Reasoning in Agentic Video Understanding with Long-Term Memory

Agentic video understanding equips models with long-term memory to autonomously process and respond to continuous, long-horizon multimodal streams. Howe...

Lighting-grounded Video Generation with Renderer-based Agent Reasoning

Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as...

LightMem-Ego: Your AI Memory for Everyday Life

Personal AI assistants on mobile and wearable devices continuously perceive users' daily lives through visual and audio streams. However, answering quer...

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, standard OPD requires a live teacher...

Lightning Unified Video Editing via In-Context Sparse Attention

Video editing has evolved toward In-Context Learning (ICL) paradigms, yet the resulting quadratic attention costs create a critical computational bottle...

LightThinker++: From Reasoning Compression to Memory Management

Large language models (LLMs) excel at complex reasoning, yet their efficiency is limited by the surging cognitive overhead of long thought traces. In th...

Like a Therapist, But Not: Reddit Narratives of AI in Mental Health Contexts.

Like a Therapist, But Not: Reddit Narratives of AI i... - published at ACL 2026.

LIME - Local Interpretable Model-Agnostic Explanations

LIME explains any black-box classifier by fitting a local linear approximation around a specific prediction - the algorithm, variants, limitations, and when to use it vs SHAP.

Limitations of Attention at Scale

Why the quadratic complexity of self-attention creates real production bottlenecks - memory, latency, and cost - and why sparse attention approximations only partially solve the problem.

line_profiler and memory_profiler - Line-Level Analysis

Line-by-line time and memory profiling with line_profiler, memory_profiler, tracemalloc, and pympler - finding the exact lines that are slow or leak memory.

Linear Attention Architectures: Mechanisms, Trade-offs, and Cross-Layer Routing

Self-attention lets each token retrieve information from the full context, but its quadratic cost in sequence length limits training and inference at lo...

Linear Interpolation and Model Soup

How weight averaging of fine-tuned models produces better, more robust models than any individual fine-tune - and the task arithmetic framework for composing capabilities.

Linear Models, Variable Selection, Artificial Intelligence

Variable selection in linear regression models has been a problem since hypothesis testing began. Which variables to include or exclude from a model is...

Linear Regression Internals

Deep dive into linear regression - OLS derivation, normal equations, geometric interpretation as projection, Gauss-Markov theorem, residual diagnostics, Cook's distance, VIF, multicollinearity, and full NumPy implementation.

Linear-Core Surrogates: Smooth Loss Functions with Linear Rates for Classification and Structured Prediction

The choice of loss function in classification involves a fundamental trade-off: smooth losses (like Cross-Entropy) enable fast optimization rates but yi...

Linear-Time Global Visual Modeling without Explicit Attention

Existing research largely attributes the global sequence modeling capability of Transformers to the explicit computation of attention weights, a process...

Linking spatial biology and clinical histology via Haiku

Integrating molecular, morphological, and clinical data is essential for basic and translational biomedical research, yet systematic frameworks for join...

Linting and Formatting - Ruff, Black, isort, and mypy

Master Python code quality tooling at engineering depth - Ruff's rule categories, Black's opinionated formatting, isort profiles, mypy static type checking, pyproject.toml configuration, and how to wire all tools into a coherent developer workflow.

Linux Performance Tuning

Systematic Linux performance tuning for ML workloads - sysctl parameters, CPU governors, NUMA balancing, transparent huge pages, IRQ affinity, NIC tuning, and grub options that matter for training throughput and inference latency.

Linux Process Scheduling

Understand the Linux CFS scheduler, nice values, CPU affinity, real-time scheduling, cgroups, NUMA, and how Kubernetes CPU throttling destroys ML training throughput - with concrete fixes.

Lipschitz bounds for integral kernels

Feature maps associated with positive definite kernels play a central role in kernel methods and learning theory, where regularity properties such as Li...

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for large language models (LLMs) post-training to incentivize reaso...

LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents

Mastering terminal environments requires language agents capable of multi-step planning, feedback-grounded execution, and dynamic state adaptation. Howe...

LiteLLM

Deploy LiteLLM as a universal LLM proxy supporting 100+ providers. Configure routing, load balancing, fallbacks, semantic caching, and cost tracking through a single OpenAI-compatible endpoint.

LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation

Modern sensors generate rich, high-fidelity data, yet applications operating on wearable or remote sensing devices remain constrained by bandwidth and p...

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

Are LLM-based search agents genuinely searching, or using the web to verify what they already know? We study this question on BrowseComp with three diag...

LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a nativel...

LLaMA Family Architecture

A deep dive into Meta's LLaMA model family - from LLaMA 1 through LLaMA 3.3 - covering RoPE embeddings, SwiGLU activation, RMSNorm, grouped query attention, and when to choose each variant.

llama.cpp and GGUF Format

llama.cpp - Georgi Gerganov's C++ inference engine that runs quantized LLMs on CPUs and consumer GPUs. GGUF binary format, quantization types, performance tuning, and practical local inference.

LlamaIndex Architecture

LlamaIndex's document-centric agent framework - VectorStoreIndex, QueryEngine, FunctionCallingAgent, and the Workflow event-driven orchestration model.

LlamaIndex Deep Dive

A comprehensive guide to LlamaIndex's data-centric architecture - indices, query engines, workflows, multi-document agents, and how it compares to LangChain for RAG applications.

LLaTiSA: Towards Difficulty-Stratified Time Series Reasoning from Visual Perception to Semantics

Comprehensive understanding of time series remains a significant challenge for Large Language Models (LLMs). Current research is hindered by fragmented...

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, achieving superior performanc...

LLM as Agent Judge

Using LLMs to evaluate other agents' trajectories and outputs at scale - rubric design, pairwise comparison, bias mitigation, calibration, and escalation logic.

LLM as Data Generator

Use frontier LLMs to generate high-quality instruction-following, reasoning, and preference datasets - sampling strategies, diversity maximization, and quality vs. quantity tradeoffs.

LLM CI/CD

CI/CD pipelines for LLM applications - handling non-deterministic outputs with LLM-judge gates, canary deployments with quality monitoring, automated rollback triggers, and full GitHub Actions implementation.

LLM Evaluation Pipelines

Build automated evaluation pipelines for LLM systems - LLM-as-judge, RAGAS for RAG systems, trajectory evaluation for agents, regression testing, and eval dataset curation.

LLM Gateway and Routing

Design and operate an LLM gateway - unified API, model routing, circuit breakers, budget enforcement, and fallback chains - using LiteLLM and custom routing logic.

LLM Product Architecture

The three fundamental LLM product patterns - chat, workflow automation, and autonomous agents - and how to design the production service graph for each.

LLM Safety From Within: Detecting Harmful Content with Internal Representations

Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal...

LLM-as-a-Tutor: Policy-Aware Prompt Adaptation for Non-Verifiable RL

Reinforcement learning (RL) for non-verifiable instruction following increasingly relies on LLM judges with prompt-specific rubrics as reward signals. W...

LLM-as-Judge

Build calibrated, bias-corrected LLM judges that approximate human judgment at scale - pointwise scoring, pairwise comparison, bias mitigation, and ensemble techniques.

LLM-as-Judge

Use powerful LLMs to automatically evaluate other models - with position bias mitigation, CoT judging, and cost analysis.

LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models.

LLM-Coordination: Evaluating and Analyzing Multi-age... - published at NAACL 2025.

LLM-Powered Product Architecture

End-to-end design of a production LLM-powered product - covering the serving stack, prompt management, RAG architecture, multi-LLM routing, streaming, cost management, and observability.

LLMInit: A Free Lunch from Large Language Models for Selective Initialization of Recommendation.

LLMInit: A Free Lunch from Large Language Models for... - published at EMNLP 2025.

LLMOps Platforms

Comprehensive guide to LLMOps platforms - LangSmith, Langfuse, W&B Weave, Arize Phoenix, Helicone, and PromptLayer. When to build vs buy, integration patterns, abstraction layers, and production-grade Python examples using the Anthropic SDK.

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during infe...

LLVM and MLIR

LLVM compiler infrastructure and MLIR multi-level IR for ML - how they power PyTorch, JAX, TensorFlow, Triton, and IREE, with SSA form, optimization passes, dialect design, and practical code generation for ML workloads.

LM Studio and GUI Tools

LM Studio, Jan.ai, GPT4All, and Open WebUI for running LLMs locally - model discovery, hardware acceleration, local server mode, OpenAI-compatible APIs, and building a complete local AI development workspace.

LMQL and Guidance - Programmatic LLM Control

How Microsoft Guidance and LMQL extend structured generation to full programmatic control - interleaving generation with code, SQL-like constraints, token healing, and when each tool wins over Outlines and Instructor.

Load Balancing Across Providers

Distribute LLM traffic across multiple API keys and providers using round-robin, weighted, least-connections, and latency-based routing to scale throughput beyond single-key limits.

Load Balancing and Request Routing

Load balancing strategies for LLM serving - prefix-aware routing for KV cache reuse, least-connections for variable-cost requests, model routing, circuit breakers, and building a production gateway.

LoBoost: Fast Model-Native Local Conformal Prediction for Gradient-Boosted Trees

Gradient-boosted decision trees are among the strongest off-the-shelf predictors for tabular regression, but point predictions alone do not quantify unc...

Locally Confident, Globally Stuck: The Quality-Exploration Dilemma in Diffusion Language Models

Diffusion large language models (dLLMs) theoretically permit token decoding in arbitrary order, a flexibility that could enable richer exploration of re...

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into m...

Log-Ratio Propagation on the Simplex: A Theory of Cellwise Contamination for Compositional Data

Compositional data must be analysed through log-ratios: scale invariance, the defining axiom of the field, leaves no alternative. The centred log-ratio...

Logging for ML Systems

Structured logging for ML systems - prediction logging for delayed evaluation, structured JSON logs, audit logs for regulated models, log aggregation with Loki and Elasticsearch, and tracing individual prediction failures.

Logistic Regression Deep Dive

Master logistic regression from first principles - sigmoid derivation, log-likelihood to cross-entropy, decision boundary geometry, softmax multiclass, probability calibration with ECE, class imbalance handling, and full NumPy implementation.

Loki: An Open-Source Tool for Fact Verification.

Loki: An Open-Source Tool for Fact Verification. - published at COLING 2025.

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-t...

Long Context Pre-Training with Lighthouse Attention

Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory of scaled dot-product attention (SDPA). In thi...

Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance in complex reasoning task...

Long-Context Evaluation

Evaluating LLM long-context capability - the Needle in a Haystack test, RULER benchmark, lost-in-the-middle phenomenon, and measuring effective context utilization vs claimed context window size.

Long-Horizon-Terminal-Bench: Testing the Limits of Agents on Long-Horizon Terminal Tasks with Dense Reward-Based Grading

AI agents have become capable of autonomously completing short, well-specified tasks. However, existing terminal benchmarks largely focus on simple prob...

LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning

Reinforcement Learning (RL) has emerged as a critical driver for enhancing the reasoning capabilities of Large Language Models (LLMs). While recent adva...

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

Audio-visual generation is rapidly advancing from short clips to minute-long content, while existing evaluation protocols remain largely confined to sho...

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability t...

LongE2V: Long-Horizon Event-based Video Reconstruction, Prediction, and Frame Interpolation with Video Diffusion Models

Recovering high-quality video from sparse event streams is a challenging task. Regression methods often blur textures, while existing generative models...

LongStraw: Long-Context RL Beyond 2M Tokens under a Fixed GPU Budget

A growing gap separates inference context lengths from RL post-training: inference systems are approaching million-token contexts, while post-training w...

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive dist...

Loop the Loopies!

We present Loopie, the most powerful looped Transformer to date. The Loopie series consists of two Mixture-of-Experts (MoE) models: a 20B-parameter mode...

LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction

Scaling Transformer-based click-through rate (CTR) models by stacking more parameters brings growing computational and storage overhead, creating a wide...

LoRA for Efficient Fine-Tuning

LoRA and QLoRA: fine-tune 70B models on a single GPU by freezing the base model and training only small low-rank adapter matrices - the technique that democratized LLM customization.

LoRA Mathematics and Implementation

Learn how LoRA (Low-Rank Adaptation) decomposes weight updates into low-rank matrices, why this works mathematically, and how to implement it from scratch in PyTorch and with HuggingFace PEFT.

LoRA: Low-Rank Adaptation

Master LoRA - the parameter-efficient fine-tuning method that adds only 0.3% of parameters to GPT-3 while matching full fine-tuning quality, making LLM fine-tuning feasible on a single GPU.

Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)

Modern Large Language Models (LLMs) are often criticized for producing repetitive and homogeneous text, despite possessing vast latent vocabularies. Whi...

Lost in the Middle - How LLMs Use Long Contexts

The empirical finding that LLMs reliably recall information at the beginning and end of long contexts but miss information in the middle, and strategies to mitigate this U-shaped performance degradation.

Low-degree Lower bounds for clustering in moderate dimension

We study the fundamental problem of clustering $n$ points into $K$ groups drawn from a mixture of isotropic Gaussians in $\mathbb{R}^d$. Specifically, w...

Low-Latency Feature Serving

Redis, Cassandra, and in-memory stores for sub-millisecond feature retrieval.

Low-Latency Inference Patterns

Engineering ML predictions under 10ms p99 - hardware choices, model optimization, batching strategies, pre-computation, memory layout, and real production targets.

Low-Latency Optimization

Engineering for ultra-low latency inference - NUMA awareness, CPU affinity, memory pre-allocation, lock-free data structures, cache line optimization, zero-copy inference, CUDA streams, and kernel profiling.

Low-Rank Compression of Pretrained Models via Randomized Subspace Iteration

The massive scale of pretrained models has made efficient compression essential for practical deployment. Low-rank decomposition based on the singular v...

Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

Recently, scaling reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs) has emerged as an effective training paradigm f...

Low-Resource Guidance for Controllable Latent Audio Diffusion

Generative audio requires fine-grained controllable outputs, yet most existing methods require model retraining on specific controls or inference-time c...

LPM 1.0: Video-based Character Performance Model

Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Lear...

LSTM and GRU Deep Dive

Master Long Short-Term Memory and Gated Recurrent Units - the architectures that solved vanishing gradients and powered a decade of sequence modeling breakthroughs.

Lyra 2.0: Explorable Generative 3D Worlds

Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, t...

M-CaStLe: Uncovering Local Causal Structures in Multivariate Space-Time Gridded Data

Causal graph discovery for space-time systems is challenging in high-dimensional gridded data, which often has many more grid cells than temporal observ...

MAAT: Multi-phase Adapter-Aware Targeted Unlearning

Machine unlearning evaluation is structurally skewed: Why-type questions, which probe causal and relational knowledge, comprise less than 0.06% of Count...

Macaron-A2UI: A Model for Generative UI in Personal Agents

As personal agents evolve to handle complex, user-centric tasks, static plain-text chat is rapidly becoming a bottleneck. Generative UI emerges as the n...

Machine Learning for Health (ML4H) 2024

Machine Learning for Health (ML4H) 2024 - published at ML4H@NeurIPS 2024.

Machine Learning for Health, ML4H@NeurIPS 2024, Vancouver, Canada, 15-16 December 2024

Machine Learning for Health, ML4H@NeurIPS 2024, Vancouver, Canada, 15-16 December 2024 - published at ML4H@NeurIPS 2024.

MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events.

MADE: A Living Benchmark for Multi-Label Text Classi... - published at ACL 2026.

MAGIC: Transition-Aware Generation of Navigable Multi-Scene Game Worlds with Large Language Models

Multi-scene navigation (clearing an objective in one bounded space and then crossing a portal into the next) is a defining feature of contemporary 3D ga...

Make Your LVLM KV Cache More Lightweight

Key-Value (KV) cache has become a de facto component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency...

Mamba - Selective State Space Models

How Mamba's input-dependent SSM parameters, hardware-aware parallel scan, and selective gating mechanism achieved linear-time sequence modeling competitive with transformers.

Mamba vs Transformer - When Each Wins

A rigorous benchmark comparison: perplexity, throughput, recall tasks, in-context learning, and the fundamental trade-off between compressed state and full context access.

ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation

In recent times, large datasets hinder efficient model training while also containing redundant concepts. Dataset distillation aims to synthesize compac...

map, filter, reduce - Lazy Iteration and the Pipeline Model

Understand Python's map, filter, and reduce at engineering depth - lazy iterators, pipeline composition, functools.reduce and left-fold semantics, performance trade-offs, and when to prefer list comprehensions.

Mapping the Phase Diagram of the Vicsek Model with Machine Learning

In this study, we use machine learning to classify and interpolate the phase structure of the Vicsek flocking model across the three-dimensional paramet...

MARBLE: Multi-Aspect Reward Balance for Diffusion RL

Reinforcement learning fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, assessing images is i...

MARCO: Navigating the Unseen Space of Semantic Correspondence

Recent advances in semantic correspondence rely on dual-encoder architectures, combining DINOv2 with diffusion backbones. While accurate, these billion-...

MARS: Enabling Autoregressive Models Multi-Token Generation

Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We int...

Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness

Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether larg...

Masked Language Modeling and BERT

Understand how BERT learns bidirectional language representations using masked language modeling, its architecture, and how to fine-tune it for downstream tasks.

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in si...

Matrix Factorization - Discovering Hidden Taste Dimensions

Master matrix factorization for recommendations - SVD, Funk SVD, SGD and ALS optimization, biases, regularization, and implicit feedback with BPR. The algorithm that won the Netflix Prize.

Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing...

Matryoshka Representation Learning (MRL)

Nested embeddings where any prefix of dimensions is informative - training MRL, adaptive retrieval, 10x FLOP reduction, and how OpenAI's text-embedding-3 uses MRL internally.

Maximum Likelihood Estimation

Understand MLE from first principles - derive OLS from Gaussian noise, cross-entropy from Bernoulli, Fisher information, Cramér-Rao bound, and the deep connection between MLE and empirical risk minimization.

McMining: Automated Discovery of Misconceptions in Student Code.

McMining: Automated Discovery of Misconceptions in S... - published at EACL 2026.

MCP Architecture - Client-Server

Deep dive into MCP's client-server architecture - Host, Client, and Server roles; stdio and HTTP+SSE transport layers; JSON-RPC 2.0 message format; initialization handshake; capability negotiation; and full lifecycle.

MCP Ecosystem and Servers

The growing MCP ecosystem - official Anthropic servers, community landscape, MCP registries, evaluating third-party servers, IDE integrations, and patterns for building ecosystem vs. team-specific servers.

MCP Security and Permissions

Security model of the Model Context Protocol - attack surfaces including tool poisoning, resource injection, and confused deputy attacks, plus permission scoping, transport security, and a production security checklist.

MCP Tools, Resources, and Prompts

Deep dive into MCP's three primitives - Tools (callable functions), Resources (readable data), and Prompts (reusable templates) - with complete Python implementations of each.

MCP vs Function Calling

Deep architectural comparison of MCP and function calling - where each operates, when to use each, the decision matrix, hybrid patterns, and how to migrate from function calling to MCP.

MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models.

MCPEval: Automatic MCP-based Deep Evaluation for AI... - published at EMNLP 2025.

MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

Linear Attention (LA) offers a promising paradigm for scaling large language models (LLMs) to long sequences by avoiding the quadratic complexity of sel...

MDP and the RL Framework

Master Markov Decision Processes - the mathematical foundation of all reinforcement learning. Understand states, actions, rewards, value functions, the Bellman equations, and how real-world systems are modeled as MDPs.

Mean Estimation from Coarse Data: Characterizations and Efficient Algorithms

Coarse data arise when learners observe only partial information about samples; namely, a set containing the sample rather than its exact value. This oc...

MeanFlow Meets Control: Scaling Sampled-Data Control for Swarms

Steering large-scale swarms in only a few control updates is challenging because real systems operate in sampled-data form: control inputs are updated i...

MeanFlowNFT: Bringing Forward-Process RL to Average-Velocity Generators

MeanFlow generators achieve fast few-step sampling by predicting average velocities over time intervals, making them attractive for efficient generation...

Measuring AI Product Quality

Build a production-grade quality measurement system for AI products using explicit feedback, implicit behavioral signals, LLM-as-judge, and composite scoring.

Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (e.g., DeepSeek-R1 acknowledges hints 39% of the time), implying tha...

Measuring HITL Effectiveness

End-to-end metrics for human-in-the-loop systems - false positive/negative rates, confidence calibration, inter-rater reliability, reviewer performance tracking, ROI computation, and system-level effectiveness dashboards.

MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific con...

MedGemma 1.5 Technical Report

We introduce MedGemma 1.5 4B, the latest model in the MedGemma collection. MedGemma 1.5 expands on MedGemma 1 by integrating additional capabilities: hi...

Medical Imaging AI

Deep learning for radiology and pathology - CNN architectures, DICOM pipelines, transfer learning from ImageNet to medical domains, and clinical deployment considerations including FDA clearance.

MedPMC: A Systematic Framework for Scaling High-Fidelity Medical Multimodal Data for Foundation Models

Medicine is inherently multimodal, requiring clinicians to synthesize information across diverse data streams. Yet the development of multimodal foundat...

MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safe...

MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping

In this paper, we introduce MegaStyle, a novel and scalable data curation pipeline that constructs an intra-style consistent, inter-style diverse and hi...

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike tr...

Mellum2 Technical Report

We present Mellum 2, an open-weight 12B-parameter Mixture-of-Experts (MoE) language model with 2.5B active parameters per token. Mellum 2 is a general-p...

Membership Inference

Determining whether specific data was used in model training - privacy risks, attack techniques, and defenses for production ML systems.

Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

Large language model (LLM) agents are fundamentally bottlenecked by finite context windows on long-horizon tasks. As trajectories grow, retaining tool o...

MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later rea...

MemForest: An Efficient Agent Memory System with Hierarchical Temporal Indexing

Memory is a fundamental component for enabling long-context LLM agents, supporting persistent state across interactions through a continuous serve-and-u...

MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capabili...

Memory Allocators for ML

How glibc malloc, jemalloc, tcmalloc, and PyTorch's CUDA caching allocator work - with production techniques for eliminating memory fragmentation in ML training and serving.

Memory Bandwidth Roofline Analysis

Learn to apply the Roofline model to diagnose whether GPU kernels are memory-bound or compute-bound, calculate arithmetic intensity, and use roofline plots to guide real optimization decisions.

Memory by Design: Probabilistic Sequence Layers

We introduce the design-model framework: a way to derive efficient recurrent sequence maps from explicit assumptions about memory. A design model writes...

Memory Caching: RNNs with Growing Memory

Transformers have been established as the de-facto backbones for most recent advances in sequence modeling, mainly due to their growing memory capacity...

Memory Capacity Planning for LLMs

How to compute exact GPU memory requirements for LLM training and inference - model weights, optimizer states, activations, KV cache - and how to plan GPU cluster configurations for target models.

Memory Coalescing and Bank Conflicts

Master the two most impactful memory access patterns in CUDA - global memory coalescing and shared memory bank conflicts. Understand why identical computation with transposed access can be 8x slower, and how to fix both problems with layout changes and padding.

Memory Compression and Summarization

How to keep agents functional across days-long tasks by compressing memory intelligently - preserving what matters, discarding what does not.

Memory Hierarchy and Cache Design

Learn how CPU cache hierarchy works - L1/L2/L3 structure, associativity, eviction policies, MESI coherence, NUMA topology, and how to write cache-friendly code that runs 10x to 100x faster for ML workloads.

Memory Hierarchy in GPUs

Registers, L1/L2 cache, shared memory, and HBM - GPU memory hierarchy latency numbers, bandwidth characteristics, and how to write code that uses each level effectively.

Memory Intelligence Agent

Deep research agents (DRAs) integrate LLM reasoning with external tools. Memory systems enable DRAs to leverage historical experiences, which are essent...

Memory Models and Concurrency

Hardware memory models, memory barriers, atomic operations, lock-free data structures, and how memory ordering affects concurrent ML data pipelines and distributed training implementations.

Memory Optimization - Fitting More in Less

Reduce Python memory usage with __slots__, weakref, array module, struct.pack, memory-mapped files, object pooling, and the flyweight pattern for processing millions of records.

Memory Profiling - tracemalloc, memory_profiler, objgraph, and pympler

Profile and debug Python memory usage at engineering depth - sys.getsizeof shallow vs deep size, tracemalloc snapshots and leak detection, memory_profiler line-by-line analysis, objgraph retention paths, pympler recursive sizing, and practical workflows for diagnosing real-world memory leaks.

Memory Profiling and Debugging

A systematic toolkit for finding and fixing memory leaks in Python ML systems - from tracemalloc snapshots to GPU memory debugging, DataLoader leaks, and long-running service monitoring.

Memory Safety and Rust

Understand memory safety bugs in C/C++, how Rust's ownership model eliminates them at compile time, and why Rust is becoming the language of choice for high-performance ML infrastructure components.

Memory Systems: Short-Term and Long-Term

Designing memory systems for LLM agents - from in-context working memory to episodic retrieval, semantic knowledge bases, and procedural memory.

Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents

Memory-based self-evolution has emerged as a promising paradigm for coding agents. However, existing approaches typically restrict memory utilization to...

MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to...

MentalThink: Shaping Thoughts in Mental SVG World

We introduce MentalThink, a visual-symbolic reasoning paradigm that equips Multimodal LLMs (MLLMs) with an executable mechanism for 'mental' visualizati...

MergeKit - The Practical Toolkit

How to use arcee-ai/mergekit to merge language models with YAML configuration, CPU-compatible layer-by-layer processing, and automated HuggingFace Hub upload.

Merging and Model Soup Techniques

Combining multiple fine-tuned models without retraining - LoRA adapter merging, SLERP, TIES-merging, DARE, and MergeKit for production model merging that unlocks capabilities no single training run achieves.

Meritocratic Fairness in Budgeted Combinatorial Multi-armed Bandits via Shapley Values

We propose a new framework for meritocratic fairness in budgeted combinatorial multi-armed bandits with full-bandit feedback (BCMAB-FBF). Unlike semi-ba...

MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

Motivated by the underspecified, multi-hop nature of search queries and the multimodal, heterogeneous, and often conflicting nature of real-world web re...

Message Passing Neural Networks

MPNN - the unified framework showing GCN, GraphSAGE, and GAT are special cases of a single message-passing paradigm with a fundamental 1-WL expressivity ceiling.

Message Queues and Kafka

Master Apache Kafka for ML data pipelines - topics, partitions, consumer groups, exactly-once semantics, real-time feature computation, prediction logging, and production patterns for ML platforms.

MET: Theory-Grounded and Culture-Aware Multilingual Moral Reasoning

Language models are increasingly used for moral decision-making across diverse linguistic and cultural contexts, yet existing work overlooks multilingua...

Meta-CoT: Enhancing Granularity and Generalization in Image Editing

Unified multi-modal understanding/generative models have shown improved image editing performance by incorporating fine-grained understanding into their...

Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding

Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, requiring methods that bridge neural repr...

Meta-Reasoning Improves Tool Use in Large Language Models.

Meta-Reasoning Improves Tool Use in Large Language M... - published at NAACL 2025.

Metaclasses - The Class of Classes

Understand type as the metaclass of all classes, the full class creation pipeline, __new__, __init__, __call__ on metaclasses, __prepare__, metaclass inheritance and conflicts, and real-world usage in Django, SQLAlchemy, and ABC.

Metacognition in LLMs: Foundations, Progress, and Opportunities

Metacognition is a foundational component of intelligence critical to effective learning, problem solving, decision-making, communication, and more. In...

Metadata Filtering with Vector Search

Master pre-filtering vs post-filtering, the ACORN algorithm for filtered HNSW, namespace sharding for multi-tenancy, payload index design, and performance impact of filters in vector databases.

Metaflow

Building scalable, reproducible ML workflows with Netflix's Metaflow - the flow-step model, cloud compute with @batch and @kubernetes, and Cards for documentation.

MetaphorVU: Towards Metaphorical Video Understanding

Metaphorical videos are prevalent across various real-world scenarios to convey complex ideas, and understanding them typically requires high-order cogn...

MetaView: Monocular Novel View Synthesis with Scale-Aware Implicit Geometry Priors

Current visual generation models are capable of producing high-quality content, yet they lack a coherent perception of the spatial structure. Existing g...

MiA-Signature: Approximating Global Activation for Long-Context Understanding

A growing body of work in cognitive science suggests that reportable conscious access is associated with global ignition over distributed memory systems...

Micro Language Models Enable Instant Responses

Edge devices such as smartwatches and smart glasses cannot continuously run even the smallest 100M-1B parameter language models due to power and compute...

Microservices for ML Systems

Learn when and how to decompose ML systems into microservices - covering feature services, model services, service mesh, gRPC, and circuit breakers.

Microservices vs Monolith - Making the Right Choice

Navigate the monolith-to-microservices spectrum with Python - bounded contexts, communication patterns, the modular monolith, and practical decision frameworks.

Middleware - Wrapping Every Request and Response

Master middleware at engineering depth - WSGI vs ASGI middleware, the onion model, request ID propagation, timing, structured logging, CORS, rate limiting with Redis, JWT authentication, and when to use middleware vs dependency injection.

Mimic Intent, Not Just Trajectories

While imitation learning (IL) has achieved impressive success in dexterous manipulation through generative modeling and pretraining, state-of-the-art ap...

Mind the Gap: Structure-Aware Consistency in Preference Learning

Preference learning has become the foundation of aligning Large Language Models (LLMs) with human intent. Popular methods, such as Direct Preference Opt...

Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Multimodal large language models (MLLMs) have achieved impressive progress on vision language benchmarks, yet their capacity for visual cognitive and vi...

MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

Current document parsing methods compete primarily on model architecture innovation, while systematic engineering of training data remains underexplored...

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

Recent progress in multimodal large language models (MLLMs) has brought AI capabilities from static offline data processing to real-time streaming inter...

Minimax Generalized Cross-Entropy

Loss functions play a central role in supervised classification. Cross-entropy (CE) is widely used, whereas the mean absolute error (MAE) loss can offer...

MinShap: A Modified Shapley Value Approach for Feature Selection

Feature selection is a classical problem in statistics and machine learning, and it continues to remain an extremely challenging problem especially in t...

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive...

Mirror in the Model: Ad Banner Image Generation via Reflective Multi-LLM and Multi-modal Agents.

Mirror in the Model: Ad Banner Image Generation via... - published at EMNLP 2025.

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

DeepSeek Sparse Attention (DSA) sets the state of the art for fine-grained inference-time sparse attention by introducing a learned token-wise indexer t...

Mistral and Mixtral Architecture

Mistral 7B's sliding window attention and grouped query attention innovations, and Mixtral 8x7B's Mixture of Experts design - sparse routing, expert selection, and why MoE delivers 70B quality at 13B active parameter cost.

Mitigating Copy Bias in In-Context Learning through Neuron Pruning.

Mitigating Copy Bias in In-Context Learning through... - published at EACL 2026.

Mitigating Multimodal Hallucination via Phase-wise Self-reward

Large Vision-Language Models (LVLMs) still struggle with vision hallucination, where generated responses are inconsistent with the visual input. Existin...

Mixed Precision and Quantization Kernels

Learn how to write correct and fast kernels for FP16, BF16, FP8, INT8, and INT4 quantized models - including the pipeline mistakes that make INT8 slower than FP16.

MixFlow: Mixed Source Distributions Improve Rectified Flows

Diffusion models and their variations, such as rectified flows, generate diverse and high-quality images, but they are still hindered by slow iterative...

Mixtral 8x7B - Architecture Deep Dive

Mistral AI's Mixtral 8x7B architecture - 8 experts with top-2 routing, sliding window attention, multilingual training, performance vs. Llama 2 70B, and serving requirements.

Mixture of Experts Architecture

The architecture of sparse MoE models - how expert networks replace dense FFN layers, top-k routing, and how parameter count relates to active compute.

ML Cost Models

Learn to build a complete ML cost model - from compute and storage to hidden data transfer costs - so your team never gets blindsided by a $300K quarterly cloud bill.

ML Deployment Patterns - From Jupyter Notebook to Production at Scale

A comprehensive guide to ML deployment strategies, serving architectures, optimization techniques, and model registry practices for shipping models safely at scale.

ML Infrastructure Cost Model

Understanding what drives ML costs - building a cost-per-request model for your ML system from scratch, and computing unit economics the CTO will believe.

ML Pipeline Orchestration Concepts

Understand the fundamental concepts behind ML pipeline orchestration - DAGs, dependency management, idempotency, and why cron jobs are a silent disaster for production ML.

ML Platform Design

Designing an internal ML platform for a team of 50 data scientists - feature stores, experiment tracking, model registry, serving infrastructure, and platform adoption strategies.

ML Platform Design

Learn how to design internal ML platforms that enable data scientists and engineers to train, deploy, and monitor models efficiently - covering platform components, build vs buy, and real-world case studies.

ML ROI and Business Cases

Build iron-clad ROI cases for ML investments - from quantifying recommendation system value to attributing A/B test results to long-term business outcomes.

MLflow Deep Dive

Production MLflow setup for teams - tracking server architecture, autologging, custom logging, model registry, nested runs for HPO, and scaling to 500+ experiments per week.

MLflow Model Registry in Production

Learn how to use the MLflow Model Registry to manage model versions, stages, approval workflows, and webhooks for production ML teams.

MLOps Platform Architecture

Understand the MLOps maturity model from Level 0 to Level 3, design the components of a complete ML platform, and build a realistic 12-month roadmap from ad-hoc to automated.

MLOps vs DevOps

How MLOps extends DevOps principles to handle the unique challenges of data, model quality, and concept drift that traditional software CI/CD cannot address.

MLX for Apple Silicon

Apple's MLX framework for running and fine-tuning LLMs on M-series chips - unified memory architecture, lazy evaluation, mlx-lm for inference, LoRA fine-tuning, and benchmarking against llama.cpp.

MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge

Multimodal Large Language Models (MLLMs) have been increasingly used as automatic evaluators -- a paradigm known as MLLM-as-a-Judge. However, their reliabi...

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webp...

MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

We present MMCORE, a unified framework designed for multimodal image generation and editing. MMCORE leverages a pre-trained Vision-Language Model (VLM)...

MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorpora...

MMSkills: Towards Multimodal Skills for General Visual Agents

Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as te...

MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation

Multimodal large language models (MLLMs) have shown impressive capabilities, yet they often struggle to effectively capture the fine-grained textual inf...

Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization

Mobile GUI agents powered by Multimodal Large Language Models (MLLMs) can execute complex tasks on mobile devices. Despite this progress, most existing...

Mobile GUI Agents under Real-world Threats: Are We There Yet?

Recent years have witnessed a rapid development of mobile GUI agents powered by large language models (LLMs), which can autonomously execute diverse dev...

MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use, targeting interaction fidelity without repl...

MobileMoE: Scaling On-Device Mixture of Experts

Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales fo...

Mocking - Patch Where the Name Is Used, Not Where It Is Defined

Master Python mocking at engineering depth - the golden patching rule, Mock vs MagicMock, patch as decorator and context manager, autospec, side_effect, AsyncMock, pytest-mock, and the typo that silently passes your tests.

Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs

Multimodal LLMs can process speech and images, but they cannot hear a speaker's voice or see an object's texture. We show this is not a failure of encod...

Mode Seeking meets Mean Seeking for Fast Long Video Generation

Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form d...

Model Agreement via Anchoring

Numerous lines of work aim to control model disagreement -- the extent to which two machine learning models disagree in their predictions. We adop...

Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3

Majority voting over multiple LLM attempts improves mathematical reasoning, but correlated errors limit the effective sample size. A natural fix is to a...

Model Cards and Documentation

How to write, automate, and maintain model cards that document model capabilities, limitations, training data, fairness evaluations, and regulatory compliance.

Model Compilation and Optimization

Compiler-level optimizations for ML inference - TensorRT, torch.compile, ONNX export, kernel fusion, layer fusion, XLA, and profiling bottlenecks.

Model Efficiency Economics

Analyze the accuracy-cost Pareto frontier to determine when model improvements are economically justified - and how to build the business case for the current model being cost-optimal.

Model Evaluation Gates

Design automated model quality gates that block promotion when a model fails on demographic subgroups - not just on aggregate metrics.

Model Extraction

Querying a model API to reconstruct its weights, replicate its behavior, or steal proprietary training data through systematic probing.

Model Fallback and Retry

Design resilient LLM clients with configurable fallback chains, exponential backoff with jitter, and circuit breakers that handle provider failures gracefully without any user-facing impact.

Model Licensing and Compliance

Open-source model licenses are not all the same. Learn Apache 2.0, LLaMA Community, RAIL, and custom licenses - what you can and cannot do in production, and how to build a compliance workflow.

Model Monitoring Platform

Build production model monitoring infrastructure that catches data drift, prediction drift, and concept drift - detecting model degradation within 24 hours instead of two months.

Model Performance Monitoring

Monitoring model quality in production - the ground truth delay problem, proxy metrics, shadow evaluation, cohort-based monitoring, SLOs for model quality, and detecting degradation before it hurts the business.

Model Quantization for Production Inference

How quantization reduces model size and inference latency - from FP32 to INT8 to INT4 - covering PTQ, QAT, GPTQ, AWQ, and GGUF with accuracy tradeoffs.

Model Registry and Versioning

Design a model registry that enables 3-minute rollbacks, full model lineage, and controlled staging-to-production promotion - turning model lifecycle management from a manual process into a reliable system.

Model Registry Concepts

Understand what a model registry is, why it exists, and how it brings order to the chaos of managing ML models in production.

Model Rollback Strategies

Designing fast, reliable model rollback procedures for when production models degrade - covering registry-based rollback, infrastructure rollback, and automated rollback controllers.

Model Selection and Parameter Estimation of Multi-dimensional Gaussian Mixture Model

In this paper, we study the problem of learning multi-dimensional Gaussian Mixture Models (GMMs), with a specific focus on model order selection and eff...

Model Selection Strategy - Choosing the Right Model for the Right Problem

A systematic framework for selecting model families, managing complexity budgets, tuning hyperparameters, and knowing when AutoML helps versus hurts.

Model Staging and Promotion

How to safely gate model promotion through staging, production, and archiving with automated checks and human approval workflows.

Model Versioning and Canary Releases

Managing model versions in production LLM serving - semantic versioning for models, canary deployments, A/B testing, shadow mode evaluation, rollback procedures, and blue-green model deployments.

Model Versioning Strategies

Design versioning schemes for ML models that support safe rollbacks, A/B testing, champion/challenger management, and backward compatibility.

Model-Agnostic Signal Discovery with Machine Learning: Bridging the Gap Between Theory and Practice

Searches for new phenomena in complex scientific data are predominantly model-dependent, optimized for specific hypotheses, and therefore limited in the...

Modeling Multiple Support Strategies within a Single Turn for Emotional Support Conversations

Emotional Support Conversation (ESC) aims to assist individuals experiencing distress by generating empathetic and supportive dialogue. While prior work...

Modeling Sparse and Bursty Vulnerability Sightings: Forecasting Under Data Constraints

Understanding and anticipating vulnerability-related activity is a major challenge in cyber threat intelligence. This work investigates whether vulnerab...

Models That Know How Evaluations Are Designed Score Safer

The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-...

Modern Alignment Techniques

Survey the post-RLHF alignment landscape - RLAIF, Constitutional AI, rejection sampling fine-tuning, iterative DPO, process reward models, and the open questions shaping the next generation of aligned models.

MoDora: Tree-Based Semi-Structured Document Analysis System

Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irre...

Module 01: Agentic Foundations

Master the foundational concepts of AI agents - what they are, how they reason, how they act, and when to use them.

Module 01: LLMOps - Overview

An overview of LLMOps - the engineering discipline for building, shipping, and operating production LLM applications reliably and at scale.

Module 01: Systems Foundations

Master the foundational principles of AI system design - from requirements gathering to distributed systems theory applied to machine learning.

Module 01: Transformer Architecture

A complete guide to the transformer architecture - the foundation of every modern large language model.

Module 02 - Functional Programming Overview

Master Python's functional programming model at engineering depth - lambdas, map/filter/reduce, generators, iterators, decorators, closures, pure functions, immutability, functools, and partial application and currying.

Module 02: AI Observability - Overview

An overview of AI observability - tracing, quality metrics, feedback collection, and alerting for production LLM applications.

Module 02: Experiment Tracking

Systematic tracking of ML experiments - hyperparameters, metrics, artifacts, and models - so your team can reproduce results, compare runs, and ship better models faster.

Module 03 - Python Internals Overview

Understand CPython's implementation details at engineering depth - bytecode, the eval loop, the GIL, reference counting, garbage collection, memory profiling, sys/inspect, and the import system.

Module 03: Computer Use Agents

How AI agents see, understand, and interact with graphical interfaces - browsers, desktops, and GUIs - using vision models and action executors.

Module 03: Data Versioning

Versioning datasets as first-class artifacts - DVC, Delta Lake, dataset lineage, data contracts, and managing ML datasets at scale.

Module 03: LLM Gateways

Learn how to build and operate a production LLM gateway - the unified infrastructure layer for routing, caching, cost control, and observability across every AI service your team runs.

Module 03: Model Serving

Production patterns for serving ML model predictions - from protocol choice and batching to quantization, compilation, caching, and autoscaling.

Module 03: Prompt Engineering

Master the art and science of communicating with large language models - from basic zero-shot instructions to automated prompt optimization with DSPy.

Module 04 - Testing and Quality Overview

Build production-grade test suites at engineering depth - unittest, pytest, mocking, TDD, code coverage, linting, and pre-commit hooks that enforce quality at every commit.

Module 04: Coding Agents

Coding agents are the most commercially successful form of agentic AI. Learn how GitHub Copilot, Cursor, Devin, and Claude Code work under the hood.

Module 04: RAG Systems

Master Retrieval-Augmented Generation - the dominant pattern for grounding LLMs in external knowledge at production scale.

Module 04: Real-Time ML Systems

Architecture patterns for real-time machine learning - from sub-10ms inference at scale to online learning, streaming inference pipelines, and ultra-low-latency optimization.

Module 04: Synthetic Data

Learn to generate, filter, and use synthetic training data at scale - from Self-Instruct bootstrapping to Evol-Instruct complexity evolution, distillation datasets, and RAG evaluation corpora.

Module 05: CI/CD for ML

Build CI/CD pipelines that catch ML-specific failures - not just broken code, but broken models.

Module 05: Long-Horizon Planning

How agents decompose complex multi-step tasks, plan across long horizons, recover from failures, and know when to ask for help.

Module 05: ML Architecture Patterns

A deep dive into the architectural patterns that power production ML systems - from Lambda/Kappa to multi-tenant platforms.

Module 06 - APIs and Web Basics

Master HTTP at the wire level, REST design principles, Flask, FastAPI, request/response lifecycle, middleware, JSON serialization, and Pydantic validation - the complete engineering foundation for building production web APIs in Python.

Module 06 - Security Engineering

Master security engineering in Python - cryptographic hashing, JWT authentication, OAuth 2.0, input validation, SQL injection prevention, secrets management, and secure coding patterns that protect production systems from real-world attacks.

Module 06: Agent Memory

How agents store, retrieve, and manage knowledge across interactions - working memory, episodic memory, semantic memory, procedural memory, and cross-session persistence.

Module 06: Case Studies

Real-world end-to-end case studies of production ML systems - recommendation, search, fraud, content moderation, ad click prediction, and LLM-powered products.

Module 06: Containerization

Master Docker and containers for ML - from Dockerfiles to GPU containers, image optimization, and Docker Compose for reproducible ML development environments.

Module 06: LLM Evaluation

A complete guide to evaluating large language models - from perplexity to production monitoring.

Module 07: LLM Inference & Optimization

Master the systems and techniques that make large language model inference fast, efficient, and cost-effective at production scale.

Module 07: Multi-Agent Systems

Orchestration, communication, parallelism, and real frameworks - from first principles to production multi-agent systems.

Module 07: Production AI Patterns

Battle-tested engineering patterns for deploying LLM applications at scale - context management, streaming, async calls, batching, retries, cost optimization, multi-tenancy, and AI product architecture.

Module 08: Agent Evaluation

Evaluation is the most underrated problem in agentic AI. Without it, you cannot improve, catch regressions, or build trust. This module covers trajectory scoring, benchmarks, LLM-as-judge, human evaluation, and production monitoring.

Module 08: AI Product Engineering

Design, build, and ship AI-powered products that users trust - streaming UX, latency management, error handling, rollout strategies, personalization, and quality measurement.

Module 08: Multimodal Models

Understanding how modern AI systems process images, audio, and text together - from CLIP to diffusion to production pipelines.

Module 09: Agent Safety

Risk taxonomy, minimal footprint, prompt injection defense, guardrails, human oversight, sandboxing, and responsible deployment.

Module 09: Human-in-the-Loop

Master human-in-the-loop AI systems - annotation pipelines, active learning, feedback collection, escalation patterns, and measuring HITL effectiveness.

Module 09: LLM System Design

Production architecture for AI-powered products - from prototype to reliable, scalable, cost-efficient systems.

Module 1 - MLOps Foundations

Understand what MLOps is, why it exists, and how to think about operationalizing machine learning systems in production.

Module 1: Computer Architecture for ML Engineers

CPU architecture, memory hierarchy, SIMD vectorization, NUMA, and hardware performance analysis - understanding the machine your ML code runs on.

Module 1: GPU Architecture

How GPUs work at the silicon level - streaming multiprocessors, tensor cores, memory hierarchy, and the roofline model that explains every ML performance optimization.

Module 1: The Open Source LLM Ecosystem

The open source LLM landscape - Llama, Mistral, Qwen, Gemma, Phi, model families, model cards, and a framework for choosing the right model for your task.

Module 10 - AI Platform Engineering

Build the internal platform that lets data scientists ship models to production in days, not months - covering MLOps architecture, experiment tracking, CI/CD for ML, and Kubernetes-native ML infrastructure.

Module 10: Agent Frameworks

LangChain, LangGraph, LlamaIndex, CrewAI, AutoGen, raw API - an honest comparison with production lessons.

Module 10: Cloud ML Platforms

Master AWS SageMaker, Google Vertex AI, Azure ML, Databricks, and cloud cost optimization strategies for production ML systems.

Module 10: Reasoning Models

How modern LLMs learn to think - test-time compute, chain-of-thought, process reward models, and the architectures behind o1, o3, and DeepSeek-R1.

Module 11 - A/B Testing and Experimentation

Learn how to design, run, and analyze experiments for ML systems - from statistical foundations to production experimentation platforms.

Module 11: Mixture of Experts

How sparse MoE models achieve massive capacity at lower compute cost - routing mechanisms, load balancing, Mixtral, and DeepSeek's innovations.

Module 12 - LLMOps Pipelines

Operationalize LLM-based systems - prompt management, evaluation pipelines, observability, RAG operations, and fine-tuning infrastructure.

Module 12: State Space Models

A complete map of State Space Models - from the quadratic attention bottleneck to Mamba's selective recurrence, hybrid architectures, and production deployment.

Module 13 - Infrastructure as Code for ML

Master Infrastructure as Code for ML systems - Terraform, Pulumi, GitOps, secret management, and cost optimization through declarative infrastructure.

Module 13: Structured Generation

A complete map of structured generation - from the reliability problem with free-text LLM output to constrained decoding, Outlines, Instructor, JSON mode, and production-grade extraction pipelines.

Module 14 - Feature Engineering

Feature engineering as an MLOps discipline - from raw data to production-grade feature pipelines, stores, and monitoring.

Module 14 Overview - Model Merging

How to combine multiple fine-tuned language models into a single, more capable model without any additional training.

Module 15 - Cost Management for ML

Financial operations for ML systems - understanding costs, optimizing training and inference, cloud FinOps, build vs. buy analysis, and cost attribution.

Module 15 Overview - Long Context Strategies

How modern LLMs handle extremely long inputs - from the fundamental O(n²) attention problem to RoPE scaling, context compression, and production engineering for 128K+ context windows.

Module 16 - Alignment and Safety

A complete guide to AI alignment, RLHF, Constitutional AI, DPO, red teaming, jailbreaks, safety evaluations, and the global regulatory landscape.

Module 17 - Embeddings Engineering

A complete guide to embeddings - models, evaluation (MTEB), fine-tuning, Matryoshka embeddings, quantization, multimodal embeddings, and production pipelines.

Module 2 - Data Infrastructure

A complete map of the Data Infrastructure module covering data lakes, Spark, Kafka, feature stores, data quality, Delta Lake, and lakehouse architecture for ML.

Module 2: AI in Healthcare

Building ML systems under HIPAA constraints and FDA regulation - medical imaging, clinical NLP, drug discovery, and patient outcome prediction.

Module 2: CUDA Programming

Write GPU kernels from scratch - thread hierarchy, memory spaces, coalescing, warp divergence, and profiling with Nsight - the foundation for understanding every ML framework under the hood.

Module 2: Model Context Protocol

A module map of the Model Context Protocol - from core concepts through architecture, primitives, building servers, security, ecosystem, and comparison with function calling.

Module 2: Operating Systems for ML

Virtual memory, process scheduling, huge pages, memory-mapped files, and OS-level tuning - the operating system layer that determines whether your ML workload runs fast or fights the kernel.

Module 2: Running Models Locally

llama.cpp, Ollama, and LM Studio - run any open source model on your own hardware, understand memory requirements, and set up a local development environment.

Module 3: AI in Legal

Contract analysis, legal research automation, compliance monitoring, and document review at scale - building AI where hallucination is malpractice and every output needs a citation.

Module 3: Compilers and Runtimes for ML

How compilers work, JIT compilation, MLIR, XLA, torch.compile, and TensorRT - understanding the compilation stack that turns your Python model into fast machine code.

Module 3: Custom Silicon for AI

TPUs, Trainium, Groq LPU, Cerebras WSE, Intel Gaudi, and Apple Silicon - how each architecture differs from GPUs and what workloads each wins on.

Module 3: LoRA and QLoRA Fine-Tuning

Fine-tune any open source model on your data without owning a data center - LoRA theory, QLoRA 4-bit training, hyperparameter selection, and getting a specialized model into production.

Module 3: Stream Processing for Real-Time AI

Eight lessons covering Apache Kafka, Apache Flink, stream processing patterns, real-time feature computation, and production reliability for ML systems that cannot tolerate batch latency.

Module 4 - Model Registry and Lifecycle

Master the model registry - the system that brings order, traceability, and governance to every model your team ships to production.

Module 4: AI in Retail

Demand forecasting, personalization at scale, dynamic pricing, inventory optimization, and supply chain AI - the ML systems behind recommendations and prices.

Module 4: Kernel Optimization

FlashAttention, Triton, operator fusion, torch.compile, and XLA - making neural network operations faster by understanding what the hardware actually does with your compute.

Module 4: Memory Management for ML

Stack and heap allocation, Python memory model, GPU memory patterns, memory profiling, and zero-copy data transfer - debugging OOM errors and building memory-efficient pipelines.

Module 4: Quantization in Practice

GGUF, GPTQ, AWQ, and bitsandbytes - compress models to fit your hardware budget while understanding exactly what quality you are trading away and why.

Module 5: AI in Manufacturing

Predictive maintenance, computer vision for quality control, digital twins, and process optimization - deploying ML on the factory floor where downtime costs thousands per minute.

Module 5: Fine-Tuning Pipelines

Production fine-tuning with Axolotl - dataset formatting, multi-GPU training, DPO preference tuning, and managing adapter versions across model releases.

Module 5: LLM Agents - Overview

LLM agents as autonomous systems that reason, plan, and act using tools, memory, and multi-agent coordination.

Module 5: Memory Systems for AI

HBM, DRAM, cache hierarchies, KV cache management, PagedAttention, and quantization as memory compression - understanding memory is understanding why LLM inference costs what it costs.

Module 5: Networking for Distributed AI

TCP/IP fundamentals, RDMA, AllReduce algorithms, gRPC for model serving, and network bottlenecks in distributed training - the networking layer that determines whether your training job scales.

Module 6 - AI Security

Comprehensive coverage of AI security threats, attack vectors, and defenses for production AI systems.

Module 6: AI in EdTech

Adaptive learning systems, AI-powered assessment, knowledge tracing, and personalized tutoring - building educational AI that actually improves learning outcomes.

Module 6: Algorithms for ML Engineers

Algorithmic complexity in the context of ML - hash maps for embeddings, approximate nearest neighbor data structures, sampling at scale, and the algorithmic foundations of attention.

Module 6: Distributed Training Hardware

NVLink, InfiniBand, AllReduce algorithms, network topology, fault tolerance, and the hardware that makes training at thousands of GPUs possible.

Module 6: Evaluating Open Models

Build eval suites that give real signal - benchmark contamination, domain-specific evaluation, LLM-as-judge for open models, and regression testing after fine-tuning.

Module 7 - ML Pipeline Orchestration

Master the tools and patterns for orchestrating reliable, production-grade ML pipelines using Airflow, Prefect, Kubeflow, ZenML, and beyond.

Module 7 - Vector Database Engineering

Master vector similarity search, ANN algorithms, embedding pipelines, hybrid search, and production vector database deployment.

Module 7: Inference Hardware

Hardware selection for inference workloads - cost-per-token analysis, batching tradeoffs, edge hardware, speculative decoding implications, and building a complete inference stack.

Module 7: Production Deployment of Open Models

vLLM, Text Generation Inference, multi-adapter serving, autoscaling, and cost analysis - deploying open source models at production scale.

Module 7: Systems Programming for ML Engineers

C++ basics for ML engineers, Python C extensions, Cython, Pybind11, and writing custom PyTorch operators - bridging the gap between Python ML code and high-performance native implementations.

Module 8 - GPU and TPU Infrastructure

Master GPU architecture, memory management, distributed training, fault-tolerant clusters, TPU workloads, inference hardware, and cost optimization for ML infrastructure.

Module 8 - Kubernetes for ML

A complete guide to running machine learning workloads on Kubernetes, from fundamentals to GPU scheduling, training jobs, model serving, Helm, and multi-tenant clusters.

Module 9 - Cost & FinOps for AI

Master AI infrastructure economics - from cost modeling to FinOps culture - so you can build powerful systems without burning your budget.

Module 9 - Monitoring and Observability

Complete ML monitoring and observability - data drift detection, model performance monitoring, Prometheus/Grafana for ML, distributed tracing, alerting, and production monitoring tools like EvidentlyAI and NannyML.

MolmoAct2: Action Reasoning Models for Real-world Deployment

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter...

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

Web agents--autonomous systems that navigate and execute tasks on the web on behalf of users--have the potential to transform how people interact with t...

Moment Matters: Mean and Variance Causal Graph Discovery from Heteroscedastic Observational Data

Heteroscedasticity -- where the variance of a variable changes with other variables -- is pervasive in real data, and elucidating why it arises from the...

Monitoring and Debugging Fine-Tuning

How to monitor LLM fine-tuning runs and debug failures - tracking loss curves, gradient norms, GPU utilization, MFU, and diagnosing NaN loss, overfitting, and OOM errors in LoRA and full fine-tuning.

Monitoring LLM Services

Production observability for LLM serving systems - GPU metrics, TTFT, inter-token latency, vLLM Prometheus integration, distributed tracing, alerting, and Grafana dashboards.

Monitoring ML Serving in Production

Production monitoring for ML serving - inference latency histograms, GPU metrics, throughput monitoring, error rates, distributed tracing with OpenTelemetry, and drift detection.

MonkeyOCRv2: A Visual-Text Foundation Model for Document AI

Mainstream visual encoders are pretrained on natural images and cannot be effectively applied to document images without document-oriented adaptation, a...

Monte Carlo and Observability Platforms

Monte Carlo, Bigeye, and Soda - managed data observability.

Monte Carlo Tree Search for LLM Reasoning

Adapting MCTS to language model reasoning - selection, expansion, simulation, backpropagation over reasoning steps, AlphaCode 2, Tree-of-Thought, and production trade-offs.

MoRight: Motion Control Done Right

Generating motion-controlled videos--where user-specified actions drive physically plausible scene dynamics under freely chosen viewpoints--demands two...

MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation

MORPHOGEN: A Multilingual Benchmark for Evaluating G... - published at ACL 2026.

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally inco...

Motion-Aware Caching for Efficient Autoregressive Video Generation

Autoregressive video generation paradigms offer theoretical promise for long video synthesis, yet their practical deployment is hindered by the computat...

Motion4Motion: Motion Transfer Across Subjects at Inference

This work explores the motion transfer from one video to another, which is crucial in animation for diverse characters. Previously, video motion transfe...

MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

Recent Speech-to-Speech Translation (S2ST) systems achieve strong semantic accuracy yet consistently strip away non-verbal vocalizations (NVs), such as...

MpoxVLM: A Vision-Language Model for Diagnosing Skin Lesions from Mpox Virus Infection

In the aftermath of the COVID-19 pandemic and amid accelerating climate change, emerging infectious diseases, particularly those arising from zoonotic s...

MRO - Method Resolution Order and the C3 Linearisation Algorithm

Understand Python's Method Resolution Order at engineering depth - the diamond problem, C3 linearisation step by step, how super() traverses the MRO (not just "calls parent"), mixin patterns that depend on MRO, Django/Flask examples, and MRO failure cases.

MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale

Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content, an...

MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games

We present a scalable methodology for evaluating language models in multi-turn interactions, using a suite of collaborative games that require effective...

MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models

Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience compared...

MULSUM: A Multimodal Summarization System with Vis-Aligner and Diversity-Aware Image Selection.

MULSUM: A Multimodal Summarization System with Vis-A... - published at EACL 2026.

Multi-Agent Architectures

Building systems where multiple specialized LLM agents collaborate through orchestrator-worker, pipeline, and peer-to-peer patterns using LangGraph and CrewAI.

Multi-Agent LLMs Fail to Explore Each Other

Exploration is essential for reliable autonomy in multi-agent systems, yet it remains unclear whether large language model (LLM) agents can explore effe...

Multi-Armed Bandits

Use multi-armed bandit algorithms to adaptively allocate traffic during experiments - learning faster than A/B tests while reducing regret.

Multi-Cloud Data Strategies for AI Workloads

What multi-cloud data architectures do for AI systems, when vendor lock-in and data gravity risks threaten the portability of ML training and serving infrastructure, and how to design resilient multi-cloud strategies for production AI data pipelines.

Multi-GPU Training Architectures

Master data parallelism, tensor parallelism, pipeline parallelism, and 3D parallelism for large-scale model training - with communication volume math, PyTorch DDP vs FSDP, and Megatron-LM weight splitting strategies.

Multi-Head Attention

How multi-head attention enables transformers to jointly attend to information from multiple representation subspaces simultaneously.

Multi-Model Serving

How to serve hundreds of models efficiently - model multiplexing, ensembles in production, A/B testing infrastructure, shadow mode, canary deployments, and multi-tenant GPU resource isolation.

Multi-Model Serving Architecture

Serving multiple LLMs from shared infrastructure - model routing, MIG partitioning, dynamic loading, LiteLLM proxy, cost optimization through bin-packing, and autoscaling per model in production.

Multi-Task Learning Systems

How production ML systems share representations across multiple objectives simultaneously - covering hard vs soft parameter sharing, loss balancing, gradient conflicts, and negative transfer detection.

Multi-Task Pre-Finetuning of Lightweight Transformer Encoders for Text Classification and NER.

Multi-Task Pre-Finetuning of Lightweight Transformer... - published at EMNLP 2025.

Multi-Tenant AI Systems

Isolating context, costs, and data across tenants in multi-tenant AI products.

Multi-Tenant ML Platforms

Learn how to design ML platforms that safely serve multiple teams from shared GPU infrastructure - covering Kubernetes isolation, fair scheduling, data isolation, cost attribution, and quota management.

Multi-User Large Language Model Agents

Large language models (LLMs) and LLM-based agents are increasingly deployed as assistants in planning and decision making, yet most existing systems are...

Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation

High-fidelity 3D Gaussian head avatar generation is critical for applications such as AR/VR, telepresence, and digital humans. Existing methods depend o...

Multicore and NUMA Architecture

Learn how multicore CPUs and NUMA topology affect ML workload performance - cache coherence overhead, CPU affinity, NUMA-aware memory allocation, hyperthreading, and configuring PyTorch DataLoader for optimal hardware utilization.

Multilingual Self-Taught Faithfulness Evaluators.

Multilingual Self-Taught Faithfulness Evaluators. - published at EACL 2026.

Multimodal Embeddings

CLIP, SigLIP, ImageBind, ColPali, and CLAP - embedding images, text, audio, and documents in shared vector spaces for cross-modal search and zero-shot classification.

Multimodal Open Source Models

How open-source vision-language models work - from CLIP vision encoders and projection layers to LLaVA, InternVL2, and Llama 3.2 Vision - and how to deploy them for document understanding, OCR, and visual reasoning in production.

Multimodal RAG

How to build retrieval-augmented generation systems that can retrieve and reason over images, PDFs with figures, slides, and mixed-media documents.

Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs

Multimodal LLMs can accurately perceive numerical content across modalities yet fail to perform exact multi-digit multiplication when the identical unde...

MultiRef-Compass: Towards Comprehensive Evaluation of Multi-Reference-to-Audio-Video Generation

Multi-reference-to-audio-video (MR2AV) generation aims to generate coherent audio-video content conditioned on multiple references and textual instructi...

Multivariate Spatio-Temporal Neural Hawkes Processes

We propose a Multivariate Spatio-Temporal Neural Hawkes Process for modeling complex multivariate event data with spatio-temporal dynamics. The proposed...

MultiWorld: Scalable Multi-Agent Multi-View Video World Models

Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as...

MuScriptor: An Open Model for Multi-Instrument Music Transcription

Existing methods for automatic music transcription are often limited to single-instrument recordings or fail on complex, real music mixes. Although prev...

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated a...

MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

Audiovisual arts encompass diverse creative disciplines, including cinema, visual arts, stage performance, and game design, where artistic meaning arise...

MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy

Modern microscopy routinely produces gigapixel images that contain structures across multiple spatial scales, from fine cellular morphology to broader t...

MXNorm: Reusing MXFP block scales for efficient tensor normalisation

Matrix multiplication performance has long been the major bottleneck to scaling deep learning workloads, which has stimulated the design of new accelera...

Narrative Media Framing in Political Discourse.

Narrative Media Framing in Political Discourse. - published at ACL 2025.

Narrative-Driven Paper-to-Slide Generation via ArcDeck

We introduce ArcDeck, a multi-agent framework that formulates paper-to-slide generation as a structured narrative reconstruction task. Unlike existing m...

Native Audio-Visual Alignment for Generation

Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source...

Navigating the Mirage: A Dual-Path Agentic Framework for Robust Misleading Chart Question Answering

Despite the success of Vision-Language Models (VLMs), misleading charts remain a significant challenge due to their deceptive visual structures and dist...

NCCL and Collective Communication

Deep dive into NCCL internals - the five collective operations, ring-allreduce algorithm, tree-reduce for small tensors, algorithm selection heuristics, tuning environment variables, and diagnosing collective hangs in production GPU clusters.

Near-Future Policy Optimization

Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-polic...

Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the n...

Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mi...

Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning.

Nemotron-CrossThink: Scaling Self-Learning beyond Ma... - published at EACL 2026.

Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding

We introduce Nemotron-Labs-Diffusion, a tri-mode language model (LM) that unifies AR, diffusion, and self-speculation decoding within a single architect...

Network Debugging for Distributed Training

Master distributed training network debugging - NCCL error diagnosis, AllReduce communication patterns, bandwidth testing with iperf3 and nccl-tests, RDMA diagnostics, and profiler-based timeline analysis for PyTorch DDP.

Network Security for ML Platforms

Comprehensive network security for ML infrastructure - mTLS service authentication, Kubernetes network policies, eBPF with Cilium, secrets management with Vault, zero-trust networking, and ML-specific threats including model theft and prompt injection.

Neural Collaborative Filtering - Beyond the Dot Product

How deep learning revolutionized recommendations by replacing the linear dot product with learnable nonlinear interactions between users and items.

Neural Computers

We propose a new frontier: Neural Computers (NCs) -- an emerging machine form that unifies computation, memory, and I/O in a learned runtime state. Unli...

Neural Diffusion Intensity Models for Point Process Data

Cox processes model overdispersed point process data via a latent stochastic intensity, but both nonparametric estimation of the intensity model and pos...

Neural Operators Can Discover Functional Clusters

Operator learning is reshaping scientific computing by amortizing inference across infinite families of problems. While neural operators (NOs) are incre...

Neuro-Symbolic ODE Discovery with Latent Grammar Flow

Understanding natural and engineered systems often relies on symbolic formulations, such as differential equations, which provide interpretability and t...

NeuroCogMap Reveals Cognitive Organization of Large Language Models

Understanding how complex cognitive functions are organized within artificial systems is central to interpreting large language models (LLMs) and relati...

NeuROK: Generative 4D Neural Object Kinematics

Data-driven approaches have revolutionized 3D vision, enabling transformers to effectively reconstruct and generate static 3D objects. However, generati...

NLP for Educational Content

Learn readability scoring, educational NER, automatic summarization, curriculum alignment, concept map generation, and question difficulty estimation for educational NLP pipelines.

NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches

We introduce NOBLE (Nonlinear lOw-rank Branch for Linear Enhancement), an architectural augmentation that adds nonlinear low-rank branches to transforme...

Non-Asymptotic Convergence of Stochastic Iterative Algorithms: A Lyapunov Framework

We survey Lyapunov-based techniques for the finite-time analysis of stochastic iterative algorithms, also known as stochastic approximation (SA) algorit...

Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration

Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabil...

NonZero: Interaction-Guided Exploration for Multi-Agent Monte Carlo Tree Search

Monte Carlo Tree Search (MCTS) scales poorly in cooperative multi-agent domains because expansion must consider an exponentially large set of joint acti...

NormAL LoRA: What is the perfect size?

NormAL LoRA: What is the perfect size? - published at EMNLP 2025.

Normalizing Trajectory Models

Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a...

Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models

Recent advances in masked diffusion language models (MDLMs) narrow the quality gap to autoregressive LMs, but their sampling remains expensive because g...

Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

On-policy distillation (OPD) trains a student on its own rollouts with token-level teacher supervision. Recent selective OPD methods exploit the non-uni...

NSF-SciFy: Mining the NSF Awards Database for Scientific Claims

We introduce NSF-SciFy, a comprehensive dataset of scientific claims and investigation proposals extracted from National Science Foundation award abstra...

Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language mode...

Numerical and Categorical Features

Systematic feature engineering for tabular data - transformations, encoding, imputation, and selection that lifted AUC from 0.71 to 0.84.

NumPy for ML

Master NumPy for machine learning - broadcasting, vectorization, linear algebra, memory layout, einsum, and the performance patterns every ML engineer needs.

OAuth 2.0 and OIDC

Implement OAuth 2.0 authorization code flow with PKCE, OpenID Connect ID tokens, Keycloak integration, and delegated authorization in FastAPI with authlib.

Object Detection: YOLO and R-CNN

Two-stage and one-stage object detection architectures - from sliding windows and R-CNN to Faster R-CNN, YOLOv8, FPN, anchor boxes, NMS, IoU, and mAP - with full PyTorch implementations.

Observability and Logging

Observability for ML systems - structured logging with structlog, distributed tracing with OpenTelemetry, Prometheus metrics for inference servers, Grafana dashboards, ML-specific alerting, and production profiling.

Observability for LLM Apps

Build production observability for LLM applications - distributed tracing, quality metrics, cost attribution, prompt versioning, and drift detection using LangSmith, Langfuse, and Helicone.

Observable Performance Does Not Fully Reflect System Organization: A Multi-Level Analysis of Gait Dynamics Under Occlusal Constraint

In biomechanical systems, observable performance is often used as a proxy for underlying system organization. However, this assumption implicitly presum...

Observationally Informed Adaptive Causal Experimental Design

Randomized Controlled Trials (RCTs) represent the gold standard for causal inference yet remain a scarce resource. While large-scale observational data...

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety...

Occupancy and Thread Block Tuning

How GPU occupancy works, what limits it, and how to tune thread block size and register usage to maximize SM utilization without falling into the 100% occupancy trap.

OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

The vast and underexplored ocean plays a critical role in regulating global climate and supporting marine biodiversity, yet artificial intelligence has...

Odysseus Navigates the Sirens' Song: Dynamic Focus Decoding for Factual and Diverse Open-Ended Text Generation.

Odysseus Navigates the Sirens' Song: Dynamic Fo... - published at ACL 2025.

Offline vs Online Evaluation - Why Your AUC Goes Up But Revenue Goes Down

A deep dive into offline and online evaluation strategies, A/B testing fundamentals, sample size calculation, interleaving, and the root causes of the offline-online metric gap.

Offline vs. Online Evaluation

Design an evaluation strategy that bridges static datasets and production signals - A/B testing, shadow evaluation, implicit signals, and the evaluation flywheel.

Ollama and Local Model Management

Ollama - Docker-like CLI for running and managing local LLMs. Modelfile format, REST API, OpenAI-compatible endpoints, Python integration, and building a complete local AI stack.

OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

Cross-embodiment video generation aims to transfer motions across different humanoid embodiments, such as human-to-robot and robot-to-robot, enabling sc...

OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative...

OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

Real-world information needs require access to structurally diverse knowledge sources, from unstructured text and relational tables to knowledge graphs...

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form...

OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditio...

OmniTacTune: Policy-Agnostic Real-World RL for Tactile Residual Adaptation of Visual Policies

Visual policies learned from human videos, teleoperation, and robot demonstrations offer scalable motion priors, but often fail in contact-rich manipula...

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling genera...

On Locality and Length Generalization in Visual Reasoning

A striking feature of the human visual system is that it ingests visual information through a series of local foveated glimpses, rather than a single gl...

On public and private binary classification with metric space valued predictors

We consider the problem of binary classification in a framework where the predictor $X$ takes values in an arbitrary separable metric space $\mathcal X$...

On Semiotic-Grounded Interpretive Evaluation of Generative Art

Interpretation is essential to deciphering the language of art: audiences communicate with artists by recovering meaning from visual artifacts. However,...

On the Global Photometric Alignment for Low-Level Vision

Supervised low-level vision models rely on pixel-wise losses against paired references, yet paired training sets exhibit per-pair photometric inconsiste...

On the Relationship Between Activation Outliers and Feature Death in Sparse Autoencoders

Sparse autoencoders (SAEs) decompose neural network activations into interpretable features, but many learned features never activate, a problem called...

On the Reliability of Computer Use Agents

Computer-use agents have rapidly improved on real-world tasks such as web navigation, desktop automation, and software interaction, in some cases surpas...

On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

Decoder-only large language models (LLMs) are increasingly replacing BERT-style architectures as the backbone for dense retrieval, achieving substantial...

On the Step Length Confounding in LLM Reasoning Data Selection

Large reasoning models have recently demonstrated strong performance on complex tasks that require long chain-of-thought reasoning, through supervised f...

On-Policy Adversarial Flow Distillation for Autoregressive Video Generation

Autoregressive video generators are attractive for streaming, long-horizon, and interactive applications, but distilling strong black-box teachers into...

On-Policy Delta Distillation

On-policy distillation is an alternative post-training method in reinforcement learning that alleviates the constraints imposed by reward models by prov...

One Click per Cell Type Suffices: Training-free Group Interaction for Cell Instance Segmentation

Cell instance segmentation models trained on cell-specific datasets suffer severe performance drops on out-of-distribution cell types, while interactive...

One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers.

One Tokenizer To Rule Them All: Emergent Language Pl... - published at ACL 2026.

One-Shot Generative Flows: Existence and Obstructions

We study dynamic measure transport for generative modelling in the setting of a stochastic process $X_\bullet$ whose marginals interpolate between a sou...

ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration

Recent advances in Video Foundation Models (VFMs) have revolutionized human-centric video synthesis, yet fine-grained and independent editing of subject...

OneHOI: Unifying Human-Object Interaction Generation and Editing

Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as <person, action, object> triplets. E...

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature im...

Online Controlled Experiments

Design valid ML experiments by choosing the right randomization unit, handling network effects, detecting novelty, and managing holdout sets.

Online Feature Computation for Model Serving

How to compute ML features at request time without blowing your latency budget - caching strategies, vectorized computation, and production patterns.

Online Learning

Continuous learning in production - online learning vs mini-batch, concept drift adaptation, Vowpal Wabbit, streaming gradient descent, bandit algorithms, and preventing catastrophic forgetting.

Online Quantile Regression for Nonparametric Additive Models

This paper introduces a projected functional gradient descent algorithm (P-FGD) for training nonparametric additive quantile regression models in online...

Online vs Offline Features

The fundamental split between pre-computed offline and real-time online features.

Open LLM Leaderboard and Benchmarks

Understanding the HuggingFace Open LLM Leaderboard, what each benchmark actually measures, how contamination distorts scores, and how to use leaderboard numbers to make real deployment decisions.

Open Political Corpora: Structuring, Searching, and Analyzing Political Text Collections with PoliCorp.

Open Political Corpora: Structuring, Searching, and... - published at EMNLP 2025.

OpenAI Embeddings and API-Based Embedding Services

text-embedding-3, Matryoshka training, Voyage AI, Cohere Embed, cost analysis, batch processing patterns, and when to choose API vs self-hosted embeddings.

OpenAI o1 and o3 - Architecture and Training

What we know about OpenAI's o1 and o3 reasoning models - hidden chain-of-thought, reinforcement learning from process rewards, compute budget tokens, and ARC-AGI results.

OpenAI Swarm

OpenAI's experimental multi-agent framework: agents, handoffs, context variables, and the triage pattern. What it gets right and wrong.

OpenCoF: Learning to Reason Through Video Generation

Reasoning has become a core capability for large models, especially when reliable decisions require understanding logical consequences. Recent video gen...

OpenGame: Open Agentic Coding for Games

Game development sits at the intersection of creative design and intricate software engineering, demanding the joint orchestration of game engines, real...

OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis

Mobile agents powered by vision-language models have demonstrated impressive capabilities in automating mobile tasks, with recent leading models achievi...

OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

Deep search has become a crucial capability for frontier multimodal agents, enabling models to solve complex questions through active search, evidence v...

OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet their development remains dominated...

OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improvin...

OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

Spatial understanding is a fundamental cornerstone of human-level intelligence. Nonetheless, current research predominantly focuses on domain-specific d...

OpenTelemetry for AI Systems

Apply OpenTelemetry to AI and LLM applications - GenAI semantic conventions, auto-instrumentation, OTel Collector routing, sampling strategies, context propagation through async queues, and multi-backend production setups.

OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal La...

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

World models have garnered significant attention as a promising research direction in artificial intelligence, yet a clear and unified definition remain...

OPSD-V: On-Policy Self-Distillation for Post-Training Few-Step Autoregressive Video Generators

We propose OPSD-V, an on-policy self-distillation paradigm for post-training few-step autoregressive (AR) video diffusion models. Existing few-step AR v...

Optimal Spatio-Temporal Decoupling for Bayesian Conformal Prediction

Online Conformal Prediction (CP) struggles to balance temporal adaptability and structural stability. Feedback-driven methods (e.g., Adaptive Conformal...

Optimization Algorithms Deep Dive

Optimization algorithms in depth - SGD, momentum, Nesterov, AdaGrad, RMSProp, Adam derivation, AdamW, learning rate schedules, second-order methods, convergence theory, and why Adam beats SGD for transformers.

Optimized Deferral for Imbalanced Settings

Learning algorithms can be significantly improved by routing complex or uncertain inputs to specialized experts, balancing accuracy with computational c...

Optimizers: Adam, SGD, RMSProp

Complete optimizer guide - SGD momentum, Nesterov, AdaGrad, RMSProp, Adam bias correction derivation, AdamW decoupled weight decay, LAMB, Lion, AMSGrad - with NumPy Adam from scratch, PyTorch implementations, and the SGD vs Adam generalization debate.

Optimizing ML Docker Images

Reduce ML Docker images from 8GB to under 1.5GB using multi-stage builds, slim bases, BuildKit cache mounts, and image scanning.

OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often redu...

Orchard: An Open-Source Agentic Modeling Framework

Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn in...

Orchestration Patterns for End-to-End ML Pipelines

What dynamic DAGs, sensors, and fan-out/fan-in patterns do for AI systems, when ML workflows require data-aware scheduling and conditional branching across training and serving stages, and how to apply these patterns in production AI data pipelines.

Orchestrator-Subagent Pattern

The most reliable multi-agent pattern: one orchestrator plans, subagents execute. Deep dive into task decomposition, assignment strategies, and production-grade implementation.

OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits efficiency. We introduce OSP-Next, an ef...

OT on the Map: Quantifying Domain Shifts in Geographic Space

In computer vision and machine learning for geographic data, out-of-domain generalization is a pervasive challenge, arising from uneven global data cove...

Out-of-distribution transfer of PDE foundation models to material dynamics under extreme loading

Most PDE foundation models are pretrained and fine-tuned on fluid-centric benchmarks. Their utility under extreme-loading material dynamics remains uncl...

Outlines - Grammar-Constrained Generation

A complete guide to the Outlines library - Pydantic schema to FSM, regex constraints, JSON schema constraints, vLLM integration, and production deployment patterns with guaranteed output conformance.

Overload and Type Narrowing

Use @overload for multiple function signatures and TypeGuard, TypeIs, assert_never, and pattern matching for exhaustive type narrowing in Python.

Overview

Overview of cloud data platforms for AI and ML workloads.

Overview

Module overview for Pipeline Orchestration - turning ad-hoc scripts into reliable, observable, recoverable production data pipelines.

Overview

Overview of real-time feature engineering for low-latency ML systems.

OvisOCR2 Technical Report

We introduce OvisOCR2, a 0.8B document parsing model. OvisOCR2 is designed as an end-to-end parser: given a document page image, it generates a Markdown...

p1: Better Prompt Optimization with Fewer Prompts

Prompt optimization improves language models without updating their weights by searching for a better system prompt, but its effectiveness varies widely...

Packaging and Environments - Module Overview

Master Python packaging and environments at full engineering depth - virtual environments, pip and lockfiles, pyproject.toml, Poetry, semantic versioning, and publishing to PyPI for production-grade projects.

Packaging Projects - Overview

Overview of hands-on projects for Module 05 - Packaging and Environments. Build, test, version, and publish a real Python utility package from scratch.

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet the...

PalmClaw: A Native On-Device Agent Framework for Mobile Phones

Large Language Model (LLM) agents have moved beyond generating responses to executing multi-step tasks by calling tools, observing the results, and iter...

Pandas for ML

Pandas for machine learning engineers - DataFrame operations, missing data, groupby feature aggregation, time series, memory optimization, and building leakage-free feature matrices.

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill di...

Panoptic Pairwise Distortion Graph

In this work, we introduce a new perspective on comparative image assessment by representing an image pair as a structured composition of its regions. I...

PanoWorld: Real-World Panoramic Generation

In this work, we aim to address the challenge of long-range memory in panoramic world models by exploiting the rotation-equivariant property of omnidire...

Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion

Generating complete digital twins from videos requires precise camera control, global scene coverage, and strict spatial-temporal consistency constraint...

Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

The rapid growth of scientific literature has made it increasingly difficult for researchers to efficiently discover, evaluate, and synthesize relevant...

Paper Espresso: From Paper Overload to Research Insight

The accelerating pace of scientific publishing makes it increasingly difficult for researchers to stay current. We present Paper Espresso, an open-sourc...

Parallax: Parameterized Local Linear Attention for Language Modeling

Large Language Models (LLMs) have become the central paradigm in artificial intelligence, yet the core computational primitive of attention has remained...

Parallel Agent Execution

Running agents concurrently with asyncio, worker pools, DAG-based scheduling, rate limiting, and cost/speed tradeoffs in parallel multi-agent systems.

Parallelized Autoregressive Decoding for Omni-Modal Dense Video Captioning

Dense video captioning aims to generate temporally grounded descriptions of video events, benefiting both event-level video understanding and generation...

Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback

Estimating how well a person performs an action, rather than which action is performed, is central to coaching, rehabilitation, and talent identificatio...

ParamMem: Augmenting Language Agents with Parametric Reflective Memory

Self-reflection enables language agents to iteratively refine solutions, yet often produces repetitive outputs that limit reasoning performance. Recent...

ParamSpec and Concatenate

Solve the decorator typing problem with ParamSpec and Concatenate - preserve callable signatures through wrappers, type retry/logging decorators, and apply patterns from FastAPI middleware.

ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promisin...

Parcae: Scaling Laws For Stable Looped Language Models

Traditional fixed-depth architectures scale quality by increasing training FLOPs, typically through increased parameterization, at the expense of a high...

Partial Application and Currying - functools.partial, operator, and Function Pipelines

Master partial application and currying at engineering depth - functools.partial internals, inspecting partial objects, the distinction between partial application and currying, implementing currying in Python, the operator module as curried-style operations, function composition with reduce, and real-world usage in Django ORM, sorted(), and data pipelines.

Partition Function Estimation under Bounded f-Divergence

We study the statistical complexity of estimating partition functions given sample access to a proposal distribution and an unnormalized density ratio f...

Partition, Prompt, Aggregate: Statistical Self-Consistency in Language Models

In-context learning is commonly interpreted as a form of conditional inference, in which the prompt specifies a context and the model's output is treate...

PAST-TIDE: Prototype-Anchored Statement Tuning with Topic-Invariant Normalization for Stance Detection

We introduce PAST-TIDE, our stance detection system addressing both subtasks of the StanceNakba Shared Task at NakbaNLP@LREC-COLING 2026. The main idea...

Patient Outcome Prediction

Building clinical prediction models for hospital readmission, ICU mortality, and sepsis onset - feature engineering from EHR data, LSTM models for vital sign time series, survival analysis, calibration, and deployment in clinical workflows.

PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination

Patent examination is a complex, multi-stage process requiring both technical expertise and legal reasoning, increasingly challenged by rising applicati...

PCA Dimensionality Reduction

Principal Component Analysis via eigendecomposition and SVD - covariance geometry, reconstruction error, Kernel PCA, Incremental PCA, whitening, and production use for preprocessing and anomaly detection.

PCIe and NVLink Interconnects

Host-to-device PCIe bandwidth, GPU-to-GPU NVLink and NVSwitch, the interconnect hierarchy in multi-GPU systems, and how interconnect bandwidth shapes model parallelism strategies.

PCIe and NVLink Interconnects

Understand PCIe bandwidth limitations for CPU-GPU data transfer, NVLink for high-speed GPU-to-GPU communication, NVSwitch topology in DGX systems, and how to design systems that avoid interconnect bottlenecks in multi-GPU AI training.

PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft

We present PEAM, a Parametric Embodied Agent Memory framework in Minecraft that transforms agent memory from inference-time retrieval into parameter-res...

pEBR: A Probabilistic Approach to Embedding Based Retrieval.

pEBR: A Probabilistic Approach to Embedding Based Re... - published at EMNLP 2025.

PEEK: Picking Essential frames via Efficient Knowledge distillation

Video-language models can process only a limited number of frames, making frame selection a key bottleneck for efficient video captioning. Most captioni...

PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective

Parameter-efficient finetuning (PEFT) has become the standard approach for adapting large language models, yet evaluations largely emphasize downstream...

Perceptron and MLP

From the McCulloch-Pitts neuron to multi-layer perceptrons - the mathematical foundations of deep learning, XOR proof, universal approximation, forward pass mechanics, depth vs width theory, and full NumPy and PyTorch implementations.

Perceptual Flow Network for Visually Grounded Reasoning

Despite the success of Large-Vision Language Models (LVLMs), general optimization objectives (e.g., standard MLE) fail to constrain visual trajectories,...

Perplexity and Language Model Metrics

Understand perplexity, cross-entropy, bits per byte, and when intrinsic metrics mislead you about model quality.

Persona-SQ: A Personalized Suggested Question Generation Framework For Real-world Documents.

Persona-SQ: A Personalized Suggested Question Genera... - published at NAACL 2025.

PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents

Personalizing language models by effectively incorporating user interaction history remains a central challenge in the development of adaptive AI system...

Personalisation and Memory

User preference learning, conversation memory architecture, and personalised AI experiences that persist across sessions.

Personalization at Scale

Two-tower retrieval models, real-time feature serving, ANN search, and the full ML architecture that powers personalized recommendations for hundreds of millions of retail users.

Personalize-then-Store: Benchmarking and Learning Personalized Memory for Long-horizon Agents

Existing large language model (LLM) based memory systems apply universal, static policies that overlook a fundamental reality: the contexts that are wor...

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a centr...

Personalized Tutoring AI

Learn how to build AI tutoring systems using Socratic dialogue, LLM-based hint generation, worked example fading, affective state detection, and multi-session context management.

Personalizing Text-to-Image Generation to Individual Taste

Modern text-to-image (T2I) models generate high-fidelity visuals but remain indifferent to individual user preferences. While existing reward models opt...

PersonaVLM: Long-Term Personalized Multimodal LLMs

Multimodal Large Language Models (MLLMs) serve as daily assistants for millions. However, their ability to generate responses aligned with individual pr...

Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However...

Phase transitions in Doi-Onsager, Noisy Transformer, and other multimodal models

We study phase transitions for repulsive-attractive mean-field free energies on the circle. For a $\frac{1}{n+1}$-periodic interaction whose Fourier coe...

Phi and Small Language Models

Microsoft Phi model family - textbook quality data hypothesis, how 1-4B models can match much larger ones on reasoning tasks, and the design principles behind efficient small language models.

Phoenix by Arize - LLM Observability with Embedding Analysis

Master Arize Phoenix for open-source LLM observability - UMAP embedding visualization, drift detection, RAG coverage gap analysis, OpenTelemetry-native tracing, and LLM evaluation pipelines in production.

Phone Segmentation and Recognition through Phonological Activation Mapping

Phone segmentation and recognition are inherently related tasks, yet modern approaches typically model them separately. We argue that phonetic structure...

PhyCo: Learning Controllable Physical Priors for Generative Motion

Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebou...

PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions

We address the task of generating physically accurate and visually faithful 4D Human-Object Interaction (HOI). Given a static 3D human and target object...

PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

Generating realistic human motion is a central yet unsolved challenge in video generation. While reinforcement learning (RL)-based post-training has dri...

PhyMRI-SR: Toward Physics-Aware MRI Image Super-Resolution

Magnetic resonance imaging (MRI) super-resolution is vital for improving diagnostic accessibility, yet most methods treat it as a deterministic mapping...

PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World

Synthesizing physics-grounded 3D assets is a critical bottleneck for interactive virtual worlds and embodied AI. Existing methods predominantly focus on...

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

We introduce PhysicianBench, a benchmark for evaluating LLM agents on physician tasks grounded in real clinical settings within electronic health record...

Physics Informed Viscous Value Representations

Offline goal-conditioned reinforcement learning (GCRL) learns goal-conditioned policies from static pre-collected datasets. However, accurate value esti...

PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization

Recent progress in text-conditioned human motion generation has been largely driven by diffusion models trained on large-scale human motion data. Buildi...

PianoCoRe: Combined and Refined Piano MIDI Dataset

Symbolic music datasets with matched scores and performances are essential for many music information retrieval (MIR) tasks. Yet, existing resources oft...

pip and requirements - Dependency Management in Practice

Master pip and requirements files at full engineering depth - dependency resolution, version specifiers, pip-tools lockfiles, layered requirements, hash verification, supply-chain security, and private package indexes for production workflows.

Planning and Reasoning

How LLM agents handle complex multi-step tasks through plan-and-execute, hierarchical planning, self-reflection, and LangGraph-based workflows.

PlayCoder: Making LLM-Generated GUI Code Playable

Large language models (LLMs) have achieved strong results in code generation, but their ability to generate GUI applications, especially games, remains...

Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialog...

PledgeTracker: A System for Monitoring the Fulfilment of Pledges.

PledgeTracker: A System for Monitoring the Fulfilmen... - published at EMNLP 2025.

Plug-and-Play Diffusion Meets ADMM: Dual-Variable Coupling for Robust Medical Image Reconstruction

Plug-and-Play diffusion prior (PnPDP) frameworks have emerged as a powerful paradigm for solving imaging inverse problems by treating pretrained generat...

Plugin Systems - Building Extensible Applications

Build extensible Python applications with entry_points, importlib.metadata, stevedore, __init_subclass__, and plugin lifecycle management.

PLUME: Latent Reasoning Based Universal Multimodal Embedding

Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by gener...

PluraMath: Extending Mathematical Reasoning Evaluation Beyond High-Resource Languages

Mathematical reasoning has become a central task for evaluating and tuning reasoning Large Language Models (LLMs), yet existing benchmarks remain heavil...

PO-KGQA: Preference Optimization for Low-Resource Complex Knowledge Graph Question Answering.

PO-KGQA: Preference Optimization for Low-Resource Co... - published at ACL 2026.

Pods, Deployments, and Services - Deep Dive

Master the three core Kubernetes workload primitives for ML engineers - stateless serving with Deployments, traffic routing with Services, and advanced pod patterns for ML.

POEMetric: The Last Stanza of Humanity

Large Language Models (LLMs) can compose poetry, but how far are they from human poets? In this paper, we introduce POEMetric, the first comprehensive f...

Poetry - Dependency Management and Packaging Done Right

Master Poetry at engineering depth - lockfile mechanics, version constraints, dependency groups, virtualenv management, publishing, and CI integration for reproducible Python builds.

Point-in-Time Correctness

Time-travel queries, point-in-time joins, and preventing data leakage.

PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation

State-of-the-art single-image 3D reconstruction methods often rely on complex hybrid architectures and loss functions, or compress geometry into latent...

PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models

While Vision-Language Models (VLMs) have achieved remarkable progress in static visual understanding, their deployment in complex 3D embodied environmen...

PokeRL: Reinforcement Learning for Pokemon Red

Pokemon Red is a long-horizon JRPG with sparse rewards, partial observability, and quirky control mechanics that make it a challenging benchmark for rei...

Policy Gradient Methods

Directly optimize policies with gradient ascent - REINFORCE derivation, the log-derivative trick, variance reduction with baselines, actor-critic, A2C/A3C, and entropy regularization. The foundation for PPO and RLHF.

Policy-Aware Design of Large-Scale Factorial Experiments

Digital firms routinely run many online experiments on shared user populations. When product decisions are compositional, such as combinations of interf...

PolicyShiftGuard: Benchmarking and Improving Policy-Adaptive Image Guardrails

Image guardrails are typically trained and evaluated under a fixed safety policy, implicitly treating safety as an intrinsic property of an image. Real...

Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation

Synthesizing supervised finetuning (SFT) data from language models (LMs) to teach smaller models multilingual tasks has become increasingly common. Howe...

Polynomial Features and Kernel Methods

Extend linear models to nonlinear patterns - polynomial basis expansion, curse of dimensionality, Mercer's theorem for valid kernels, RBF kernel via infinite-dimensional feature space, kernel ridge regression dual form, Nyström and random Fourier features for scalability.

Pooling, Strides, and Padding

Why spatial downsampling exists, how max pooling and strided convolutions compare, how padding controls output dimensions, receptive field growth, dilated convolutions, transposed convolutions, and when to use each - with PyTorch examples.

Portkey

Use Portkey as a managed LLM gateway with built-in observability, virtual keys, guardrails, request tracing, feedback collection, and automated fallbacks across Claude, GPT-4o, and 250+ providers.

POS-ISP: Pipeline Optimization at the Sequence Level for Task-aware ISP

Recent work has explored optimizing image signal processing (ISP) pipelines for various tasks by composing predefined modules and adapting them to task-...

Position-Aware Depth Decay Decoding (D³): Boosting Large Language Model Inference Efficiency.

Position-Aware Depth Decay Decoding (D³): Boosting L... - published at ACL 2025.

Position: agentic AI orchestration should be Bayes-consistent

LLMs excel at predictive tasks and complex reasoning tasks, but many high-value deployments rely on decisions under uncertainty, for example, which tool...

Positional Encoding

How positional encodings inject sequence order information into transformers - from sinusoidal to RoPE.

Positional versus Symbolic Attention Heads: Learning Dynamics, RoPE Geometry, and Length Generalization

Transformer-based language models are widespread in today's society. As such, understanding the mechanisms by which they solve structured tasks and pred...

Post-Training Quantization Methods

A practical guide to PTQ methods for LLMs - GPTQ, AWQ, SmoothQuant, bitsandbytes, GGUF, and HQQ compared by accuracy, speed, memory, and production use case.

Power one sequential tests exist for weakly compact $\mathscr P$ against $\mathscr P^c$

Suppose we observe data from a distribution $P$ and we wish to test the composite null hypothesis that $P\in\mathscr P$ against a composite alternative...

PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction

Three-dimensional medical image data and computer-aided decision making, particularly using deep learning, are becoming increasingly important in the me...

Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured ph...

Pre-Commit Hooks - Automate Quality Gates Before Every Commit

Master the pre-commit framework at engineering depth - Git hook mechanics, .pre-commit-config.yaml structure, building production hook pipelines with ruff, black, mypy, detect-secrets, and pytest, CI integration, team adoption strategy, and hook performance tuning.

Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?

Unlike code completion, debugging requires localizing faults and applying targeted edits. We observe that frontier LLMs often regenerate correct but ove...

Predicting integers from continuous parameters

We study the problem of predicting numeric labels that are constrained to the integers or to a subrange of the integers. For example, the number of up-v...

Prediction-powered Inference by Mixture of Experts

The rapidly expanding artificial intelligence (AI) industry has produced diverse yet powerful prediction tools, each with its own network architecture,...

Predictive Coding Graphs are a Superset of Feedforward Neural Networks

Predictive coding graphs (PCGs) are a recently introduced generalization to predictive coding networks, a neuroscience-inspired probabilistic latent var...

Predictive Maintenance with AI

Learn how AI systems predict equipment failures before they happen using sensor data, feature engineering, anomaly detection, and remaining useful life prediction.

Prefect

Building and deploying production ML workflows using Prefect 2.x/3.x - flows, tasks, deployments, work pools, and observability.

Prefect and Modern Orchestration

Prefect orchestration deep dive - flows, tasks, deployments, work pools, automations, and a direct comparison with Apache Airflow.

Prescriptive Scaling Laws for Data Constrained Training

Training compute is increasingly outpacing the availability of high-quality data. This shifts the central challenge from optimal compute allocation to e...

Pretraining at Scale

The infrastructure, parallelism strategies, memory optimizations, and training data choices required to pretrain large language models on thousands of GPUs.

PRIM-cipal components analysis

Supervised No Free Lunch Theorems (NFLTs) are well studied, yet unsupervised NFLTs remain underexplored. For elliptical distributions, we prove that the...

Principled Analysis of Deep Reinforcement Learning Evaluation and Design Paradigms

Starting from the utilization of deep neural networks to approximate the state-action value function that led to winning one of the most challenging gam...

Prior-Aligned Data Cleaning for Tabular Foundation Models

Tabular Foundation Models (TFMs) achieve state-of-the-art zero-shot accuracy on small tabular datasets by meta-learning over synthetic data-generating p...

PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers

The rapid growth in submissions to machine learning venues has strained the scientific peer-review system and intensified interest in LLM-based automate...

PRISM: LLM-Guided Semantic Clustering for High-Precision Topics

In this paper, we propose Precision-Informed Semantic Modeling (PRISM), a structured topic modeling framework combining the benefits of rich representat...

PRISM: Position-encoded Regressive Inverse Spectral Model for Multilayer Thin-Film Design

The inverse problem of multilayer thin-film optical coatings design represents a complex combinatorial-continuous optimization challenge. We present PRI...

Privacy and Air-Gapped Deployment

Deploying LLMs in air-gapped environments without internet access - pre-downloading models, offline HuggingFace usage, regulatory compliance, and architecture for privacy-critical AI.

Privacy and Ethics in Synthetic Data

Copyright exposure, memorization risks, differential privacy, bias auditing, terms-of-service compliance, and the governance processes required for defensible synthetic data pipelines.

PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research

The paradigm of agentic science requires AI systems to conduct robust reasoning and engage in long-horizon, autonomous exploration. However, current sci...

Probabilistic Joint and Individual Variation Explained (ProJIVE) for Data Integration

Collecting multiple types of data on the same set of subjects is common in modern scientific applications, including genomics, metabolomics, and neuroim...

Probing the Geometry of Diffusion Models with the String Method

Understanding the geometry of learned distributions is fundamental to improving and interpreting diffusion models, yet systematic tools for exploring th...

Probing Visual Planning in Image Editing Models

Visual planning represents a crucial facet of human intelligence, especially in tasks that require complex spatial reasoning and navigation. Yet, in mac...

Problem-Solving Logic Guided Curriculum In-Context Learning for LLMs Complex Reasoning.

Problem-Solving Logic Guided Curriculum In-Context L... - published at ACL 2025.

Procedural Memory and Learned Skills

How agents store and reuse successful action sequences: skill formation, retrieval, composition, and refinement from execution feedback.

Proceedings of Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning @ COLING 2025.

Proceedings of Bridging Neurons and Symbols for Natu... - published at COLING 2025.

Proceedings of Context and Meaning: Navigating Disagreements in NLP Annotation.

Proceedings of Context and Meaning: Navigating Disag... - published at COLING 2025.

Proceedings of the 5th Celtic Language Technology Workshop.

Proceedings of the 5th Celtic Language Technology Wo... - published at COLING 2025.

Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal).

Proceedings of the Joint Workshop of the 9th Financi... - published at COLING 2025.

Process Optimization with Reinforcement Learning

Learn how to formulate manufacturing process control as an MDP, design safe reward functions, use offline RL from historical data, and deploy RL policies in production industrial settings.

Process Reward Models (PRMs)

How process reward models provide step-level supervision for reasoning - the Lightman et al. 2023 paper, Math-Shepherd, using PRMs for search, and their limitations.

Processes, Threads, and Coroutines

Learn how processes, threads, and coroutines work at the OS level, and how to choose the right concurrency model for ML workloads - data loading, inference, and async API calls.

Production Agent Monitoring

Monitoring agents in production - task completion metrics, distributed tracing, anomaly detection, alerting, and the production improvement flywheel.

Production Async Architecture

Build production-grade async systems with error handling strategies, graceful shutdown, health checks, backpressure, async testing with pytest-asyncio, and structured logging.

Production Lessons

12 hard-won lessons from deploying agentic systems at scale - each with a war story, a principle, and a code pattern you can use today.

Production Monitoring for LLMs

Build a comprehensive production monitoring stack for LLMs - latency, cost, quality drift, safety, and observability platforms compared.

Production Multimodal Systems

Build and operate multimodal AI pipelines at production scale - image preprocessing, cost control, VLM hallucination mitigation, caching, security, and observability for vision-language workloads.

Production Patterns

Case studies in real-time feature engineering from Uber, Twitter, and LinkedIn.

Profiling Python and C Code

Master the complete profiling toolkit - cProfile, line_profiler, py-spy, Scalene, Valgrind, and PyTorch Profiler - to find and eliminate bottlenecks in Python and ML training code.

Profiling Strategy - Measure Before You Optimize

Amdahl's law, the profiling workflow, identifying hotspots, benchmarking methodology with timeit, performance budgets, and the discipline of measuring before optimizing.

Profiling with Nsight

Learn how to use Nsight Systems and Nsight Compute to find GPU performance bottlenecks, read roofline charts, interpret warp stall reasons, and use the PyTorch profiler to guide real optimization decisions.

Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

Reliably transferring specialized human knowledge from text into large language models remains a fundamental challenge in artificial intelligence. Fine-...

Project 01 - Publish an Internal Utility Package

Build, test, version, and publish pyutils-engineersofai - a typed Python utility library with src/ layout, hatchling build backend, full pytest coverage, CHANGELOG, and GitLab CI pipeline that publishes on v* tags.

Prometheus and Grafana for ML

Building production ML observability infrastructure - Prometheus architecture, custom ML metrics, PromQL for ML, Grafana dashboard design for model serving, and scaling with Thanos for long-term storage.

Prompt Debugging Methodology

Systematic methodology for diagnosing and fixing prompt failures - isolation, ablation, root cause analysis, and building a regression test suite.

Prompt Design Fundamentals

Master the first principles of prompt engineering - clarity, specificity, task framing, structural markers, and the systematic principles behind effective LLM instructions.

Prompt Injection

How prompt injection attacks work, why they are the most critical AI vulnerability in production, and how to defend against them with layered mitigations.

Prompt Injection and Security

Understand how prompt injection attacks work, why they're hard to defend against, and how to build LLM systems that are resistant to manipulation.

Prompt Injection Defense

Understand prompt injection attack taxonomy, detection strategies, defense layers, and sanitization techniques for production LLM systems.

Prompt Management

Treat prompts as production artifacts - versioning, registry design, testing frameworks, A/B testing prompts, automated optimization with DSPy, and prompt governance.

Prompt Optimization and DSPy

Move beyond manual prompt engineering to automated, evaluation-driven optimization - using APE, OPRO, and DSPy to build LLM pipelines that improve themselves.

Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation

Video diffusion models have achieved remarkable progress in generating high-quality videos. However, these models struggle to represent the temporal suc...

Prompt Templates and Composition

Build maintainable, production-grade prompt systems with Jinja2 templates, variable injection, modular composition, and reusable prompt libraries.

Prompt UX Patterns

Prompt scaffolding, slash commands, context transparency, and mode switching in production AI interfaces.

Prompt Versioning

Treating prompts as first-class code artifacts - versioning, branching, review gates, A/B testing, and rollback for production LLM prompts. Build a complete prompt registry from scratch.

Prompt Versioning and Management

Treat prompts as code - semantic versioning, A/B testing, rollback strategies, and prompt registries for production LLM systems.

PromptLab: A Collaborative Platform for Prompt Engineering and Dataset Curation.

PromptLab: A Collaborative Platform for Prompt Engin... - published at EACL 2026.

ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendations. Reinfor...

Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models. However, unautho...

Protocol and Structural Subtyping

Master typing.Protocol for structural subtyping in Python - define interfaces based on behavior, compose protocols, make duck typing type-safe, and apply patterns from Django and file-like objects.

Proximal Policy Optimisation - The Algorithm That Runs ChatGPT's RLHF

PPO: the dominant policy gradient algorithm - how clipping the probability ratio prevents destructive policy updates while maintaining the efficiency of on-policy learning.

Proxy Exploration and Reusable Guidance: A Modular LLM Post-Training Paradigm via Proxy-Guided Update Signals

Post-training is essential for refining the domain-specific capabilities of large language models (LLMs), yet existing reward optimization and distribut...

Pruning and Depth Control

How to prevent decision tree overfitting through pre-pruning parameters, cost-complexity post-pruning, weakest-link pruning, MDL principle, and production-grade tuning strategies.

Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

Unified multimodal models (UMMs) were designed to combine the reasoning ability of large language models (LLMs) with the generation capability of vision...

PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech

Standard text-to-speech (TTS) evaluation measures intelligibility (WER, CER) and overall naturalness (MOS, UTMOS) but does not quantify accent. A synthe...

PTOPOFL: Privacy-Preserving Personalised Federated Learning via Persistent Homology

Federated learning (FL) faces two structural tensions: gradient sharing enables data-reconstruction attacks, while non-IID client distributions degrade...

Publishing Packages - From Source to PyPI

Master Python package publishing at engineering depth - sdist vs wheel formats, build backends, TestPyPI workflow, twine and Poetry publishing, API tokens, private registries, and automated CI/CD release pipelines.

Pulumi for ML

Write ML infrastructure in real Python - Pulumi's code-first approach, component resources, Automation API, and testing with pytest for reproducible ML platforms.

Pure Functions - Testability, Memoisation, and the Functional Core Pattern

Master pure functions at engineering depth - same inputs always produce same outputs with no side effects, referential transparency, how to identify and eliminate side effects, the functional core / imperative shell architecture, and why purity unlocks testability, caching, and thread safety.

pyproject.toml - The Modern Python Project Standard

Master pyproject.toml at full engineering depth - PEP 517/518/621 build system specification, build backends, the full project table, optional dependencies, entry points, tool configuration, src layout, dynamic versioning, and building distribution artifacts.

pytest - The Industry-Standard Test Framework

Master pytest at full engineering depth - assertion rewriting via AST transformation, fixtures with scope, conftest.py, parametrize, monkeypatch, capsys, built-in marks, essential plugins, and pyproject.toml configuration for production test suites.

PyTorch DataLoaders and Datasets

Build custom PyTorch Datasets and high-performance DataLoaders - batching, num_workers, pin_memory, samplers, WebDataset for streaming, custom collate_fn, and profiling.

PyTorch Foundations

PyTorch fundamentals for ML engineers - tensors, autograd, nn.Module, device management, reproducibility, mixed precision training, and the computation graph that makes debugging natural.

PyTorch Training Loop

Write production-grade PyTorch training loops - learning rate scheduling, gradient accumulation, mixed precision, checkpointing, early stopping, and debugging.

Q-Learning and SARSA

Model-free temporal difference learning - Q-learning for off-policy control and SARSA for on-policy control. Understand TD vs MC vs DP, convergence conditions, eligibility traces, Double Q-learning, and implement Q-tables in NumPy.

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

MLLMs require high-resolution visual inputs for fine-grained tasks like document understanding and dense scene perception. However, current global resol...

QEIL v2: Heterogeneous Computing for Edge Intelligence via Roofline-Derived Pareto-Optimal Energy Modeling and Multi-Objective Orchestration

Deploying large language models (LLMs) on heterogeneous edge devices demands frameworks that jointly optimize energy efficiency, inference quality, and...

QiMeng-PRepair: Precise Code Repair via Edit-Aware Reward Optimization

Large Language Models (LLMs) achieve strong program repair performance but often suffer from over-editing, where excessive modifications overwrite corre...

QLoRA: 4-Bit Fine-Tuning

Learn how QLoRA combines 4-bit NF4 quantization, double quantization, and paged optimizers to fine-tune 65B parameter models on a single GPU - covering the math, implementation, and production engineering.

QLoRA: Quantized Low-Rank Adaptation

Learn how QLoRA combines 4-bit quantization with LoRA to fine-tune 65B parameter models on a single 48 GB GPU, using NF4 quantization, double quantization, and paged optimizers.

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) a...

Quality Metrics in Production LLM Systems

Define, measure, and operationalize quality metrics for production LLM applications - faithfulness, answer relevance, hallucination rate, coherence, toxicity, BLEU vs LLM-as-judge, SLO definitions, and async evaluation pipelines.

Qualixar OS: A Universal Operating System for AI Agent Orchestration

We present Qualixar OS, the first application-layer operating system for universal AI agent orchestration. Unlike kernel-level approaches (AIOS) or sing...

QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

Large Language Models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks,...

Quantifying and Expanding the Theoretical Capacity of Late-Interaction Retrieval Models

Late-interaction retrieval models that use the MaxSim similarity function have shown strong empirical performance, often outperforming single-vector den...

Quantization Benchmarking

How to rigorously evaluate quantization quality using perplexity, downstream task accuracy, latency, and memory metrics - and build a complete benchmarking pipeline comparing FP16 vs GPTQ vs AWQ vs NF4.

Quantization Deep Dive

INT8, INT4, NF4, FP8, and block-wise quantization explained from first principles - how floating point becomes integer, what accuracy you lose, and how to tune quantization for production LLM inference.

Quantization Error Debugging

How to diagnose and fix quantization quality degradation - symptoms, root causes, diagnostic tools, and systematic fixes for INT4/INT8 quantized LLMs.

Quantization for Vision Models

How to quantize CNN and ViT vision models and vision-language models - handling batch norm sensitivity, attention outliers, and the strategy of quantizing the LLM backbone while keeping the vision encoder in FP16.

Quantization Hardware Tradeoffs

How INT8, INT4, FP8, and NF4 quantization change memory bandwidth utilization, Tensor Core throughput, and inference latency on real GPUs, including hardware support matrices and production calibration strategies.

Quantization-Aware Training

When post-training quantization is not enough - how QAT simulates quantization noise during training so models learn to be robust to it, covering the straight-through estimator, QLoRA, and BitNet.

Quantization: INT8 and INT4

Master LLM quantization techniques - from LLM.int8() to GPTQ and AWQ - to run large models on commodity hardware without unacceptable quality loss.

Quantum Diffusion Models: Score Reversal Is Not Free in Gaussian Dynamics

Diffusion-based generative modeling suggests reversing a noising semigroup by adding a score drift. For continuous-variable Gaussian Markov dynamics, co...

Quantum Interval Bound Propagation for Certified Training of Quantum Neural Networks

Quantum machine learning is a promising field for efficiently learning features of a dataset to perform a specified task, such as classification. Interv...

Query Transformation and HyDE

Master query transformation techniques - HyDE, multi-query retrieval, step-back prompting, query decomposition, and routing - to solve the vocabulary mismatch problem that breaks naive RAG systems in production.

QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks

Deep research agents extend the role of search engines from retrieving keyword-matched pages to synthesizing knowledge, fundamentally changing how human...

Quotient-Based Posterior Analysis for Euclidean Latent Space Models

Latent space models are widely used in statistical network analysis and are often fit by Markov chain Monte Carlo. However, posterior summaries of laten...

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capab...

Qwen, DeepSeek, and International Models

Alibaba Qwen and DeepSeek architectural innovations - MLA attention, DeepSeekMoE, multi-token prediction, and how Chinese labs are advancing open-source LLM research.

Qwen3.5-Omni Technical Report

In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor,...

R3PM-Net: Real-time, Robust, Real-world Point Matching Network

Accurate Point Cloud Registration (PCR) is an important task in 3D data processing, involving the estimation of a rigid transformation between two point...

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interac...

Rad-Flamingo: A Multimodal Prompt driven Radiology Report Generation Framework with Patient-Centric Explanations.

Rad-Flamingo: A Multimodal Prompt driven Radiology R... - published at EACL 2026.

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

Vision-language models (VLM) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT)....

RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments

We present RADIO-ViPE (Reduce All Domains Into One - Video Pose Engine), an online semantic SLAM system that enables geometry-aware open-vocabulary gro...

Radiology AI in Production

Deploying radiology AI into clinical workflows - PACS integration, DICOM processing, FDA clearance, worklist prioritization, and monitoring for distribution shift in live hospital environments.

RAG Evaluation

Build rigorous RAG evaluation with RAGAS, TruLens, LLM-as-judge, golden datasets, and production monitoring - measure faithfulness, relevance, and groundedness.

RAG Evaluation and RAGAS

Build a continuous RAG evaluation pipeline using the RAGAS framework - faithfulness, answer relevance, context precision, and context recall - with full production implementations using the Anthropic SDK and automated regression detection.

RAG Evaluation Metrics

Evaluate RAG systems with precision - the RAG triad, RAGAS framework, golden datasets, and retrieval metrics for production pipelines.

RAG Pipeline Ops

Operate RAG pipelines in production - index refresh strategies, chunk strategy updates, embedding drift detection, vector database monitoring, and quality tracking.

RAG System Design

How to design Retrieval Augmented Generation systems for production - from naive RAG to advanced pipelines with chunking strategies, hybrid search, reranking, and RAG evaluation.

RAG vs Long Context - When to Use Each

A rigorous cost, latency, and accuracy comparison of retrieval-augmented generation versus long-context stuffing, with decision frameworks for production use cases.

RAG-Specific Evaluation

Master the full evaluation stack for Retrieval-Augmented Generation systems - covering RAGAS metrics, hallucination type classification, citation accuracy, retrieval precision/recall/nDCG, and production-grade benchmarking with complete Python implementations.

RAGEN-2: Reasoning Collapse in Agentic RL

RL training of multi-turn LLM agents is inherently unstable, and reasoning quality directly determines task performance. Entropy is widely used to track...

RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation

We present our winning system for Task B (generation with reference passages) in SemEval-2026 Task 8: MTRAGEval. Our method is a heterogeneous ensemble...

Random Forests

Master Random Forests from first principles - bagging variance reduction math, feature randomization, OOB error estimation, Extra-Trees, bias-variance decomposition, MDI vs permutation importance, and production deployment patterns.

Randomized Algorithms and Sketching

Randomized algorithms in ML - reservoir sampling for streaming data, Johnson-Lindenstrauss projections, Count-Min Sketch, HyperLogLog, randomized SVD, and locality-sensitive hashing for approximate nearest neighbor search.

Randomized Subspace Nesterov Accelerated Gradient

Randomized-subspace methods reduce the cost of first-order optimization by using only low-dimensional projected-gradient information, a feature that is...

Rank-Then-Act: Reward-Free Control from Frame-Order Progress

We introduce Rank-Then-Act (RTA), a framework for learning control policies from expert video demonstrations without environment rewards. RTA trains a V...

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

As interactive LLM-based applications are created and refined, model developers need to evaluate the quality of generated text along many possible axes....

Rate Limiting and Cost Control

Controlling costs and preventing abuse in LLM API serving - token-based rate limiting, Redis token buckets, tenant isolation, cost attribution, budget alerts, and abuse detection.

Rate Limiting and Quotas

Protect your LLM infrastructure from abuse and cost overruns with token bucket rate limiting and sliding window quotas per user, team, and feature - enforced at the gateway before any tokens are consumed.

RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference....

RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

Causal autoregressive video diffusion models support real-time streaming generation by extrapolating future chunks from previously generated content. Di...

RAViT: Resolution-Adaptive Vision Transformer

Vision transformers have recently made a breakthrough in computer vision showing excellent performance in terms of precision for numerous applications....

Raw API Agent Patterns

Building production agents with just the Anthropic SDK - the agentic loop, tool handling, context management, cost tracking, and a complete 200-line implementation.

RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training...

Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

Recovering camera parameters from images and rendering scenes from novel viewpoints have long been treated as separate tasks in computer vision and grap...

RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models

Fine-tuning Large Language Models (LLMs) remains structurally uncertain despite parameter-efficient methods such as Low-Rank Adaptation (LoRA), as the l...

ReAct Agent Pattern

Building LLM agents that interleave reasoning traces and actions in a ReAct loop to solve multi-step tasks with tool grounding.

ReAct Pattern

Learn how to build LLM agents that reason and act by interleaving thought and tool calls - the architectural pattern behind every modern AI assistant.

ReactiveGWM: Steering NPC in Reactive Game World Models

Current game world models simulate environments from a subjective, player-centric perspective. However, by treating the Non-Player Character (NPC) merel...

Read It Back: Pretrained MLLMs Are Zero-Shot Reward Models for Text-to-Image Generation

In this paper, we propose SpectraReward, a training-free reward function that turns pretrained MLLMs into off-the-shelf reward models for image-generati...

Real-Time Aggregations

Windowed aggregations, sessionisation, and user behaviour features in real time.

Real-Time Feature Computation for ML Inference

How to build streaming feature pipelines that compute fresh ML features at production scale, including dual-store architecture, training-serving skew prevention, and hot key mitigation.

Real-Time Feature Engineering at Scale

Computing ML features from raw events within milliseconds - Redis patterns, sliding window aggregations, session detection, and Uber's Michelangelo real-time pipeline.

Real-Time Inference Design

Architecture for ML inference at 1M QPS with sub-10ms SLA - synchronous vs async real-time, circuit breakers, fallback models, and timeout budget management.

Real-Time Surrogate Modeling for Personalized Blood Flow Prediction and Hemodynamic Analysis

Cardiovascular modeling has rapidly advanced over the past few decades due to the rising needs for health tracking and early detection of cardiovascular...

REAM: Merging Improves Pruning of Experts in LLMs

Mixture-of-Experts (MoE) large language models (LLMs) are among the top-performing architectures. The largest models, often with hundreds of billions of...

REaR : Retrieve, Expand and Refine for Effective Multitable Retrieval.

REaR : Retrieve, Expand and Refine for Effective Mul... - published at ACL 2026.

Reasoning and Math Evaluation

Evaluating LLM mathematical and logical reasoning - GSM8K, MATH, AIME benchmarks, chain-of-thought evaluation, process reward models, self-consistency voting, and measuring multi-step reasoning quality.

Reasoning Knowledge Filter for Logical Table-to-Text Generation.

Reasoning Knowledge Filter for Logical Table-to-Text... - published at COLING 2025.

Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Governance.

Reasoning-Enhanced Domain-Adaptive Pretraining of Mu... - published at EMNLP 2025.

RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval

We propose RecaLLM, a set of reasoning language models post-trained to make effective use of long-context information. In-context retrieval, which ident...

RecGPT-V3 Technical Report

Large language models (LLMs) are transforming recommender systems from matching co-occurrence patterns in historical behavior toward reasoning about the...

RECIPE-TKG: From Sparse History to Structured Reasoning for LLM-based Temporal Knowledge Graph Completion.

RECIPE-TKG: From Sparse History to Structured Reason... - published at EACL 2026.

Recommendation Systems at Scale

End-to-end system design for YouTube-scale video recommendation - candidate generation, multi-stage ranking, post-processing for diversity, cold start, and session modeling.

ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video

Reconstructing non-rigid objects with physical plausibility remains a significant challenge. Existing approaches leverage differentiable rendering for p...

Recovering Hidden Reward in Diffusion-Based Policies

This paper introduces EnergyFlow, a framework that unifies generative action modeling with inverse reinforcement learning by parameterizing a scalar ene...

Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents

While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, hindering real-world deployment. To bridge this...

Recursive Flow Matching

Generative models have emerged as a powerful paradigm for solving physics systems and modeling complex spatiotemporal dynamics. However, achieving high...

Recursive Harness Self-Improvement

Under model-harness co-evolution, harnesses are not merely inference-time scaffolds but data-generating components whose execution traces can shape fut...

Recursive Maximum Likelihood Estimation for Interacting Particle Systems using Virtual Particles

We study recursive maximum likelihood estimation for stochastic interacting particle systems based on continuous observation of a single particle. In th...

Recursive Multi-Agent Systems

Recursive or looped language models have recently emerged as a new scaling axis by iteratively refining the same model computation over latent states to...

Red Queen: Exposing Latent Multi-Turn Risks in Large Language Models.

Red Queen: Exposing Latent Multi-Turn Risks in Large... - published at ACL 2025.

Red Teaming AI Systems

Systematic adversarial testing of AI systems - methodology, automated red teaming, documentation, and building a continuous red team program.

Red Teaming LLMs

Systematic adversarial evaluation of language models - manual red teaming, automated red teaming with LLMs, failure taxonomies, and building a production red team process.

RedOne: Revealing Domain-specific LLM Post-Training in Social Networking Services.

RedOne: Revealing Domain-specific LLM Post-Training... - published at EMNLP 2025.

Reference Counting - How CPython Manages Memory at the C Level

Master CPython's reference counting mechanism at engineering depth - ob_refcnt, sys.getrefcount, ctypes raw refcount access, tp_dealloc, reference cycles, weakref module, and why del x does not immediately destroy an object.

RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

We introduce region-specific image refinement as a dedicated problem setting: given an input image and a user-specified region (e.g., a scribble mask or...

Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

Unified multimodal models (UMMs) integrate visual understanding and generation within a single framework. For text-to-image (T2I) tasks, this unified ca...

ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

We introduce ReflectDrive-2, a masked discrete diffusion planner with separate action expert for autonomous driving that represents plans as discrete tr...

Reflective Context Learning: Studying the Optimization Primitives of Context Space

Generally capable agents must learn from experience in ways that generalize across tasks and environments. The fundamental problems of learning, includi...

Reflective Prompt Tuning through Language Model Function-Calling

Large language models (LLMs) have become increasingly capable of following instructions and complex reasoning, making prompting a flexible interface for...

Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation.

Registering Source Tokens to Target Language Spaces... - published at ACL 2025.

Registers Matter for Pixel-Space Diffusion Transformers

Vision Transformers (ViTs) are known to exhibit high-norm patch-token outliers that degrade feature map quality, a problem effectively mitigated by regi...

Regression Testing for Prompts

Build a production-grade regression testing system for LLM prompts - covering test case design, LLM-as-judge pass/fail evaluation, flaky test detection, caching, differential testing, and CI gates that block regressions before they reach users.

Regular Fourier Features for Nonstationary Gaussian Processes

Simulating a Gaussian process requires sampling from a high-dimensional Gaussian distribution, which scales cubically with the number of sample location...

Regularity of Solutions to Beckmann's Parametric Optimal Transport

Beckmann's problem in optimal transport minimizes the total squared flux in a continuous transport problem from a source to a target distribution. In th...

Regularization - L1, L2, and ElasticNet

Master regularization from first principles - bias-variance decomposition, L2 Bayesian interpretation as Gaussian prior, L1 sparsity via subdifferential geometry, elastic net path algorithms, coordinate descent for LASSO, and cross-validation for lambda selection.

Regularized Online RLHF with Generalized Bilinear Preferences

We consider the problem of contextual online RLHF with general preferences, where the goal is to identify the Nash Equilibrium. We adopt the Generalized...

ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-...

Reinforcement Learning for Aligning Large Language Models Agents with Interactive Environments: Quantifying and Mitigating Prompt Overfitting.

Reinforcement Learning for Aligning Large Language M... - published at NAACL 2025.

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individu...

Reinforcement Learning via Value Gradient Flow

We study behavior-regularized reinforcement learning (RL), where regularization toward a reference distribution (the dataset in offline RL or the base m...

Reinforcement Learning with Markov Risk Measures and Multipattern Risk Approximation

For a risk-averse finite-horizon Markov Decision Problem, we introduce a special class of Markov coherent risk measures, called mini-batch measures. We...

Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains cha...

Remember When It Matters: Proactive Memory Agent for Long-Horizon Agents

In long-horizon tasks, decision-relevant state is often scattered across an expanding trajectory, while the action agent must surface it and act. As tra...

RemoteZero: Geospatial Reasoning with Zero Human Annotations

Geospatial reasoning requires models to resolve complex spatial semantics and user intent into precise target locations for Earth observation. Recent pr...

REPOT: Recoverable Program-of-Thought via Checkpoint Repair

One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the trajector...

Representation and String Methods - repr, str, format at Engineering Depth

Master Python's string representation protocol - __repr__ vs __str__, the eval() contract, __format__ for custom f-string specs, __bytes__, the !r !s !a conversion flags, and how great repr() transforms production debugging.

Representation Learning for Spatiotemporal Physical Systems

Machine learning approaches to spatiotemporal physical systems have primarily focused on next-frame prediction, with the goal of learning an accurate em...

Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO

Temporal credit assignment in reinforcement learning has long been a central challenge. Inspired by the multi-timescale encoding of the dopamine system...

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

Accurate future video prediction requires both high visual fidelity and consistent scene semantics, particularly in complex dynamic environments such as...

Representing the Under-Represented: Cultural and Core Capability Benchmarks for Developing Thai Large Language Models.

Representing the Under-Represented: Cultural and Cor... - published at COLING 2025.

Reproducibility and Auditability in ML Systems

Learn how to build fully reproducible ML systems - covering the reproducibility stack, DVC, MLflow, Docker, seed management, GDPR compliance, and financial model audits.

Reproducibility in ML

Learn the four layers of ML reproducibility - environment, data, code, and model - and how to achieve each in practice with Docker, DVC, MLflow, and seed management.

Request-Response Lifecycle - Every Step From Client to Handler and Back

Trace an HTTP request through its full 15+ step lifecycle - DNS, TCP, TLS, load balancer, reverse proxy, ASGI server, middleware, routing, validation, handler, serialisation, and response - with production debugging techniques.

Requirements and Constraints for ML Systems

How to gather, prioritize, and translate business requirements into technical specifications for ML systems - including latency budgets, SLOs, and ML-specific constraints.

Reranking

Master the two-stage retrieval-reranking architecture - cross-encoders, ColBERT, LLM-as-reranker, Reciprocal Rank Fusion, and production latency budgets.

ResearchMath-14K: Scaling Research-Level Mathematics via Agents

The frontier of mathematics is defined by problems whose solutions are not yet known, yet it remains unclear whether language models can meaningfully en...

Responsible Agentic AI

Safety principles, EU AI Act compliance, accountability chains, bias, privacy, red-teaming, and building a safety review process for autonomous agent systems.

Responsible AI and Ethics - Building Systems That Don't Cause Harm

Fairness metrics, bias detection, privacy-preserving ML, model auditing, and the regulatory frameworks every ML engineer must understand.

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning of Large Language Models (LLMs) but usually exhibits limited generation diversi...

REST Principles - Designing APIs That Don't Break Clients

Master REST at engineering depth - Roy Fielding's six constraints, uniform interface, URL design, HTTP method semantics, status codes, pagination patterns, versioning strategies, RFC 7807 error format, and the Richardson Maturity Model.

REST vs gRPC for ML Model Serving

A production engineer's guide to choosing between REST and gRPC for ML APIs - protocol mechanics, performance trade-offs, and when each wins.

Retail Data Engineering

POS data streams, customer data platform architecture, real-time feature computation with Flink, medallion data lake architecture for retail, privacy compliance, and event streaming pipelines for retail ML.

Rethinking Forward Processes for Score-Based Data Assimilation in High Dimensions

Data assimilation is the process of estimating the time-evolving state of a dynamical system by integrating model predictions and noisy observations. It...

Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit t...

Rethinking Memory as Continuously Evolving Connectivity

Existing memory-augmented LLM agents often treat memory as a static repository with pre-defined representations and fixed retrieval pipelines, which is...

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understo...

Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

Reasoning-intensive retrieval aims to surface evidence that supports downstream reasoning rather than merely matching topical similarity. This capabilit...

Rethinking the Diffusion Model from a Langevin Perspective

Diffusion models are often introduced from multiple perspectives, such as VAEs, score matching, or flow matching, accompanied by dense and technically d...

Rethinking the Evaluation of Harness Evolution for Agents

We revisit the evaluation of automatic harness evolution for LLM agents. Existing harness evolution methods use unit test cases to search for harness co...

Rethinking VLM Representation for VLA Initialization

Vision-Language-Action (VLA) models widely adopt pretrained Vision-Language Models (VLMs) as policy backbones, yet it remains unclear what kind of pretr...

Retrieval Algorithms and ANN

Master the approximate nearest neighbor algorithms powering vector search - HNSW, IVF, IVF-PQ, ScaNN, and DiskANN with parameter tuning and recall-latency trade-offs.

Review Queues and Tooling

Building production review interfaces, priority queues, audit trails, reviewer dashboards, and HITL tooling - from Redis-backed queue management to Label Studio integration.

RevieWeaver: Weaving Together Review Insights by Leveraging LLMs and Semantic Similarity.

RevieWeaver: Weaving Together Review Insights by Lev... - published at NAACL 2025.

Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models

We present SemanticQA, an evaluation suite designed to assess language models (LMs) in semantic phrase processing tasks. The benchmark consolidates exis...

Revisiting Gene Ontology Knowledge Discovery with Hierarchical Feature Selection and Virtual Study Group of AI Agents

Large language models have achieved great success in multiple challenging tasks, and their capacity can be further boosted by the emerging agentic AI te...

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms have become central to steering large language models (LLMs) and multi...

RewardFlow: Generate Images by Optimizing What You Reward

We introduce RewardFlow, an inversion-free framework that steers pretrained diffusion and flow-matching models at inference time through multi-reward La...

RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models

Reward models are central to aligning large language models (LLMs) with human preferences. Yet most approaches rely on pointwise reward estimates that o...

RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation.

RichRAG: Crafting Rich Responses for Multi-faceted Q... - published at COLING 2025.

Ring-Zero: Scaling Zero RL to a Trillion Parameters for Emergent Reasoning

Reinforcement learning with verifiable rewards without human-annotated data, often referred to as zero RL, has emerged as a powerful paradigm for elicit...

River-LLM: Large Language Model Seamless Exit Based on KV Share

Large Language Models (LLMs) have demonstrated exceptional performance across diverse domains but are increasingly constrained by high inference latency...

RL for AI Agents - Teaching Models to Act in the World

How RL enables autonomous AI agents: ReAct, tool use, MCTS planning, AlphaCode, SWE-bench, and the emerging agent-RL paradigm powering Claude, GPT-4o, and Gemini.

RL from Human Feedback - How ChatGPT Learned to Be Helpful

The complete RLHF pipeline: supervised fine-tuning, reward model training from human preferences, and PPO fine-tuning - the technique behind InstructGPT, ChatGPT, and Claude.

RL in Production - Where Theory Meets Reality

Engineering challenges of deploying RL: offline RL, reward shaping, safe RL, exploration in production, and real-world case studies from DeepMind, Google, and Netflix.

RLDX-1 Technical Report

While Vision-Language-Action models (VLAs) have shown remarkable progress toward human-like generalist robotic policies through the versatile intelligen...

RLHF and DPO for Open Models

Learn how to align open-source language models with human preferences using RLHF and the simpler, more stable Direct Preference Optimization (DPO) approach with TRL.

RLHF Deep Dive

A complete technical walkthrough of Reinforcement Learning from Human Feedback - the three-phase pipeline, reward models, PPO, KL penalty, and the limitations that led to newer approaches.

RLHF: Reinforcement Learning from Human Feedback

Understand how RLHF aligns LLMs with human preferences through three phases - SFT, reward model training, and PPO - and why it produced InstructGPT's surprising result that smaller aligned models beat larger unaligned ones.

RNNs and the Vanishing Gradient Problem

How recurrent neural networks process sequential data through shared hidden states, and why vanishing gradients cripple their ability to learn long-range dependencies.

RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots

Recent advances in robot learning have accelerated progress toward generalist robots that can perform everyday tasks in human environments. Yet it remai...

RoboDojo: A Unified Sim-and-Real Benchmark for Comprehensive Evaluation of Generalist Robot Manipulation Policies

Generalist robot manipulation policies have advanced rapidly, yet existing benchmarks remain limited in systematically evaluating their capabilities. Ma...

RoboTALES: Learning Reasoning-Guided Robot Policies via Task-Aligned Simulated Futures

Pretrained video generative models are promising backbones for visuomotor control, but their imagined futures often drift from task intent and are not r...

RoboTTT: Context Scaling for Robot Policies

Recent robot foundation models operate with single-step or short-history visuomotor context. We introduce Test-Time-Training Robot Policies (RoboTTT), a...

Robust Reasoning Benchmark

While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their underlying reasoning processes remain highly over...

Robust support vector model based on bounded asymmetric elastic net loss for binary classification

In this paper, we propose a novel bounded asymmetric elastic net ($L_{baen}$) loss function and combine it with the support vector machine (SVM), result...

Robust Unscented Kalman Filtering via Recurrent Meta-Adaptation of Sigma-Point Weights

The Unscented Kalman Filter (UKF) is a ubiquitous tool for nonlinear state estimation; however, its performance is limited by the static parameterizatio...

Robustness of Agentic AI Systems via Adversarially-Aligned Jacobian Regularization

As Large Language Models (LLMs) transition into autonomous multi-agent ecosystems, robust minimax training becomes essential yet remains prone to instab...

Roofline Model and Bottleneck Analysis

Arithmetic intensity, roofline model construction, identifying compute vs memory-bound operations, and using the roofline to guide optimization decisions.

RoPE and ALiBi - Positional Encoding for Long Context

How Rotary Position Embedding encodes relative positions through complex-plane rotations, why ALiBi achieves length extrapolation with linear biases, and why RoPE became the dominant approach for long-context models.

ROSE: An Intent-Centered Evaluation Metric for NL2SQL

Execution Accuracy (EX), the widely used metric for evaluating the effectiveness of Natural Language to SQL (NL2SQL) solutions, is becoming increasingly...

ROSE: Retrieval-Oriented Segmentation Enhancement

Existing segmentation models based on multimodal large language models (MLLMs), such as LISA, often struggle with novel or emerging entities due to thei...

RouteProfile: Elucidating the Design Space of LLM Profiles for Routing

As the large language model (LLM) ecosystem expands, individual models exhibit varying capabilities across queries, benchmarks, and domains, motivating...

Router Mechanisms - How Tokens Get Assigned to Experts

The algorithms that decide which experts process which tokens - linear routing, expert choice, auxiliary load balancing loss, noisy top-k gating, and the Switch Transformer approach.

RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models

Diffusion Transformers (DiT) achieve strong performance in image generation but incur substantial inference costs. While prior work has reduced this cos...

RTSM: Knowledge Distillation with Diverse Signals for Efficient Real-Time Semantic Matching in E-Commerce.

RTSM: Knowledge Distillation with Diverse Signals fo... - published at NAACL 2025.

RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains

Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rub...

RuleChef: Grounding LLM Task Knowledge in Human-Editable Rules

We present RuleChef, a framework that uses large language models (LLMs) to generate executable rules for NLP tasks such as text classification, Named En...

RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution

Humans solve problems by executing targeted plans, yet large language models (LLMs) remain unreliable for structured workflow execution. We propose RunA...

Running Vector Databases in Production

Master monitoring, capacity planning, index building strategy, warm-up, disaster recovery, index versioning, gradual rollout, and cost optimization for production vector database operations.

Runtime Type Checking

Validate data at system boundaries using get_type_hints, isinstance limitations, beartype, typeguard, and Pydantic's runtime validation model, and build a custom runtime validator.

RxBrain: Embodied Cognition Foundation Model with Joint Language-Visual Reasoning and Imagination

Embodied cognition requires agents to connect high-level task reasoning with the physical states to be achieved. We introduce Hy-Embodied-RxBrain, an em...

RynnWorld-4D: 4D Embodied World Models for Robotic Manipulation

Robotic manipulation in the open world requires not only recognizing what a scene looks like, but also anticipating how its 3D structure moves under int...

RynnWorld-Teleop: An Action-Conditioned World Model for Digital Teleoperation

Scaling robot learning requires massive, diverse trajectory data, yet collection is currently bottlenecked by physical teleoperation, where every demons...

SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite the effectiveness, these syste...

Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification.

Safe: Enhancing Mathematical Reasoning in Large Lang... - published at ACL 2025.

SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning

Safety guarantees are a prerequisite to the deployment of reinforcement learning (RL) agents in safety-critical tasks. Often, deployment environments ex...

Safety and Bias Evaluation

Evaluate LLMs for harmful outputs, social bias, hallucination, and jailbreak vulnerability - including red teaming methodology and production monitoring.

Safety and Bias Evaluation

Evaluating open-source models for safety and bias before production deployment - red-teaming, toxicity measurement, demographic bias benchmarks, jailbreak robustness, and building end-to-end safety evaluation pipelines.

Safety and Sandboxing

Safety architecture for computer use agents - threat models, prompt injection, Docker sandboxing, action confirmation gates, logging, and anomaly detection.

SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement

Recursive self-improvement is moving from theory to practice: modern systems can critique, revise, and evaluate their own outputs, yet iterative self-mo...

Saliency Maps for Vision - What Your CNN Is Actually Seeing

Gradient-based saliency, GradCAM, SmoothGrad, Guided Backpropagation, and Integrated Gradients for explaining computer vision models - with practical code and honest limitations.

SAM-MT: Real-Time Interactive Multi-Target Video Segmentation

Modern Video Object Segmentation (VOS) involves tracking and segmenting user-specified targets. While recent approaches have achieved remarkable perform...

SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent

Long-horizon agentic reasoning requires large language models to act over long interaction histories containing thoughts, tool calls, observations, and...

Sample Complexity Bounds for Stochastic Shortest Path with a Generative Model

We study the sample complexity of learning an $ε$-optimal policy in the Stochastic Shortest Path (SSP) problem. We first derive sample complexity bounds...

Sampling from Constrained Gibbs Measures: with Applications to High-Dimensional Bayesian Inference

This paper considers a non-standard problem of generating samples from a low-temperature Gibbs distribution with constrained support, when some o...

Sampling Strategies: Temperature, Top-K, Top-P

Master the sampling algorithms that control LLM output diversity - from greedy decoding to nucleus sampling - and learn when to use each in production.

SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

Real-time streaming video-to-video editing (V2V) is critical for interactive applications such as live broadcasting and gaming, yet it remains a formida...

Sandboxing Agent Environments

Contain the blast radius of any agent failure - process isolation, Docker security hardening, network policy, E2B cloud sandboxes, and escape vector prevention.

SAVGO: Learning State-Action Value Geometry with Cosine Similarity for Continuous Control

While representation and similarity learning have improved the sample efficiency of Reinforcement Learning (RL), they are rarely used to shape policy up...

SAVOIR: Learning Social Savoir-Faire via Shapley-based Reward Attribution

Social intelligence, the ability to navigate complex interpersonal interactions, presents a fundamental challenge for language agents. Training such age...

Scalable Evaluation of the Realism of Synthetic Environmental Augmentations in Images

Evaluation of AI systems often requires synthetic test cases, particularly for rare or safety-critical conditions that are difficult to observe in opera...

Scalable Learning of Multivariate Distributions via Coresets

Efficient and scalable non-parametric or semi-parametric regression analysis and density estimation are of crucial importance to the fields of statistic...

Scalable Visual Pretraining for Language Intelligence

The rapid progress of large foundation models has been driven predominantly by pretraining on large-scale text corpora. However, many forms of knowledge...

Scaling Continual Learning to 300+ Tasks with Bi-Level Routing Mixture-of-Experts

Continual learning, especially class-incremental learning (CIL), on the basis of a pre-trained model (PTM) has garnered substantial research interest in...

Scaling Laws

Empirical power-law relationships between LLM performance and compute, data, and parameters - from Kaplan (2020) to Chinchilla (2022) and beyond.

Scaling Mixture-of-Experts Video Pretraining for Embodied Intelligence

Despite the recent promise in robot control, video generative models suffer from a domain mismatch due to their primary focus on content creation. For e...

Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments

Large-scale commercial search systems optimize for relevance to drive successful sessions that help users find what they are looking for. To maximize re...

Scaling Teams or Scaling Time? Memory Enabled Lifelong Learning in LLM Multi-Agent Systems

Large language model (LLM) multi-agent systems can scale along two distinct dimensions: by increasing the number of agents and by improving through accu...

Scaling Test-Time Compute for Agentic Coding

Test-time scaling has become a powerful way to improve large language models. However, existing methods are best suited to short, bounded outputs that c...

Scaling Vector Databases to Billions of Vectors

Architect horizontal sharding, replication, consistent hashing, hot-cold tiering, distributed HNSW, geographic distribution, and backup strategies for production vector databases at billion-vector scale.

SceneFrom3D: Geometry-Conditioned Outdoor 3D Scene Generation via View Scheduling with Object-Level Control

Geometry-conditioned 3D scene generation enables the creation of 3D environments from user-provided geometry, offering direct control over scene structu...

ScheMatiQ: From Research Question to Structured Data through Interactive Schema Discovery

Many disciplines pose natural-language research questions over large document collections whose answers typically require structured evidence, tradition...

SciClaims: An End-to-End Generative System for Biomedical Claim Analysis.

SciClaims: An End-to-End Generative System for Biome... - published at EMNLP 2025.

Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning

Graphics Program Synthesis is pivotal for interpreting and editing visual data, effectively facilitating the reverse-engineering of static visuals into...

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetect...

Scikit-Learn Pipelines

Build production-grade scikit-learn Pipelines - ColumnTransformer, custom transformers, caching, cross-validation without leakage, hyperparameter search, and model serialization.

SciLT: Long-Tailed Classification in Scientific Image Domains

Long-tailed recognition has benefited from foundation models and fine-tuning paradigms, yet existing studies and benchmarks are mainly confined to natur...

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

Accelerating scientific discovery requires the identification of which experiments would yield the best outcomes before committing resources to costly p...

SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation

Incremental Few-Shot (IFS) segmentation aims to learn new categories over time from only a few annotations. Although widely studied in 2D, it remains un...

SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dep...

SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level reward...

Score-Based Generative Models - Diffusion Through the Lens of Score Matching

How Song and Ermon's score matching framework unifies DDPM and enables stochastic differential equations for continuous-time diffusion - the mathematical theory behind modern diffusion models, from score functions and Langevin dynamics through denoising score matching and the SDE unification.

Script-Agnosticism and its Impact on Language Identification for Dravidian Languages.

Script-Agnosticism and its Impact on Language Identi... - published at NAACL 2025.

sDPO: Don't Use Your Data All at Once.

sDPO: Don't Use Your Data All at Once. - published at COLING 2025.

SEAL: Synergistic Co-Evolution of Agents and Learning Environments

Large Language Model (LLM) agents are increasingly improved through interaction, yet most self-evolution methods adapt either the policy or the learning...

SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages.

SeaLLMs 3: Open Foundation and Chat Multilingual Lar... - published at NAACL 2025.

Search and Retrieval Systems

Redesigning an Elasticsearch-only search system with neural search - from BM25 baseline through dense retrieval, learning to rank, query understanding, and search quality evaluation.

Search Beyond What Can Be Taught: Evolving the Knowledge Boundary in Agentic Visual Generation

Visual generators excel at rendering, but they confidently fabricate what they do not know. User requests are unbounded, evolving, and deeply long-taile...

SearchOS-V1: Towards Robust Open-Domain Information-Seeking Agent Collaboration

Recent advances in Tool-Integrated Large Language Models have made web search a core capability of information-seeking agents. However, as interaction h...

Secrets Management

Manage secrets securely with python-dotenv, Pydantic SecretStr, AWS Secrets Manager, HashiCorp Vault, git-secrets, and production credential rotation strategies.

Secure Coding Patterns

Apply defense in depth, least privilege, CORS, rate limiting, CSP headers, dependency auditing with pip-audit, and static analysis with bandit to harden FastAPI applications.

Securing RAG Systems

Attack surfaces unique to RAG architectures - document poisoning, retrieval hijacking, indirect prompt injection, embedding collision, cross-tenant leakage, and defense-in-depth strategies for production RAG deployments.

SEED: Self-Evolving On-Policy Distillation for Agentic Reinforcement Learning

Large language models are increasingly trained as interactive agents for long-horizon tasks involving multi-turn interaction, tool use, and environment...

Seedance 2.0: Advancing Video Generation for World Complexity

Seedance 2.0 is a new native multi-modal audio-video generation model, officially released in China in early February 2026. Compared with its predecesso...

Seeing Fast and Slow: Learning the Flow of Time in Videos

How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to mo...

Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are in...

Seeing the Needle in the Haystack: Towards Weakly-Supervised Log Instance Anomaly Localization via Counterfactual Perturbation

Log anomaly detection is a critical task for system operations and security assurance. However, in networked systems at scale, log data are generated at...

Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions

We address the problem of tactile localization, where the goal is to identify image regions that share the same material properties as a tactile input....

SELDON: Supernova Explosions Learned by Deep ODE Networks

The discovery rate of optical transients will explode to 10 million public alerts per night once the Vera C. Rubin Observatory's Legacy Survey of Space...

Selecting GPUs for Training vs Inference

H100 vs A100 vs L40S vs RTX 4090 vs A10G - a practical decision framework for matching GPU specifications to training and inference workload requirements.

Selecting Target Modules and Rank

Which layers to apply LoRA to and what rank to use - two of the most impactful fine-tuning decisions. Covers attention vs FFN targeting, rank selection from r=4 to r=64, RSLoRA, DoRA, LoRA+, and ablation strategies.

Self in Space: Benchmarking Self-Awareness and Spatial Cognition in UAV Embodied Intelligence

Autonomous UAV systems increasingly rely on multimodal large language models (MLLMs) to operate in complex real-world environments. Such embodied scenar...

Self-Adversarial One Step Generation via Condition Shifting

The push for efficient text to image synthesis has moved the field toward one step sampling, yet existing methods still face a three way tradeoff among...

Self-Attention Mechanism

How self-attention computes query, key, and value interactions to capture long-range dependencies between tokens.

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Current post-training methods in verifiable settings fall into two categories. Reinforcement learning (RLVR) relies on binary rewards, which are broadly...

Self-Distilled Agentic Reinforcement Learning

Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse...

Self-Distilled RLVR

On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide...

Self-Evolving LLM Memory Extraction Across Heterogeneous Tasks

As LLM-based assistants become persistent and personalized, they must extract and retain useful information from past conversations as memory. However,...

Self-Execution Simulation Improves Coding Models

A promising research direction in enabling LLMs to generate consistently correct code involves addressing their inability to properly estimate program e...

Self-Guided Test-Time Training for Long-Context LLMs

Long-context processing has become increasingly important for large language models (LLMs), but simply extending the context window does not guarantee e...

Self-Improvements in Modern Agentic Systems: A Survey

Self-improving autonomous agents are moving from research prototypes to deployed systems. The primary goal is controllable evolution, or adaptation, fro...

Self-Improving Language Models with Bidirectional Evolutionary Search

Search has been proposed as an effective method for self-improving language models and agentic systems, both for post-training sample generation and for...

Self-Instruct

How the Self-Instruct paper bootstrapped instruction-following datasets from a tiny seed set using GPT-3, enabling the Alpaca era of aligned models - and how to implement it today.

Self-Service ML Platform

Build ML platforms that data scientists actually use - applying product thinking to internal tooling, from user research and notebook-to-production workflows to adoption metrics and guardrails.

Self-Sovereign Agent

We investigate the emerging prospect of self-sovereign agents -- AI systems that can economically sustain and extend their own operation without human i...

Sema Code: Decoupling AI Coding Agents into Programmable, Embeddable Infrastructure

AI coding agents have become central to developer workflows, yet every existing solution locks its reasoning capabilities within a specific delivery for...

SemaClaw: A Step Towards General-Purpose Personal AI Agents through Harness Engineering

The rise of OpenClaw in early 2026 marks the moment when millions of users began deploying personal AI agents into their daily lives, delegating tasks r...

Semantic Caching

Return cached LLM responses for semantically similar queries using embedding-based vector similarity. Cut costs 40–60% by never paying for the same question twice regardless of how it is phrased.

Semantic Memory and Knowledge Graphs

Structured world knowledge for agents: building and querying knowledge graphs with entity extraction, relationship traversal, and hybrid vector+graph retrieval.

Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance

This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern V...

Semantic Segmentation

Pixel-wise classification with FCN, U-Net, DeepLab atrous convolutions, encoder-decoder architectures, instance segmentation with Mask R-CNN, and full PyTorch U-Net implementation.

Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models

Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks. However, the truthfulness of their outputs is not guarantee...

Semantic Versioning - The Contract Behind Every Version Number

Master Semantic Versioning at engineering depth - MAJOR.MINOR.PATCH definitions, breaking change classification, Python version specifiers, pre-release ordering, CalVer, changelog discipline, and Git tagging for releases.

Semantics-Aware Caching for Concept Learning

Concept learning is a form of supervised machine learning that operates on knowledge bases in description logics. State-of-the-art concept learners ofte...

SemBridge: Language Transfer in Sparse Encoders via Multilingual Semantic Bridges

Sparse encoders offer high-precision retrieval by representing term importance within a vocabulary space, yet their English-centric structures pose a cr...

Semi-automatic Sequential Sentence Classification in the Discourse Analysis Tool Suite.

Semi-automatic Sequential Sentence Classification in... - published at NAACL 2025.

Semi-Supervised Generative Learning via Latent Space Distribution Matching

We introduce Latent Space Distribution Matching (LSDM), a novel framework for semi-supervised generative modeling of conditional distributions. LSDM ope...

SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching

Diffusion models achieve state-of-the-art video generation quality, but their inference remains expensive due to the large number of sequential denoisin...

Sens-Merging: Sensitivity-Guided Parameter Balancing for Merging Large Language Models.

Sens-Merging: Sensitivity-Guided Parameter Balancing... - published at ACL 2025.

Sentiment Analysis of German Sign Language Fairy Tales

We present a dataset and a model for sentiment analysis of German sign language (DGS) fairy tales. First, we perform sentiment analysis for three levels...

Seq2Seq and Encoder-Decoder Architectures

How encoder-decoder networks with attention solve variable-length sequence-to-sequence problems - from machine translation to summarization and code generation.

Sequential Inference for Gaussian Processes: A Signal Processing Perspective

The proliferation of capable and efficient machine learning (ML) models marks one of the strongest methodological shifts in signal processing (SP) in it...

Serialization and Data Formats

Master serialization formats for ML systems - Protocol Buffers, Apache Arrow, safetensors, Parquet, HDF5, MessagePack, and pickle - with performance benchmarks, security considerations, and schema evolution strategies.

Service Mesh and Load Balancing

Master service mesh architecture and load balancing for ML serving - Istio, Envoy, traffic management, mTLS, canary deployments, circuit breaking, and Kubernetes networking for production AI systems.

Service Mesh for ML Serving

Use Istio service mesh to manage traffic routing across multiple ML model versions - canary deployments, A/B testing, circuit breakers, and telemetry.

Serving Architectures: REST vs gRPC vs WebSocket

How to choose the right serving protocol for ML models - REST, gRPC, and WebSocket compared across latency, throughput, streaming, and operational complexity.

Sessa: Selective State Space Attention

Modern sequence modeling is dominated by two families: Transformers, whose self-attention can access arbitrary elements of the visible sequence, and str...

SEVerA: Verified Synthesis of Self-Evolving Agents

Recent advances have shown the effectiveness of self-evolving LLM agents on tasks such as program repair and scientific discovery. In this paradigm, a p...

Shadow Deployment for Safe Model Releases

How to validate new ML models on real production traffic without affecting users - traffic mirroring, prediction comparison, and graduation criteria.

Shadow Mode Testing

Run new ML models against live production traffic without affecting users - catching silent failures, latency regressions, and behavioral differences before go-live.

ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning

Parameter-efficient fine-tuning (PEFT) reduces the training cost of full-parameter fine-tuning for large language models (LLMs) by training only a small...

SHAP Values - The Unified Theory of Feature Importance

Shapley values from cooperative game theory provide the unique attribution of feature contributions to a model's prediction that satisfies the efficiency, symmetry, dummy, and additivity axioms - and SHAP makes them computationally tractable.

Share More, Search Less: Collaborative Parallel Thinking for Efficient Test-Time Scaling

Test-Time Scaling (TTS) enhances the reasoning capabilities of large language models by allocating additional inference compute to explore the solution...

SHARE: Social-Humanities AI for Research and Education

This intermediate technical report introduces the SHARE family of base models and the MIRROR user interface. The SHARE models are the first causal langu...

Sharp Convergence Rates for Masked Diffusion Models

Discrete diffusion models have achieved strong empirical performance in text and other symbolic domains, with masked (absorbing-rate) variants emerging...

Sharp description of local minima in the loss landscape of high-dimensional two-layer ReLU neural networks

We study the population loss landscape of two-layer ReLU networks of the form $\sum_{k=1}^K \mathrm{ReLU}(w_k^\top x)$ in a realisable teacher-student s...

Shell Scripting for ML Workflows

Bash scripting for ML engineers - automating training launches, multi-node coordination, GPU monitoring, checkpoint management, parallel data downloads, and writing robust production-grade shell scripts.

ShortOPD: Recovering Pruned LLMs with Short-to-Long On-Policy Distillation

Structured pruning is a hardware-friendly way to compress LLMs, but it is mostly validated on multiple-choice recognition tasks, while the same compress...

SiamJEPA: On the Role of Siamese Student Encoders in JEPA

Recently, Joint Embedding Predictive Architectures (JEPAs) have attracted significant attention in the computer vision and machine learning communities...

SIEVE: Structure-Aware Data Selection for Imitation Learning with VLA Models

Vision-Language-Action (VLA) models are typically trained by imitation learning on large-scale robot demonstration datasets, but more data does not nece...

Signals and IPC for ML

Unix signals, graceful shutdown patterns, shared memory, pipes, Unix domain sockets, and ZeroMQ for building reliable multi-process ML training and serving systems.

Significance and Stability Analysis of Gene-Environment Interaction using RGxEStat

Genotype-by-Environment (GxE) interactions influence the performance of genotypes across diverse environments, reducing the predictability of phenotypes...

Sim-to-Real Transfer for Muscle-Actuated Robots via Generalized Actuator Networks

Tendon drives paired with soft muscle actuation enable faster and safer robots while potentially accelerating skill acquisition. Still, these systems ar...

SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

Robotic manipulation with deformable objects represents a data-intensive regime in embodied learning, where shape, contact, and topology co-evolve in wa...

SIMD and Vectorization

Learn how SIMD instruction sets (SSE, AVX2, AVX-512) enable CPUs to process 4 to 16 single-precision floating-point values per instruction, why NumPy and PyTorch use them by default, and how to write code that compilers can auto-vectorize.

SimpliHuMoN: Simplifying Human Motion Prediction

Human motion prediction combines the tasks of trajectory forecasting and human pose prediction. For each of the two tasks, specialized models have been...

SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking

Mobile GUI agents powered by large language models have progressed rapidly, creating urgent needs for realistic and comprehensive evaluation. Existing b...

Single-Rollout Asynchronous Optimization for Agentic Reinforcement Learning

Reinforcement learning (RL) is becoming increasingly important for post-training large language models (LLMs). Previous RL pipelines for LLMs were mostl...

Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation

Data attribution and valuation are critical for understanding data-model synergy for Large Language Models (LLMs), yet existing gradient-based methods s...

Skill Reuse as Compression in Agentic RL

Large language model agents trained with reinforcement learning (RL) often learn brittle, task-specific shortcuts. We hypothesize that agents generalize...

Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning

Equipping large language models with explicit skills has emerged as a promising paradigm for enabling autonomous agents to solve complex tasks. Agent sk...

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

A persistent skill library allows language model agents to reuse successful strategies across tasks. Maintaining such a library requires three coupled c...

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

Large language model (LLM) agents such as OpenClaw rely on reusable skills to perform complex tasks, yet these skills remain largely static after deploy...

SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience c...

SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

As the capability frontier of autonomous agents continues to expand, they are increasingly able to complete specialized tasks through plug-and-play exte...

SkillGrad: Optimizing Agent Skills Like Gradient Descent

Agent skills provide a lightweight way to adapt LLM agents to specialized domains by storing reusable procedural knowledge in structured files. However,...

SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks

Skills have become the de facto way to enable LLM agents to perform complex real-world tasks with customized instructions, workflows, and tools, but how...

SkillOpt-Lite: Better and Faster Agent Self-evolution via One Line of Vibe

While skill optimization for autonomous agents has gained traction, existing methods rely on complex pipelines. This leaves a fundamental question unadd...

SkillOS: Learning Skill Curation for Self-Evolving Agents

LLM-based agents are increasingly deployed to handle streaming tasks, yet they often remain one-off problem solvers that fail to learn from past interac...

Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO

We introduce Skills-Coach, a novel automated framework designed to significantly enhance the self-evolution of skills within Large Language Model (LLM)-...

SkillX: Automatically Constructing Skill Knowledge Bases for Agents

Learning from experience is critical for building capable large language model (LLM) agents, yet prevailing self-evolving paradigms remain inefficient:...

SkVM: Compiling Skills for Efficient Execution Everywhere

LLM agents increasingly adopt skills as a reusable unit of composition. While skills are shared across diverse agent platforms, current systems treat th...

SlackAgents: Scalable Collaboration of AI Agents in Workspaces

SlackAgents: Scalable Collaboration of AI Agents in... - published at EMNLP 2025.

SLERP - Spherical Linear Interpolation

How spherical linear interpolation provides smoother, geometrically correct blending between two model weight configurations than simple linear averaging.

Small Vision-Language Models are Smart Compressors for Long Video Understanding

Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets an...

SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visually appealin...

Smarter and Cheaper at Once: Byte-Exact KV-Cache Grafting Turns a Frozen Small Model into a Verified-Knowledge Flywheel

We report a way to make a frozen small language model both more capable and dramatically cheaper at once, without changing any weights. Verified knowled...

SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing

Traditional photographic image editing typically requires users to possess sufficient aesthetic understanding to provide appropriate instructions for ad...

Snowflake for ML

Snowflake architecture, Snowpark, and ML feature serving from Snowflake.

Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration

We study series-level cinematic remaking, a long-horizon video-to-video generation problem that localizes full episodes or films via stylization or acto...

SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs

SoftCoT: Soft Chain-of-Thought for Efficient Reasoni... - published at ACL 2025.

SOLID Principles in Python - Engineering Patterns for Maintainable Code

Master all five SOLID principles with Python-specific implementations - SRP with module-level decomposition, OCP with typing.Protocol, LSP violations and their consequences, ISP with small focused ABCs, and DIP with constructor injection. Production code examples for each.

Solving Physics Olympiad via Reinforcement Learning on Physics Simulators

We have witnessed remarkable advances in LLM reasoning capabilities with the advent of DeepSeek-R1. However, much of this progress has been fueled by th...

Sorting and Search for ML

Sorting algorithms and search techniques for ML engineers - from timsort internals and top-k selection to binary search for hyperparameter tuning, FAISS IVF indexes, and beam search with priority queues.

SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the wor...

SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. How...

Spark for ML Pipelines

Building production ML feature pipelines with PySpark - window functions, Pandas UDFs, MLlib Pipelines, point-in-time joins, and Delta Lake integration.

Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

In recent years, open-source efforts like Senorita-2M have propelled video editing toward natural language instruction. However, current publicly availa...

Sparse Delta Memory: Scaling the State of Linear RNNs through Sparsity

Linear attention models allow a fixed state size and a fixed amount of compute per token. However, due to their limited state size, linear attention mod...

Sparse vs Dense Models - Trade-offs

Why MoE gives more capacity per FLOP than dense models, the memory vs. compute trade-off, training efficiency, inference complexity, and when to choose each architecture.

SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation

Large language models are increasingly deployed in multi-turn settings such as tutoring, support, and counseling, where reliability depends on preservin...

Spatial Competence Benchmark

Spatial competence is the quality of maintaining a consistent internal representation of an environment and using it to infer discrete structure and pla...

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

While spatial foundation models have demonstrated impressive performance on standard datasets, a critical question remains: are they truly all-round pla...

SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

Image spatial editing performs geometry-driven transformations, allowing precise control over object layout and camera viewpoints. Current models are in...

SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments

Spatial reasoning over three-dimensional scenes is a core capability for embodied intelligence, yet continuous model improvement remains bottlenecked by...

SPEAR: A Simulator for Photorealistic Embodied AI Research

Interactive simulators have become powerful tools for training embodied agents and generating synthetic visual data, but existing photorealistic simulat...

Spec Kit Agents: Context-Grounded Agentic Workflows

Spec-driven development (SDD) with AI coding agents provides a structured workflow, but agents often remain 'context blind' in large, evolving repositor...

Specialized Inference Hardware

Compare AWS Inferentia/Trainium, NVIDIA L4/L40S, edge inference hardware (Jetson, Apple Neural Engine), hardware-specific quantization, and cost-performance tradeoffs for production AI inference.

Spectral Alignment in Forward-Backward Representations via Temporal Abstraction

Forward-backward (FB) representations provide a powerful framework for learning the successor representation (SR) in continuous spaces by enforcing a lo...

Spectral Rewiring for Exploration, Purification, and Model Merging

Reinforcement learning has become a standard post-training recipe for large language models, but dense full-parameter updates create two deployment-rele...

Speculative Decoding

How speculative decoding uses a small draft model to generate candidate tokens verified by the large target model in a single forward pass, achieving 2-3x inference speedups without changing output distribution.

Speculative Decoding

Learn how speculative decoding uses a small draft model to generate tokens that a large target model verifies in parallel, achieving 2-3x speedup with no quality loss.

Speculative Decoding for Autoregressive Video Generation

Autoregressive video diffusion is emerging as a promising paradigm for streaming video synthesis, with step distillation serving as the primary means of...

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimiz...

SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion

Although multi-modal learning has advanced point cloud completion, the theoretical mechanisms remain unclear. Recent works attribute success to the conn...

SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding

Large Audio-Language Models (ALMs) have recently demonstrated remarkable capabilities in holistic audio understanding, yet they remain unreliable for te...

SPPCSO: Adaptive Penalized Estimation Method for High-Dimensional Correlated Data

With the rise of high-dimensional correlated data, multicollinearity poses a significant challenge to model stability, often leading to unstable estimat...

SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard tok...

SPRITE: From Static Mockups to Engine-Ready Game UI

Game UI implementation requires translating stylized mockups into interactive engine entities. However, current 'Screenshot-to-Code' tools often struggl...

Spurious Predictability in Financial Machine Learning

Adaptive specification search generates statistically significant backtests even under martingale-difference nulls. We introduce a falsification audit t...

SQL at Scale for ML Feature Engineering

Writing production SQL for 10-billion-row datasets - partition pruning, window functions, approximate aggregates, BigQuery optimization, DuckDB, and Spark SQL patterns for ML feature preparation.

SQL Injection Prevention

Prevent SQL injection through parameterized queries, SQLAlchemy best practices, ORM safety limits, raw SQL auditing, and defense against UNION, blind, and second-order injection.

Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents

Coding agents repeatedly consume long tool observations even though only a small fraction of each observation matters for the next step. We study task-c...

Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instructions. Exis...

Stable and Steerable Sparse Autoencoders with Weight Regularization

Sparse autoencoders (SAEs) are widely used to extract human-interpretable features from neural network activations, but their learned features can vary...

StableI2I: Spotting Unintended Changes in Image-to-Image Transition

In most real-world image-to-image (I2I) scenarios, existing evaluations primarily focus on instruction following and the perceptual quality or aesthetic...

Stacking and Blending

Master stacking and blending ensemble techniques - out-of-fold meta-learning, data leakage prevention, model diversity, snapshot ensembling, temporal ensembling, Kaggle competition patterns, and production deployment tradeoffs.

STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generat...

Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints

The rise of autonomous AI agents suggests that dynamic benchmark environments with built-in feedback on scientifically grounded tasks are needed to eval...

State estimations and noise identifications with intermittent corrupted observations via Bayesian variational inference

This paper focuses on the state estimation problem in distributed sensor networks, where intermittent packet dropouts, corrupted observations, and unkno...

State Space Model Foundations

How control theory's state space models became a competitive sequence modeling architecture - continuous-time SSMs, the S4 paper, HiPPO initialization, and the convolutional/recurrent duality.

StateSMix: Online Lossless Compression via Mamba State Space Models and Sparse N-gram Context Mixing

We present StateSMix, a fully self-contained lossless compressor that couples an online-trained Mamba-style State Space Model (SSM) with sparse n-gram c...

Static Analysis and Type Systems

Build type-safe ML codebases using Python type hints, mypy strict mode, pydantic v2 validation, Protocol types, jaxtyping tensor shape annotations, and ruff for fast linting.

Static Analysis in Practice

Configure mypy and pyright for strict mode, gradual typing adoption, type stubs, py.typed markers, CI integration, and strategies for migrating untyped codebases.

Statistical Foundations for A/B Testing

Learn the statistical machinery behind A/B testing - null hypotheses, p-values, power, sample size calculation, and the mistakes that invalidate ML experiments.

Statistical Inference for Score Decompositions

We introduce inference methods for score decompositions, which partition scoring functions for predictive assessment into three interpretable components...

Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

Unified multimodal models are envisioned to bridge the gap between understanding and generation. Yet, to achieve competitive performance, state-of-the-a...

Step-level Optimization for Efficient Computer-use Agents

Computer-use agents provide a promising path toward general software automation because they can interact directly with arbitrary graphical user interfa...

Stochastic and Mini-Batch Gradient Descent

Master SGD and mini-batch gradient descent - gradient noise as implicit regularization, a convergence proof sketch with a decreasing learning rate, batch size vs generalization, the linear scaling rule, cyclic learning rates, full PyTorch DataLoader training, and distributed SGD.

Storage Formats for ML Training Data

Why Parquet, Avro, ORC, and Delta Lake exist, how columnar storage enables fast ML pipelines, and how to tune storage formats for maximum throughput and minimum cost.

Storage Hierarchy: SSD and NVMe

Deep dive into SSD and NVMe storage architecture for ML workloads - NAND flash physics, NVMe protocol, io_uring async I/O, memory-mapped datasets, and designing storage systems for large-scale training.

Storage IO for Training Pipelines

How storage IO bottlenecks GPU utilization in ML training, NVMe and distributed filesystem characteristics, data loading patterns with WebDataset and DALI, prefetching strategies, and designing checkpointing that does not stall your cluster.

Strait: Perceiving Priority and Interference in ML Inference Serving

Machine learning (ML) inference serving systems host deep neural network (DNN) models and schedule incoming inference requests across deployed GPUs. How...

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult because...

Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play

Games offer a compelling paradigm for developing general reasoning capabilities in language models, as they naturally demand strategic planning, probabi...

Stream Processing for ML Systems

Continuous feature computation on unbounded data streams using Apache Flink - windowing, watermarks, state management, and production ML feature pipelines.

Stream Processing Patterns for ML Pipelines

Seven production design patterns for streaming ML pipelines - stream enrichment, stream-stream joins, CDC to feature store, streaming inference, feedback loops, and exactly-once end-to-end.

Stream Processing with Kafka for Real-Time ML

How Apache Kafka and Flink enable real-time ML features - topics, consumer groups, exactly-once semantics, streaming feature computation, and architecture patterns.

Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

Distillation-based acceleration has become foundational for making autoregressive streaming video diffusion models practical, with distribution matching...

Stream-T1: Test-Time Scaling for Streaming Video Generation

While Test-Time Scaling (TTS) offers a promising direction to enhance video generation without the surging costs of training, current test-time video ge...

Stream-to-Feature Pipelines

Computing features from event streams with Kafka and Flink.

STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media

Large language models for vertical domains are bottlenecked by the scarcity of complex, domain-specific task-oriented dialogues. Existing data acquisiti...

Streaming Concepts - Why Batch Fails for Real-Time ML

The fundamental theory of stream processing - event time, processing time, watermarks, windowing, delivery semantics, and backpressure - through the lens of ML systems that cannot afford batch latency.

Streaming Inference

Running ML inference on data streams - Kafka integration, Flink ML, stateful stream processing, windowed feature aggregations, exactly-once inference, and time semantics.

Streaming Multiprocessors

The SM is the fundamental execution unit of every NVIDIA GPU - warp schedulers, register files, shared memory, occupancy, and how thread block configuration determines performance.

Streaming Pipeline Reliability for ML Systems

How to build streaming ML pipelines that survive failures, handle schema changes, implement dead letter queues, replay events, and monitor themselves - so your fraud model never runs on 3-hour-old features again.

Streaming Responses

Implementing and optimizing streaming for real-time LLM response delivery - SSE, chunking strategies, backpressure, tool use streaming, and production patterns for perceived performance.

Streaming Structured Inference with Flash-SemiCRF

Semi-Markov Conditional Random Fields (semi-CRFs) assign labels to segments of a sequence rather than to individual positions, enabling exact inference...

Streaming UX for LLMs

Server-sent events, streaming tokens, TTFT optimization, and building responsive AI chat interfaces that feel instant even under production load.

Strips as Tokens: Artist Mesh Generation with Native UV Segmentation

Recent advancements in autoregressive transformers have demonstrated remarkable potential for generating artist-quality meshes. However, the token order...

Structural Graph Probing of Vision-Language Models

Vision-language models (VLMs) achieve strong multimodal performance, yet how computation is organized across populations of neurons remains poorly under...

Structural interpretability in SVMs with truncated orthogonal polynomial kernels

We study post-training interpretability for Support Vector Machines (SVMs) built from truncated orthogonal polynomial kernels. Since the associated repr...

Structure-Preserving Multi-View Embedding Using Gromov-Wasserstein Optimal Transport

Multi-view data analysis seeks to integrate multiple representations of the same samples in order to recover a coherent low-dimensional structure. Class...

Structured Causal Video Reasoning via Multi-Objective Alignment

Human understanding of video dynamics is typically grounded in a structured mental representation of entities, actions, and temporal relations, rather t...

Structured Concurrency with TaskGroup

Master asyncio.TaskGroup for safe concurrent execution, understand why gather() leaks tasks, handle ExceptionGroups, and implement the nursery pattern.

Structured Distillation of Web Agent Capabilities Enables Generalization

Frontier LLMs can navigate complex websites, but their cost and reliance on third-party APIs make local deployment impractical. We introduce Agent-as-An...

Structured Generation in Production

Production-grade architecture for structured generation pipelines - reliability stacks, schema versioning, monitoring, async batching, caching, edge case handling, and complete reference implementations.

Structured Output and JSON Mode

Reliably extract structured data from LLMs using JSON mode, function calling, Pydantic validation, and constrained decoding - the backbone of production LLM pipelines.

Structured Pruning

Remove entire attention heads, MLP neurons, and transformer layers to achieve real hardware latency improvements - with production-grade code for Taylor importance, angular distance layer scoring, iterative recovery, and combined compression pipelines.

Structured Tender Entities Extraction from Complex Tables with Few-short Learning.

Structured Tender Entities Extraction from Complex T... - published at COLING 2025.

Student Performance Prediction

Learn how to build early warning systems for at-risk students, predict dropout and grades, audit for fairness, and design interventions using ML on LMS engagement data.

StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition

Creative face stylization aims to render portraits in diverse visual idioms such as cartoons, sketches, and paintings while retaining recognizable ident...

Stylistic-STORM (ST-STORM): Perceiving the Semantic Nature of Appearance

One of the dominant paradigms in self-supervised learning (SSL), illustrated by MoCo or DINO, aims to produce robust representations by capturing featur...

SUFLECA: Scaling Up Feature Learning for CAD-to-image Alignment

CAD-to-image alignment aims to estimate an object's 9D pose (rotation, translation, and anisotropic scale) from a single RGB image, enabling application...

SuperLocalMemory V3.3: The Living Brain -- Biologically-Inspired Forgetting, Cognitive Quantization, and Multi-Channel Retrieval for Zero-LLM Agent Memory Systems

AI coding agents operate in a paradox: they possess vast parametric knowledge yet cannot remember a conversation from an hour ago. Existing memory syste...

Supervised Fine-Tuning

Learn how to adapt pretrained LLMs to specific tasks through supervised fine-tuning - data preparation, hyperparameters, catastrophic forgetting, and evaluation.

Supply Chain AI

Lead time prediction, supplier risk scoring, demand sensing, disruption detection, route optimization, and the ML systems that build resilient and efficient retail supply chains.

Supply Chain Optimization with AI

Learn how AI transforms supply chain management through probabilistic demand forecasting, supplier risk scoring, inventory optimization, disruption detection, and vehicle routing.

SurvHTE-Bench: A Benchmark for Heterogeneous Treatment Effect Estimation in Survival Analysis

Estimating heterogeneous treatment effects (HTEs) from right-censored survival data is critical in high-stakes applications such as precision medicine a...

SVGS: Enhancing Gaussian Splatting Using Primitives with Spatially Varying Colors

Gaussian Splatting demonstrates impressive results in multi-view reconstruction based on Gaussian explicit representations. However, the current Gaussia...

SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

Zero-shot text-to-speech (TTS) has improved substantially for single-speaker synthesis, yet expressive long-form multi-speaker dialogue remains difficul...

SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context

Prior representative ReAct-style approaches in autonomous Software Engineering (SWE) typically lack the explicit System-2 reasoning required for deep an...

SWE-bench and Evaluation

How to evaluate coding agents: SWE-bench, SWE-bench Verified, SOTA numbers, failure modes, and building your own evaluation harness.

SWE-bench Verified

SWE-bench Verified is the gold standard for evaluating coding agents on real GitHub issues. Learn the evaluation methodology, Docker harness, failure mode taxonomy, and how to interpret benchmark scores.

SWE-chat: Coding Agent Interactions From Real Users in the Wild

AI coding agents are being adopted at scale, yet we lack empirical evidence on how people actually use them and how much of their output is useful in pr...

SWE-Review: Closing the Loop on Issue Resolution with Agentic Code Review

Coding agents increasingly generate pull requests (PRs) for real-world software issues, yet one-shot PR generation remains open-loop: the PR is proposed...

SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

The emergence of 'vibe coding' platforms, where users describe applications in natural language and AI agents autonomously generate full-stack software,...

SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the i...

Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

Vision-Language Models (VLMs) have shown remarkable capabilities in joint vision-language understanding, but their large scale poses significant challen...

Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility

AI agents that interact with their environments through tools enable powerful applications, but in high-stakes business settings, unintended actions can...

SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment

Language models excel at diagnostic assessments on curated medical case-studies and vignettes, performing on par with, or better than, clinical profess...

Synchronous vs Asynchronous Inference

When to use synchronous versus asynchronous inference patterns for ML systems - queue architectures, streaming, timeout handling, and production trade-offs.

SynthDocBench: Controlled Benchmark for Long-Context Visual Document Understanding

Vision language models (VLMs) have achieved strong performance on visual document understanding benchmarks such as DocVQA, ChartQA, and MMLongBench-Doc....

Synthetic Computers at Scale for Long-Horizon Productivity Simulation

Realistic long-horizon productivity work is strongly conditioned on user-specific computer environments, where much of the work context is stored and or...

Synthetic Data and Self-Improvement

Generating high-quality synthetic training data with LLMs using Evol-Instruct, Self-Instruct, Constitutional AI, rejection sampling, and self-play techniques to build data flywheels without expensive human annotation.

Synthetic Data for RAG

Generating question-answer pairs, evaluation datasets, and retrieval test cases from documents to build, evaluate, and systematically improve RAG systems.

Synthetic data in cryptocurrencies using generative models

Data plays a fundamental role in consolidating markets, services, and products in the digital financial ecosystem. However, the use of real data, especi...

Synthetic Monitoring Environments for Reinforcement Learning

Reinforcement Learning (RL) lacks benchmarks that enable precise, white-box diagnostics of agent behavior. Current environments often entangle complexit...

Synthetic Sandbox for Training Machine Learning Engineering Agents

As large language model agents advance beyond software engineering (SWE) tasks toward machine learning engineering (MLE), verifying agent behavior becom...

sys and inspect - Runtime Introspection at Engineering Depth

Master the sys and inspect modules at engineering depth - sys.argv, sys.path, sys.modules cache, sys.settrace, sys._getframe, inspect.signature with all parameter kinds, inspect.getsource, inspect.stack, and how FastAPI, pytest, and click use these modules to build their core features.

System Calls and Linux API

Learn how Linux system calls underpin every ML workload - from dataset loading with mmap to epoll-based inference servers, seccomp sandboxing, and io_uring async I/O.

System Prompts and Context Design

Master the architecture of LLM conversations - how to design system prompts, manage context windows, and build production-grade context management systems.

System Prompts and Personas

Design production-grade system prompts and AI personas - the 6-component anatomy, dynamic context injection, behavioral constraints, tone configuration, and persona stability testing.

t-SNE and UMAP

Non-linear dimensionality reduction with t-SNE and UMAP - crowding problem, KL divergence optimization, perplexity, Barnes-Hut approximation, UMAP topological foundations, and production-safe usage.

T^2PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

Recent progress in multi-turn reinforcement learning (RL) has significantly improved reasoning LLMs' performances on complex interactive tasks. Despite...

TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

Foundation models have established unified representations for natural language processing, yet this paradigm remains largely unexplored for tabular dat...

TableCoder: Table Extraction from Text via Reliable Code Generation.

TableCoder: Table Extraction from Text via Reliable... - published at ACL 2025.

Tadabur: A Large-Scale Quran Audio Dataset

Despite growing interest in Quranic data research, existing Quran datasets remain limited in both scale and diversity. To address this gap, we present T...

TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction

Accurate 3D human keypoints localization is a critical technology enabling robots to achieve natural and safe physical interaction with users. Conventio...

Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations.

Take Out Your Calculators: Estimating the Real Diffi... - published at ACL 2026.

Takeuchi's Information Criteria as Generalization Measures for DNNs Close to NTK Regime

Generalization measures have been studied extensively in the machine learning community to better characterize generalization gaps. However, establishin...

Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation

Modern optimizers like Adam and Muon are central to training large language models, but their reliance on first- and second-order momenta introduces sig...

Target Policy Optimization

In RL, given a prompt, we sample a group of completions from a model and score them. Two questions follow: which completions should gain probability mas...

Target-Oriented Pretraining Data Selection via Neuron-Activated Graph

Everyday tasks come with a target, and pretraining models around this target is what turns them into experts. In this paper, we study target-oriented la...

Task-Focused Memorization for Multimodal Agents

Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, c...

Task-Specific Evaluation Design

Building evaluation suites tailored to your production use case - test set curation, annotation, metric selection, LLM-as-judge, and automated scoring pipelines that actually predict deployment quality.

TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders

We propose TC-AE, a ViT-based architecture for deep compression autoencoders. Existing methods commonly increase the channel number of latent representa...

TCDA: Thread-Constrained Discourse-Aware Modeling for Conversational Sentiment Quadruple Analysis

Conversational Aspect-based Sentiment Quadruple Analysis (DiaASQ) needs to capture the complex interrelationships in multiple rounds of dialogues. Exist...

TCP/IP Fundamentals for ML

Master the networking layer that underpins every distributed training run and ML serving system - from TCP handshakes to jumbo frames and congestion control algorithms used in modern GPU clusters.

TDD Principles - Write the Test First, Let Failure Guide the Design

Master Test-Driven Development at engineering depth - the Red-Green-Refactor cycle, the three laws of TDD, worked BankAccount example, test naming, the test pyramid, London vs Detroit schools, and when TDD surfaces design problems before production.

Teaching LLMs a Low-Resource Language: Enhancing Code Completion in Pharo

Large Language Models (LLMs) unlocked new possibilities in automated code writing, becoming the backbone of most code completion tools. While LLMs excel...

Technical Debt in ML Systems

The seven categories of hidden technical debt unique to machine learning systems - entanglement, hidden feedback loops, pipeline jungles, configuration debt, and how to detect and remediate them.

TelAgentBench: A Multi-faceted Benchmark for Evaluating LLM-based Agents in Telecommunications.

TelAgentBench: A Multi-faceted Benchmark for Evaluat... - published at EMNLP 2025.

Tell Me What To Learn: Generalizing Neural Memory to be Controllable in Natural Language

Modern machine learning models are deployed in diverse, non-stationary environments where they must continually adapt to new tasks and evolving knowledg...

TEMPO: Scaling Test-time Training for Large Reasoning Models

Test-time training (TTT) adapts model parameters on unlabeled test instances during inference time, which continuously extends capabilities beyond the r...

Temporal Convolutional Networks (TCNs)

Master Temporal Convolutional Networks - causal and dilated convolutions, receptive field math, residual blocks, and when TCNs outperform LSTMs and Transformers in production sequence modeling.

Temporal Data Requirement for Predicting Unplanned Hospital Readmissions

With the proliferation of Electronic Health Records (EHRs), a critical challenge in building predictive models is determining the optimal historical dat...

Temporal Features for Real-Time ML

Engineering time-based features for real-time ML - recency-weighted features, session features, sliding window aggregations, point-in-time joins, temporal leakage prevention, and clock skew in distributed systems.

Temporally Extended Mixture-of-Experts Models

Mixture-of-Experts models, now popular for scaling capacity at fixed inference speed, switch experts at nearly every token. Once a model outgrows availa...

Tensor and Pipeline Parallelism

Learn how tensor parallelism splits weight matrices across GPUs and pipeline parallelism splits model layers, enabling inference and training of models too large for a single GPU.

Tensor Core Programming

Program NVIDIA Tensor Cores directly using the WMMA API, MMA PTX instructions, Triton tl.dot(), and CUTLASS - understand activation requirements, shape constraints, and how to diagnose zero Tensor Core utilization.

Tensor Cores and Mixed Precision

How tensor cores accelerate matrix multiply, BF16 vs FP16 vs FP8 vs TF32, mixed precision training implementation, and the performance impact of precision choices.

TensorRT and Inference Optimization

NVIDIA TensorRT compilation pipeline, layer fusion, precision calibration, kernel auto-tuning, and deploying optimized inference engines for production LLM and computer vision workloads.

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

We release Terminal Wrench, a subset of 331 terminal-agent benchmark environments, copied from the popular open benchmarks that are demonstrably reward-...

Terraform for ML Infrastructure

Build complete ML platforms with Terraform - GPU clusters, MLflow, EKS, feature stores, and model registries using production-grade HCL modules.

Terraform Fundamentals

Master Terraform core concepts - providers, resources, state management, modules, and the plan/apply lifecycle for building reproducible ML infrastructure.

TESSERA v2: Scaling Pixel-wise Earth Foundation Models

Pixel-wise Earth-observation (EO) foundation models are now achieving state-of-the-art performance via generated spatial embeddings. However, how these...

Test-Driven Agent Loops

The most powerful technique for coding agents: use test output as the ground truth feedback signal. TDD loops, pytest integration, output parsing, and backtracking.

Test-Time Adaptation for EEG Foundation Models: A Systematic Study under Real-World Distribution Shifts

Electroencephalography (EEG) foundation models have shown strong potential for learning generalizable representations from large-scale neural data, yet...

Test-Time Compute - Scaling at Inference

The paradigm shift from training-time scaling to inference-time scaling - best-of-N sampling, majority voting, and how spending more compute at inference improves reasoning quality.

Testing and Monitoring Pipelines

Unit testing DAGs, SLA monitoring, and alerting on pipeline failures.

Testing Data Pipelines for ML Correctness

How to test batch ML data pipelines with unit tests, integration tests, and data quality checks - catching label leakage, schema drift, and idempotency bugs before they corrupt your models.

Testing Full Mediation of Treatment Effects and the Identifiability of Causal Mechanisms

In causal analysis, understanding the causal mechanisms through which an intervention or treatment affects an outcome is often of central interest. We p...

Testing ML Code

Build a practical ML test suite from zero - covering the full pyramid from unit tests through model validation without testing everything.

Text Features for ML

Turning text into ML features - from TF-IDF baselines to embedding-based representations that improved e-commerce search NDCG by 18%.

Text-Attributed Graph Learning with Coupled Augmentations.

Text-Attributed Graph Learning with Coupled Augmenta... - published at COLING 2025.

TextLDM: Language Modeling with Continuous Latent Diffusion

Diffusion Transformers (DiT) trained with flow matching in a VAE latent space have unified visual generation across images and videos. A natural next st...

TGI and Alternative Serving Frameworks

Compare HuggingFace TGI, Ollama, LiteLLM, Triton Inference Server, and llama.cpp for LLM deployment - feature analysis, performance benchmarks, and when to use each framework.

The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-Calculus

LLMs are increasingly used as general-purpose reasoners, but long inputs remain bottlenecked by a fixed context window. Recursive Language Models (RLMs)...

The 12-Factor App - Building Deployable Python Apps

Apply the 12-Factor App methodology to Python applications with FastAPI, Docker, and PostgreSQL - covering all 12 factors with production-ready code examples.

The Agent Loop: Observe, Think, Act

Master the Observe-Think-Act loop that drives every AI agent - from the detailed mechanics of each phase to error handling, backtracking, and token management.

The Alignment Problem

Why making AI systems do what we actually want is harder than it looks - the specification problem, Goodhart's Law, reward hacking, and outer vs inner alignment.

The Bernstein-von Mises theorem for Bayesian one-pass online learning

Bayesian online learning provides a coherent framework for sequential inference. However, its theoretical understanding remains limited, particularly in...

The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate...

The Challenge of Attention at Long Contexts

Why attention is O(n²) in memory and compute, how the KV cache grows with context length, and how FlashAttention solves the IO bottleneck without changing the algorithm.

The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus

Decentralized Autonomous Organizations (DAOs) are inclined to explore Small Language Models (SLMs) as edge-native constitutional firewalls to vet proposals...

The Cold Start Problem - When Your Recommender Knows Nothing

How to recommend to new users and new items when collaborative filtering has no interaction history - the cold start problem and its production solutions.

The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling

Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance--as it does in vi...

The Continuity Layer: Why Intelligence Needs an Architecture for What It Carries Forward

The most important architectural problem in AI is not the size of the model but the absence of a layer that carries forward what the model has come to u...

The Data Engineering Landscape for AI Teams

What data engineers actually do in AI organizations, how data flows from raw sources to model serving, and when the data layer becomes the bottleneck for machine learning.

The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning

The viability of chain-of-thought (CoT) monitoring hinges on models being unable to reason effectively in their latent representations. Yet little is kn...

The Dynamic-Probabilistic Consistency Gap in Chaotic Surrogate Modeling

Dynamical systems reconstruction (DSR) aims to learn surrogate models that capture the dynamics underlying time-series data. Reliably deploying these su...

The First Token Knows: Single-Decode Confidence for Hallucination Detection

Self-consistency detects hallucinations by generating multiple sampled answers to a question and measuring agreement, but this requires repeated decodin...

The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement

Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse and reliable preference...

The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However,...

The functools Module - lru_cache, partial, reduce, singledispatch and More

Master the entire functools module at engineering depth - LRU cache internals and eviction, wraps, partial and partialmethod, reduce with operator, total_ordering, cached_property, singledispatch and singledispatchmethod, thread safety considerations, and real-world usage patterns.

The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models

Foundation models for biology and physics optimize predictive accuracy, but their internal representations systematically fail to preserve the continuou...

The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability

Reliable deployment of language models requires two capabilities that appear distinct but share a common geometric foundation: predicting whether a mode...

The GIL Explained - What It Is, What It Isn't, and How to Work Around It

Master Python's Global Interpreter Lock at engineering depth - what the GIL protects, why counter += 1 is not atomic, the check interval, I/O vs CPU-bound threading, multiprocessing, C extensions that release the GIL, and Python 3.13 free-threaded mode.

The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction

Under standard graphical assumptions, the Markov boundary of a target variable is the smallest set of features that renders every other feature redundan...

The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

Large language models (LLMs) are routinely prompted to take on social roles ranging from individuals to institutions, yet it remains unclear whether the...

The Harder Path: Last Iterate Convergence for Uncoupled Learning in Zero-Sum Games with Bandit Feedback

We study the problem of learning in zero-sum matrix games with repeated play and bandit feedback. Specifically, we focus on developing uncoupled algorit...

The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation

On-policy distillation (OPD) is an increasingly important paradigm for post-training language models. However, we identify a pervasive Scaling Law of Mi...

The Invalsi Benchmarks: measuring the Linguistic and Mathematical understanding of Large Language Models in Italian.

The Invalsi Benchmarks: measuring the Linguistic and... - published at COLING 2025.

The Iterator Protocol - How Python's for Loop Really Works

Master Python's iterator protocol at engineering depth - __iter__, __next__, StopIteration, the iterable vs iterator distinction, for-loop desugaring, iter() with sentinel, next() with default, and lazy pipelines with itertools.

The Last Human-Written Paper: Agent-Native Research Artifacts

Scientific publication compresses a branching, iterative research process into a linear narrative, discarding the majority of what was discovered along...

The LLM Language Network: A Neuroscientific Approach for Identifying Causally Task-Relevant Units.

The LLM Language Network: A Neuroscientific Approach... - published at NAACL 2025.

The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

We investigate whether post-trained capabilities can be transferred across models without retraining, with a focus on transfer across different model sc...

The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unleash maximum...

The ML Lifecycle

The complete end-to-end lifecycle of a machine learning model, from problem definition through deployment, monitoring, and eventual retirement - with feedback loops, governance, and retraining triggers.

The ML System Design Framework

A structured 4-step framework for approaching ML system design interviews and real production projects - from requirements to deep dive.

The MLOps Lifecycle

Understand the end-to-end MLOps lifecycle, maturity levels 0–3, the nine components of production ML, and why ML deployment is categorically different from software deployment.

The monotonicity of the Franz-Parisi potential is equivalent with Low-degree MMSE lower bounds

Over the last decades, two distinct approaches have been instrumental to our understanding of the computational complexity of statistical estimation. Th...

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedl...

The Probabilistic Perspective on ML - Learning as Bayesian Inference

How Bayesian inference unifies all of machine learning under one framework: prior beliefs, observed evidence, and posterior distributions over model parameters.

The Python Import System - importlib, Finders, Loaders, and Import Hooks

Master the Python import system at engineering depth - sys.modules cache, import resolution order, finders and loaders, importlib.import_module, relative vs absolute imports, __init__.py, __all__, circular imports, custom import hooks, and importlib.reload.

The ReAct Pattern

Master the ReAct (Reasoning + Acting) pattern - the 2022 breakthrough that grounds LLM reasoning in real observations and reduces hallucination in agents.

The Role of Handling Attributive Nouns in Improving Chinese-To-English Machine Translation.

The Role of Handling Attributive Nouns in Improving... - published at COLING 2025.

The Scaling Properties of Implicit Deductive Reasoning in Transformers

We investigate the scaling properties of implicit deductive reasoning over Horn clauses in depth-bounded Transformers. By systematically decorrelating p...

The Second Tiny Papers Track at ICLR 2024, Tiny Papers @ ICLR 2024, Vienna, Austria, May 11, 2024

The Second Tiny Papers Track at ICLR 2024, Tiny Papers @ ICLR 2024, Vienna, Austria, May 11, 2024 - published at Tiny Papers @ ICLR 2024.

The Stability of Online Algorithms in Performative Prediction

The use of algorithmic predictions in decision-making leads to a feedback loop where the models we deploy actively influence the data distributions we s...

The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail

Niche-domain Indic ASR -- digit strings, currency amounts, addresses, brand names, English/Indic codemix -- is under-served by both open-source SOTA and...

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scal...

Thermodynamic Response Functions in Singular Bayesian Models

Singular statistical models-including mixtures, matrix factorization, and neural networks-violate regular asymptotics due to parameter non-identifiabili...

Theta-regularized Kriging: Modelling and Algorithms

To obtain more accurate model parameters and improve prediction accuracy, we proposed a regularized Kriging model that penalizes the hyperparameter thet...

Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

Humans paint images incrementally: they plan a global layout, sketch a coarse draft, inspect, and refine details, and most importantly, each step is gro...

Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the qualit...

Thinking Before Constraining: A Unified Decoding Framework for Large Language Models

Natural generation allows Large Language Models (LLMs) to produce free-form responses with rich reasoning, yet the lack of structure makes outputs diffi...

Thinking with DistilQwen: A Tale of Four Distilled Reasoning and Reward Model Series.

Thinking with DistilQwen: A Tale of Four Distilled R... - published at EMNLP 2025.

ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

We introduce ThinkTwice, a simple two-phase framework that jointly optimizes LLMs to solve reasoning problems and refine the answers, based on Group Rel...

Thread Blocks, Warps, and Grids

Master the CUDA thread hierarchy - threads, warps, blocks, and grids - how they map to physical hardware, how to calculate global thread indices for 1D, 2D, and 3D problems, and how to choose block dimensions for maximum SM occupancy.

Three-Phase Transformer

We present Three-Phase Transformer (3PT), a residual-stream structural prior for decoder-only Transformers on a standard SwiGLU + RMSNorm + RoPE + GQA b...

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

Efficient attention algorithms are critical to mitigate the quadratic cost of attention in long-context workloads. Prior work utilises block-scaled quan...

TIDE: Every Layer Knows the Token Beneath the Context

We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and...

TIES Merging - Resolving Sign Conflicts

How TIES-Merging eliminates task interference by trimming small deltas, electing signs by majority vote, and merging only aligned parameters.

Tiling and Shared Memory Optimization

How tiled matrix multiply reduces HBM traffic by reusing data in shared memory, optimal tile size selection, double buffering with cp.async, and applying the tiling pattern to attention and convolution.

Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge Graphs and Agentic Memory

Structured memory representations such as knowledge graphs are central to autonomous agents and other long-lived systems. However, most existing approac...

Time Series Forecasting Patterns

Master the core patterns, classical methods, and deep learning approaches for time series forecasting - including the most critical mistake practitioners make with train/test splits.

Time Series Foundation Models as Strong Baselines in Transportation Forecasting: A Large-Scale Benchmark Analysis

Accurate forecasting of transportation dynamics is essential for urban mobility and infrastructure planning. Although recent work has achieved strong pe...

Time-Series Features

Feature engineering for temporal data - lag features, rolling statistics, Fourier seasonality, and preventing temporal leakage that destroys production forecasts.

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result...

TIP: Token Importance in On-Policy Distillation

On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter...

TIPA: Typologically Informed Parameter Aggregation.

TIPA: Typologically Informed Parameter Aggregation. - published at EACL 2026.

Token Cost Monitoring

Monitor and control LLM API costs in production - cost-per-request dashboards, budget alerts, token efficiency optimization, cost attribution by feature and user, and anomaly detection.

Token Time Continuous Diffusion for Language Modeling

In this paper we introduce token time continuous diffusion (TTCD), a new diffusion language model which (a) operates in continuous space, deterministica...

Token-Based Dual-view Fusion and Adaptation of Large Vision Models for Breast Cancer Classification

Accurate breast cancer classification from mammography requires effective integration of complementary information from craniocaudal (CC) and mediolater...

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while prese...

Tokenization Deep Dive

How tokenizers convert raw text to token IDs - BPE from scratch, WordPiece, byte-level BPE, and the surprising ways tokenization breaks models.

Tool Use and Function Calling

Master how AI agents call tools - from JSON schema definitions to parallel execution, error handling, and the tool design principles that make agents reliable.

Tool Use and Function Calling

Enabling LLMs to invoke external tools and APIs through structured function calling, covering JSON schema design, Anthropic vs OpenAI formats, parallel tool calls, and production safety.

Tool Use for Coding

Complete coding agent tool set: file operations, bash execution, search, git integration, LSP queries - full implementations with safety and error handling.

TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering

Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by extracting information or simple aggregation....

torch.compile and XLA

Deep dive into PyTorch's torch.compile architecture - TorchDynamo graph capture, AOTAutograd, TorchInductor code generation, XLA for TPU/GPU, and when compiler-based optimization delivers real ML performance gains.

TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

Recent advances in unified multimodal models (UMMs) have led to a proliferation of architectures capable of understanding, generating, and editing acros...

Toward Autonomous Long-Horizon Engineering for ML Research

Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across ta...

Toward Generative Quantum Utility via Correlation-Complexity Map

We propose a Correlation-Complexity Map as a practical diagnostic tool for determining when real-world data distributions are structurally aligned with...

Toward Native Multimodal Modeling: A Roadmap

Multimodal modeling represents a vital step from modality-agnostic reasoning toward world modeling. While early approaches predominantly rely on late-fu...

Toward World Models for Epidemiology

World models have emerged as a unifying paradigm for learning latent dynamics, simulating counterfactual futures, and supporting planning under uncertai...

Towards Autonomous and Auditable Medical Imaging Model Development

Large language model (LLM) agents are beginning to automate machine learning engineering (MLE) by coupling planning, code execution, debugging, and empi...

Towards Autonomous Mechanistic Reasoning in Virtual Cells

Large language models (LLMs) have recently gained significant attention as a promising approach to accelerate scientific discovery. However, their appli...

Towards Customized Multimodal Role-Play

Unified multimodal understanding and generation models enable richer human-AI interaction. Yet jointly customizing a character's persona, dialogue style...

Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild

Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation, and result r...

Towards Faithful Multimodal Concept Bottleneck Models

Concept Bottleneck Models (CBMs) are interpretable models that route predictions through a layer of human-interpretable concepts. While widely studied i...

Towards Long-horizon Agentic Multimodal Search

Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managi...

Towards Mechanistically Understanding Why Memorized Knowledge Fails to Generalize in Large Language Model Finetuning

Fine-tuning LLMs to inject new knowledge faces a critical challenge: LLMs can quickly memorize new facts, yet fail to use them for downstream reasoning...

Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings.

Towards Mitigating Hallucinations in Large Vision-La... - published at ACL 2026.

Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain co...

Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

Real-time and accurate spatial audio generation is pivotal for delivering an immersive experience. However, existing spatial audio synthesis technologie...

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesi...

TPU Architecture and Use

Deep dive into Google TPU v4/v5 architecture, systolic arrays, XLA compilation, TPU pods, JAX programming model, cost comparison with GPUs, and when TPUs outperform GPU clusters.

TRACE: Capability-Targeted Agentic Training

Large Language Models (LLMs) deployed in agentic environments must exercise multiple capabilities across different task instances, where a capability is...

Traceprop: End-to-End Provenance-Guided Data Attribution for Auditable ML

The first unified system connecting raw source files through preprocessing and training to individual predictions, with gradient-based attribution and provenance-guided machine unlearning. Sub-1% lineage overhead, 266x faster than TRAK on CPU, and exceeds the retrain-from-scratch unlearning gold standard.

TRACER: Trace-Based Adaptive Cost-Efficient Routing for LLM Classification

Every call to an LLM classification endpoint produces a labeled input-output pair already retained in production logs. These pairs constitute a free, gr...

Tracing Agentic Failure from the Flow of Success

Failure attribution for LLM-based agentic systems, i.e., identifying which steps in a failure trajectory caused the task to fail, is critical for debugg...

Tracing LLM Applications

What tracing means for LLM apps - capturing every prompt, completion, latency, cost, and error in queryable traces. Why traditional APM fails for AI, OpenTelemetry GenAI semantic conventions, and a complete production-grade tracer implementation.

Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

Post-training data plays a pivotal role in shaping the capabilities of Large Language Models (LLMs), yet datasets are often treated as isolated artifact...

Training a Student Expert via Semi-Supervised Foundation Model Distillation

Foundation models deliver strong perception but are often too computationally heavy to deploy, and adapting them typically requires costly annotations....

Training Cost Optimization

Reduce model training costs by 60–80% through spot instances, gradient checkpointing, mixed precision, and compute-optimal training - without sacrificing accuracy.

Training Cost Optimization

Reducing ML training costs systematically - spot instances, mixed precision, gradient checkpointing, compute-optimal training (Chinchilla), and distributed training overhead.

Training Data Preparation for Fine-Tuning

Building high-quality data pipelines for LoRA fine-tuning - chat templates, instruction masking, deduplication, quality filtering, synthetic data generation, and dataset formats that actually produce good models.

Training Dynamics and Debugging

Systematic debugging toolkit for neural network training - loss landscape geometry and flat minima, gradient flow analysis with per-layer norm plots, learning rate finder algorithm, cyclical LR and warmup schedules, gradient clipping strategies, NaN detection hooks, TensorBoard and W&B integration patterns, and a complete pre-training checklist with runnable code.

Training Infrastructure at Scale

Build fault-tolerant GPU training clusters with InfiniBand, NCCL collective operations, Slurm and Kubernetes job scheduling, elastic training, and automatic checkpointing for multi-day training runs.

Training Jobs on Kubernetes

Running ML training on Kubernetes - Jobs, CronJobs, PyTorchJob and TFJob with the Training Operator, fault tolerance, checkpoint-based recovery, spot node handling, and distributed training patterns.

Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration

Most agents today "self-evolve" by following rewards and rules defined by humans. However, this process remains fundamentally dependent on external su...

Training MoE Models

How to train Mixture of Experts models at scale - expert parallelism, capacity factors, token dropping, load imbalance, training instability, and the GShard approach to 600B parameters.

Trajectory Evaluation

Evaluating the full action sequence, not just the final output - trajectory metrics, automatic scoring, and comparing agent versions.

Transfer Learning for Meta-analysis Under Covariate Shift

Randomized controlled trials often do not represent the populations where decisions are made, and covariate shift across studies can invalidate standard...

Tree-of-Thought Prompting

Explore multiple reasoning paths simultaneously using Tree-of-Thought - the technique that enables LLMs to backtrack, evaluate alternatives, and solve problems that defeat linear chain-of-thought.

TREK: Distill to Explore, Reinforce to Refine

Group Relative Policy Optimization (GRPO) is effective when the current policy already samples useful reasoning trajectories, but it stalls on hard prom...

TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

While Large Language Models (LLMs) have empowered AI research agents to perform isolated scientific tasks, automating complex, real-world workflows, suc...

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importan...

Triplet-Block Diffusion RWKV

Causal Transformer language models suffer from strictly sequential decoding and a quadratic per-step attention cost. While linear-time causal models and...

TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction

Sparse-view 3D reconstruction is increasingly addressed with feed-forward splatting networks that predict explicit primitives directly from images. Yet...

Triton for Custom Kernels

Write production GPU kernels in Python with OpenAI Triton - learn the tile-based programming model, core primitives, and how to implement softmax, layer norm, GEMM, and custom attention kernels that match CUDA performance.

Triton Inference Server and TorchServe

Production-grade ML serving frameworks - NVIDIA Triton's dynamic batching and multi-backend support, TorchServe's PyTorch-native serving, and when to use each.

Trojan horse hunt in deep forecasting models: Insights from the European Space Agency competition

Forecasting plays a crucial role in modern safety-critical applications, such as space operations. However, the increasing use of deep forecasting model...

Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models

Large Language Models (LLMs) have demonstrated remarkable fluency and versatility across a wide range of NLP tasks, yet they remain prone to factual ina...

Trust Region Policy Distillation

Big goals are hard to achieve all at once; breaking them into small steps is wiser. We present Trust Region Policy Distillation (TOP-D), which transform...

Trust-Region Behavior Blending for On-Policy Distillation

On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix misma...

TRUSTEVAL: A Dynamic Evaluation Toolkit on Trustworthiness of Generative Foundation Models.

TRUSTEVAL: A Dynamic Evaluation Toolkit on Trustwort... - published at NAACL 2025.

Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items

Recent advances in image generation and editing have opened new opportunities for virtual try-on. However, existing methods still struggle to meet compl...

TT-SI: Self-Improving LLM Agents with Test-Time Training.

TT-SI: Self-Improving LLM Agents with Test-Time Trai... - published at ACL 2026.

TT4D: A Pipeline and Dataset for Table Tennis 4D Reconstruction From Monocular Videos

We present TT4D, a large-scale, high-fidelity table tennis dataset. It provides 140+ hours of reconstructed singles and doubles gameplay from monocular...

Tunable Soft Equivariance with Guarantees

Equivariance is a fundamental property in computer vision models, yet strict equivariance is rarely satisfied in real-world data, which can limit a mode...

Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization

The rise of autonomous GUI agents has triggered adversarial countermeasures from digital platforms, yet existing research prioritizes utility and robust...

Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Environments

This paper identifies a critical yet underexplored challenge in reasoning alignment from multiple multi-modal large language models (MLLMs): In non-stat...

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for...

Turning Trust to Transactions: Tracking Affiliate Marketing and FTC Compliance in YouTube's Influencer Economy

YouTube has evolved into a powerful platform where creators monetize their influence through affiliate marketing, raising concerns about transparen...

TurnOPD: Making On-Policy Distillation Turn-Aware for Efficient Long-Horizon Agent Training

On-policy distillation (OPD) trains a student policy by matching a stronger teacher on the student's own trajectories, offering a promising framework fo...

Two-Time-Scale Learning Dynamics: A Population View of Neural Network Training

Population-based learning paradigms, including evolutionary strategies, Population-Based Training (PBT), and recent model-merging methods, combine fast...

Two-Tower Models

How dual encoder architectures power billion-scale recommendation and search by separating user and item representations and querying them with approximate nearest neighbor search.

Two-Tower Models - The Architecture Powering Google, TikTok, and YouTube

How two-tower neural networks enable billion-scale retrieval by learning separate user and item towers that can be precomputed for ultra-fast inference.

Type-Checked Compliance: Deterministic Guardrails for Agentic Financial Systems Using Lean 4 Theorem Proving

The rapid evolution of autonomous, agentic artificial intelligence within financial services has introduced an existential architectural crisis: large l...

Typology-Aware Multilingual Morphosyntactic Parsing with Joint Abstract Node Modeling.

Typology-Aware Multilingual Morphosyntactic Parsing... - published at ACL 2026.

U-Cast: A Surprisingly Simple and Efficient Frontier Probabilistic AI Weather Forecaster

AI-based weather forecasting now rivals traditional physics-based ensembles, but state-of-the-art (SOTA) models rely on specialized architectures and ma...

UbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African Languages.

UbuntuGuard: A Culturally-Grounded Policy Benchmark... - published at ACL 2026.

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with rein...

UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization

MLLM-based GUI agents have demonstrated strong capabilities in complex user interface interaction tasks. However, long-horizon scenarios remain challeng...

UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents

Recent advances in mobile GUI agents have shown strong potential for automating mobile tasks, but most effective systems still depend on large vision-la...

UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts...

Uncertainty Quantification - Knowing What Your Model Doesn't Know

Calibration, reliability diagrams, Expected Calibration Error, temperature scaling, and the full toolkit for quantifying and correcting uncertainty in production ML models.

Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume

Despite their capabilities, Multimodal Large Language Models (MLLMs) may produce plausible but erroneous outputs, hindering reliable deployment. Accurat...

Uncertainty Quantification Via the Posterior Predictive Variance

We use the law of total variance to generate multiple expansions for the posterior predictive variance. These expansions are sums of terms involving con...

Uncovering Physical Drivers of Dark Matter Halo Structures with Auxiliary-Variable-Guided Generative Models

Deep generative models (DGMs) compress high-dimensional data but often entangle distinct physical factors in their latent spaces. We present an auxiliar...

Understanding and Enforcing Weight Disentanglement in Task Arithmetic

Task arithmetic provides an efficient, training-free way to edit pre-trained models, yet lacks a fundamental theoretical explanation for its success. Th...

Understanding Data Temporality Impact on Large Language Models Pre-training

Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at train time and whose temporal groun...

Understanding Reasoning from Pretraining to Post-Training

Reinforcement learning (RL) has become central to improving large language models (LLMs) on complex reasoning tasks, yet RL post-training is largely stu...

Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

The recent success of reinforcement learning (RL) in large reasoning models has inspired the growing adoption of RL for post-training Multimodal Large L...

Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher co...

UniClawBench: A Universal Benchmark for Proactive Agents on Real-World Tasks

The rapid development of large language models and multimodal large language models has accelerated the emergence of proactive agents capable of operati...

UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems t...

Unified Memory and Memory Pooling

How CUDA Unified Memory works under the hood, when it helps versus hurts performance, and how PyTorch's caching allocator and memory pools eliminate allocation overhead in production ML systems.

Unified Panoramic Geometry Estimation via Multi-View Foundation Models

Geometry estimation from perspective images has greatly advanced, maturing to the point where off-the-shelf foundation models are able to reconstruct 3D...

Uniform Diffusion Models Revisited: Leave-One-Out Denoiser and Absorbing State Reformulation

Discrete diffusion models are often trained through clean-data prediction, but the prediction can be used in different ways to define the reverse dynami...

Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often s...

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Polic...

UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

In recent years, significant progress has been made in both image generation and generated image detection. Despite their rapid, yet largely independent...

UniMesh: Unifying 3D Mesh Understanding and Generation

Recent advances in 3D vision have led to specialized models for either 3D understanding (e.g., shape classification, segmentation, reconstruction) or 3D...

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate expert set...

UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification

As large language models (LLMs) continue to advance rapidly, they are becoming increasingly capable while simultaneously demanding ever-longer context l...

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in a...

UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering

Activation-based control steers large language models (LLMs) by intervening on their internal representations during inference, and has emerged as an ef...

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, b...

unittest - The Standard Library Test Framework

Master Python's unittest framework at engineering depth - TestCase lifecycle, all assertion methods, assertRaises as context manager, setUp/tearDown/setUpClass, unittest.mock.patch, TestSuite, skip decorators, and subtests for parametrised testing.

Universal Approximation Theorem

The Universal Approximation Theorem rigorously explained - Cybenko 1989, Hornik 1991, Leshno 1993, depth separation (Telgarsky 2015/2016), Barron's theorem, NTK, Lottery Ticket Hypothesis, double descent, and NumPy demonstrations of approximation quality vs width.

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often tr...

UniVR: Thinking in Visual Space for Unified Visual Reasoning

Learning broad world knowledge directly from raw visual data is a fundamental capability of intelligence. We introduce UniVR, the first investigation in...

Unlocking Dense Metric Depth Estimation in VLMs

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text...

Unstructured Pruning

Weight-level sparsity, the Lottery Ticket Hypothesis, SparseGPT, Wanda, and 2:4 structured sparsity - why unstructured pruning is theoretically elegant but practically limited for LLMs.

Unsupervised Continual Learning for Amortized Bayesian Inference

Amortized Bayesian Inference (ABI) enables efficient posterior estimation using generative neural networks trained on simulated data, but often suffers...

UP: Unbounded Positive Asymmetric Optimization for Breaking the Exploration-Stability Dilemma

Reinforcement learning (RL) has become the standard paradigm for enhancing the complex reasoning capabilities of large language models (LLMs). To achiev...

UrBLiMP: A Benchmark for Evaluating the Linguistic Competence of Large Language Models in Urdu.

UrBLiMP: A Benchmark for Evaluating the Linguistic C... - published at ACL 2026.

Validation with Pydantic - Production Request and Response Models

Master Pydantic v2 at engineering depth - BaseModel, Field constraints, field and model validators, ORM mode, discriminated unions, partial updates for PATCH endpoints, JSON Schema generation, and the model_dump gotchas that silently corrupt production data.

Value Functions as Supermartingale Certificates

Certification methods for stochastic systems provide sufficient proof rules, based on real-valued supermartingale certificates, to determine the almost-...

Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision

We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and...

Var-JEPA: A Variational Formulation of the Joint-Embedding Predictive Architecture -- Bridging Predictive and Generative Self-Supervised Learning

The Joint-Embedding Predictive Architecture (JEPA) is often seen as a non-generative alternative to likelihood-based self-supervised learning, emphasizi...

Variational Autoencoders

Master Variational Autoencoders - ELBO derivation, reparameterization trick, β-VAE disentanglement, VQ-VAE discrete latent spaces, conditional VAE, and PyTorch implementation for MNIST generation and anomaly detection.

Variational Autoencoders - Learning Latent Distributions with Evidence Lower Bound

VAEs combine variational inference with neural networks to learn a probabilistic latent space - enabling generation, interpolation, and disentanglement.

Variational Garrote for Sparse Inverse Problems

Sparse regularization plays a central role in solving inverse problems arising from incomplete or corrupted measurements. Different regularizers corresp...

VaseMuseum: Digital Intelligent Museum for Ancient Greek Pottery

Vision-language models (VLMs) have made interactive digital museums increasingly feasible by connecting 3D digitization with natural-language artifact e...

VaSST: Variational Inference for Symbolic Regression using Soft Symbolic Trees

Symbolic regression has recently gained traction in AI-driven scientific discovery, aiming to recover explicit closed-form expressions from data that re...

VCRMNER: Visual Cue Refinement in Multimodal NER using CLIP Prompts.

VCRMNER: Visual Cue Refinement in Multimodal NER usi... - published at COLING 2025.

VecMol: Vector-Field Representations for 3D Molecule Generation

Generative modeling of three-dimensional (3D) molecules is a fundamental yet challenging problem in drug discovery and materials science. Existing appro...

Vector Databases

Compare Pinecone, Qdrant, Weaviate, Milvus, Chroma, and pgvector - understand the engineering trade-offs and build a production vector store.

Vector Databases Compared - Pinecone, Weaviate, Qdrant, Chroma, pgvector

Systematic comparison of the major vector databases - architecture, managed vs self-hosted, hybrid search, filtering, update performance, consistency, and cost.

Vector Search in Practice

How approximate nearest neighbor search works, how to choose the right vector database, and how to build production-grade retrieval pipelines that stay fast at millions of documents.

Vector Similarity Search Fundamentals

Master cosine similarity, dot product, L2 distance, exact vs approximate search, the curse of dimensionality, and how to evaluate vector search quality with recall@K.

Vectorization with NumPy - Escaping Python's Loop Overhead

Understand why Python loops are slow, how NumPy's C-level loops bypass interpreter overhead, broadcasting rules, views vs copies, memory layout, ufuncs, and real-world data pipeline optimization.

VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

As AI-assisted video creation becomes increasingly practical, instruction-guided video editing has become essential for refining generated or captured f...

VenusBench-Mobile: A Challenging and User-Centric Benchmark for Mobile GUI Agents with Capability Diagnostics

Existing online benchmarks for mobile GUI agents remain largely app-centric and task-homogeneous, failing to reflect the diversity and instability of re...

venv and virtualenv - Python Environment Isolation

Master Python virtual environments at full engineering depth - how venv works at the filesystem level, PATH manipulation, pyvenv.cfg, pyenv for Python version management, and why Docker containers are not a substitute for virtual environments.

Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewa...

Vero: An Open RL Recipe for General Visual Reasoning

What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-langua...

Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal...

VIABench: A Comprehensive Video Benchmark Collected from Blind Individuals for Visual Impairment Assistance

Visually impaired individuals (VIIs) encounter significant daily challenges due to limited access to visual information. Although Multimodal Large Langu...

VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech

Large Audio-Language Models (LALMs) are increasingly integrated into daily applications, yet their generative biases remain underexplored. Existing spee...

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience ga...

Video = World + Event Stream

We present Wan-Streamer v0.3, which reframes our native-streaming interaction model under a single organizing view: a video is a world plus an event str...

Video Generation Models are General-Purpose Vision Learners

Driven by next-token prediction, NLP shifted from task-specific models into powerful generalist foundation models. What, then, is the equivalent catalys...

Video Generation with Predictive Latents

Video Variational Autoencoder (VAE) enables latent video generative modeling by mapping the visual world into compact spatiotemporal latent spaces, impr...

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between infl...

Video-Oasis: Rethinking Evaluation of Video Understanding

The inherent complexity of video understanding makes it difficult to determine whether Video-LLM benchmark performance stems from visual perception, lin...

VideoChat3: Fully Open Video MLLM for Efficient and Generalist Video Understanding

Recent advances in video understanding have spanned motion, long video, and streaming interaction, driving this field toward real-world applications. De...

VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization

Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers dictate what...

VideoRAE: Taming Video Foundation Models for Generative Modeling via Representation Autoencoders

Video generative models commonly rely on latent spaces learned by 3D Variational Autoencoders (3D-VAEs). However, conventional 3D-VAEs are mainly optimi...

Vidu S1: A Real-Time Interactive Video Generation Model

We introduce Vidu S1, a real-time interactive video generation model supporting voice control of digital characters. Users can control video generation...

ViMU: Benchmarking Video Metaphorical Understanding

Any new medium, once it emerges, is used for more than the transmission of overt content alone. The information it carries typically operates on two lev...

Vinci2: Providing Proactive Assistance in Continuous Egocentric Videos

When should an intelligent assistant speak up without being asked? Continuous egocentric video offers rich, evolving context that enables a new form of...

Virtual Memory and Page Faults

Understand virtual memory layout, page tables, TLB, huge pages, and page faults - and how these OS mechanisms directly affect PyTorch training, large model loading, and ML dataset memory mapping.

Vision as Unified Multimodal Generation

We formulate computer vision as unified multimodal generation, where heterogeneous visual tasks are expressed in the native text and image generation sp...

Vision-Language Models

How modern AI systems combine vision encoders with language models to understand and reason about images.

Vision-Language Models Struggle to Align Entities across Modalities.

Vision-Language Models Struggle to Align Entities ac... - published at ACL 2025.

Vista4D: Video Reshooting with 4D Point Clouds

We present Vista4D, a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud. Specifically,...

Visual Reasoning through Tool-supervised Reinforcement Learning

In this paper, we investigate the problem of how to effectively master tool-use to solve complex visual reasoning tasks for Multimodal Large Language Mo...

Visual Search and Product Discovery

Image embedding models for retail visual search, CLIP-based product discovery, FAISS similarity retrieval, multimodal search combining image and text, and the systems behind shop-the-look features.

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-o...

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such setti...

ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging due...

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetit...

vLLM and Inference Servers

Learn how production inference servers like vLLM, TGI, TensorRT-LLM, and Ollama combine PagedAttention, continuous batching, and optimized kernels to serve LLMs at scale.

vLLM Architecture and Deployment

Deploy open LLMs at production scale using vLLM - PagedAttention, continuous batching, tensor parallelism, and OpenAI-compatible serving for LLaMA 3 70B and beyond.

VLM3: Vision Language Models Are Native 3D Learners

Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic...

VoxMind: An End-to-End Agentic Spoken Dialogue System

Recent end-to-end spoken dialogue models enable natural interaction. However, as user demands become increasingly complex, models that rely solely on co...

VoxpopuliTTS: a large-scale multilingual TTS corpus for zero-shot speech generation.

VoxpopuliTTS: a large-scale multilingual TTS corpus... - published at COLING 2025.

Wake up for Touch! Mask-isolated Tactile Alignment Learning in MLLMs

Touch supplies the physical grounding needed to perceive intrinsic material properties, such as friction and compliance, that vision alone often cannot...

WanSong v1.0 Technical Report

Music generation foundation models have recently attracted significant industry attention. However, achieving efficient generation and high-fidelity lon...

Warp Divergence and Control Flow

How branch divergence serializes GPU warp execution, the cost of divergence, warp shuffle intrinsics, and concrete techniques for restructuring kernels to minimize divergence.

Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories. However, existi...

Watch Before You Answer: Learning from Visually Grounded Post-Training

It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in mu...

Watching the AI Watchdogs: A Fairness and Robustness Analysis of AI Safety Moderation Classifiers.

Watching the AI Watchdogs: A Fairness and Robustness... - published at NAACL 2025.

WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual abi...

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for sy...

Weak-to-Strong Generalization via Direct On-Policy Distillation

Reinforcement learning with verifiable rewards (RLVR) is a powerful recipe for improving language-model reasoning, but it is expensive to repeat on ever...

Web Scraping Agents

Agent-based web scraping - handling dynamic JavaScript rendering, login flows, multi-page pagination, structured data extraction, and anti-detection techniques.

WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models

Large language models are rapidly evolving into interactive coding agents capable of end-to-end web coding, yet existing benchmarks evaluate only narrow...

WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning

While Large Language Models (LLMs) excel at function-level code generation, project-level tasks such as generating functional and visually aesthetic mul...

Weight Initialization

Why weight initialization determines whether deep networks train or collapse - symmetry breaking failure, Xavier/Glorot derivation, He/Kaiming for ReLU, LSUV, orthogonal init, bias strategies, and full NumPy experiments measuring gradient flow across 10 layers.

Weights & Biases - The ML Experiment Tracking Standard

How W&B's experiment tracking, hyperparameter sweeps, model registry, and artifact management transform chaotic Jupyter notebooks into reproducible, collaborative ML workflows.

Weights & Biases Deep Dive

W&B for production ML teams - run tracking, sweeps, artifact versioning, collaborative reports, alerts, and how it compares to MLflow.

What are AI Agents?

Understand precisely what an AI agent is - the definition, the 5 key properties, the taxonomy, and why LLMs finally made agents practical.

What Are Embeddings and Why They Matter

The fundamental concept of embeddings - mapping meaning to geometric space, cosine similarity, Word2Vec, the king-queen analogy, and why dense retrieval replaced keyword search.

What do Language Models Learn and When? The Implicit Curriculum Hypothesis

Large language models (LLMs) can perform remarkably complex tasks, yet the fine-grained details of how these capabilities emerge during pretraining rema...

What Does Flow Matching Bring To TD Learning?

Recent work shows that flow matching can be effective for scalar Q-value function estimation in reinforcement learning (RL), but it remains unclear why...

What is LLMOps

LLMOps defined - the operational discipline for managing LLM-powered applications in production, why it differs from MLOps, and the full lifecycle every AI engineering team must master.

What is MCP?

The Model Context Protocol - announced by Anthropic in November 2024 - solves the N×M integration problem by giving AI systems a standard way to connect to any tool or data source.

What LLM Forecasters Know but Don't Say: Probing Internal Representations for Calibration and Faithfulness

Large language models fine-tuned for forecasting can be accurate yet poorly calibrated, and their chain-of-thought (CoT) reasoning may not faithfully re...

What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search

Recent work has demonstrated the promise of orchestrating large language models (LLMs) within evolutionary and agentic optimization systems. However, th...

What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning.

What Makes for Good Visual Instructions? Synthesizin... - published at COLING 2025.

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

Tokenizers are a crucial component of latent diffusion models, as they define the latent space in which diffusion models operate. However, existing toke...

When Are Multimodal Predictions Biologically Supported? A Diagnostic Evaluation Framework

Multimodal models in oncology can produce accurate predictions, but accurate prediction does not reveal whether the model has learned biology that is sh...

When Background Matters: Breaking Medical Vision Language Models by Transferable Attack

Vision-Language Models (VLMs) are increasingly used in clinical diagnostics, yet their robustness to adversarial attacks remains largely unexplored, pos...

When Can LLMs Learn to Reason with Weak Supervision?

Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capab...

When Classic Cache Policies Fail: Learning-Augmented Replacement for Semantic Retrieval Buffers

LLM agents increasingly rely on retrieval buffers to store and reuse past experience, yet the cache management policies governing these buffers remain l...

When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems

The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offering strong pe...

When Confidence Misleads: Suffix Anchoring and Anchor-Proximity Confidence Modulation for Diffusion Language Models

Diffusion language models decode text by iteratively denoising masked token sequences, making the choice of which positions to decode a central inferenc...

When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory re...

When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a...

When One Modality Rules Them All: Backdoor Modality Collapse in Multimodal Diffusion Models

While diffusion models have revolutionized visual content generation, their rapid adoption has underscored the critical need to investigate vulnerabilit...

When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation

Large language models are increasingly used as agents in social, economic, and policy simulations. A common assumption is that stronger reasoning should...

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

Long-horizon interactions require language models to manage accumulating information: when to update their state, when to preserve their state, and what...

When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning

In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling...

When to Trust Imagination: Adaptive Action Execution for World Action Models

World Action Models (WAMs) have recently emerged as a promising paradigm for robotic manipulation by jointly predicting future visual observations and f...

When to Use a Framework

The framework vs raw API decision for agents - what abstractions cost, what they provide, and a decision tree based on your actual requirements.

When to Use Agents

A decision framework for when autonomous agents are appropriate vs. when simpler approaches are better - covering cost of agency, task classification, anti-patterns, and ROI analysis.

When to Use Reasoning Models in Production

A practical decision framework for routing tasks to reasoning models - task taxonomy, cost-benefit analysis, latency trade-offs, and hybrid routing architectures.

When to Use SSMs in Production

A practical deployment guide: use cases where SSMs win, the streaming inference pattern, model availability on HuggingFace, fine-tuning SSMs, and a forward-looking outlook.

When Your Model Stops Working: Anytime-Valid Calibration Monitoring

Practitioners monitoring deployed probabilistic models face a fundamental trap: any fixed-sample test applied repeatedly over an unbounded stream will e...

Where Do LLMs Compose Meaning? A Layerwise Analysis of Compositional Robustness.

Where Do LLMs Compose Meaning? A Layerwise Analysis... - published at EACL 2026.

Where to cut, how deep: BPE and Unigram-LM on chemistry SMILES

Every chemical language model reading SMILES begins with a tokenizer, yet the field has inherited byte-pair encoding (BPE) from natural language with li...

Who Guards the Guardians? The Challenges of Evaluating Identifiability of Learned Representations

Identifiability in representation learning is commonly evaluated using standard metrics (e.g., MCC, DCI, R^2) on synthetic benchmarks with known ground-...

Who Prices Cognitive Labor in the Age of Agents? Compute-Anchored Wages

A natural intuition about the economics of AI agents is that, because agents can be replicated at very low marginal cost, agent labor may be supplied hi...

Why AI Evaluation Is Hard

Understanding the fundamental gap between software testing and AI evaluation - non-determinism, no oracle, emergent failures, and how to build a multi-layered evaluation strategy.

Why an LLM Gateway

The case for centralizing all LLM traffic through a single gateway layer - routing, cost control, fallbacks, and observability without rewriting application code.

Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition

Zero-Shot Compositional Action Recognition (ZS-CAR) requires recognizing novel verb-object combinations composed of previously observed primitives. In t...

Why Data Versioning

The case for treating datasets as first-class versioned artifacts - regulatory requirements, reproducibility, drift detection, and the approaches to versioning (full copy, delta, pointer).

Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective.

Why Do LLM-based Web Agents Fail? A Hierarchical Pla... - published at ACL 2026.

Why Experiment Tracking

The business and technical case for tracking every ML experiment - what to track, why it matters, and what happens when you don't.

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D und...

Why Graphs for ML

When tabular data fails - graph formalism, adjacency matrix, Laplacian, graph types, real-world datasets, the Weisfeiler-Lehman test, and why CNNs cannot handle graph-structured data.

Why Human-in-the-Loop Matters

Understand why full automation fails, where the alignment gap lives, what regulations demand, and how to design the right level of human oversight for any AI system.

Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

Larger models learn tasks smaller models do not. What drives this phenomenon? We develop a simple phenomenological argument that power-law scaling alrea...

Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning

The family of linear recurrent neural networks has shown strong performance as recurrent memory units in partially observable reinforcement learning. We...

Why Model Compression Matters

The memory wall, inference costs, edge deployment, and latency requirements that make model compression essential for production AI systems - with real cost math, a full compression taxonomy, and decision frameworks for choosing the right technique.

Why Model Merging Exists

The catastrophic forgetting problem, why naive ensembles are too expensive, and the surprising geometric insight that makes model merging possible.

Why Multi-Agent Systems?

The fundamental case for multi-agent: parallelization, specialization, and verification - and the honest cost of coordination overhead.

Why RAG and When Not To

Understand why LLMs hallucinate, what RAG actually solves, and the decision framework for choosing RAG vs fine-tuning vs prompt stuffing.

Why RAG Exists

Understand why Retrieval-Augmented Generation was invented, what problems it solves that fine-tuning and prompt stuffing cannot, and how to architect a minimal RAG pipeline from scratch.

Why Structured Output Matters in Production

The taxonomy of LLM output failures, why prompt-based JSON extraction breaks at scale, the production impact of 5% failure rates, and the spectrum of solutions from prompt engineering to constrained decoding.

Why Synthetic Data

Understand why synthetic data has become central to AI engineering - the labeled data bottleneck, privacy constraints, rare events, LLMs as generators, landmark case studies, and when synthetic beats real.

WildCity: A Real-World City-Scale Testbed for Rendering, Simulation, and Spatial Intelligence

Humans can navigate an unfamiliar city and gradually form a coherent spatial mental map spanning tens of square kilometers. Can AI build spatial represe...

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However,...

WildDet3D: Scaling Promptable 3D Detection in the Wild

Understanding objects in 3D from a single image is a cornerstone of spatial intelligence. A key step toward this goal is monocular 3D object detection--...

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

While GUI agents have shown impressive capabilities in common computer-use tasks such as OSWorld, current benchmarks mainly focus on isolated and single...

Working with 128K+ Context Windows in Production

A complete production engineering guide for building applications with long-context LLMs - model selection, cost management, prompt structure, multi-turn conversation, and memory-augmented systems.

Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a wo...

World Properties without World Models: Recovering Spatial and Temporal Structure from Co-occurrence Statistics in Static Word Embeddings

Recent work interprets the linear recoverability of geographic and temporal variables from large language model (LLM) hidden states as evidence for worl...

init and Object Construction - Two-Phase Creation at Engineering Depth