Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

:::info Stub: Full Engineering Breakdown Coming
This paper was featured on Hugging Face Daily Papers on 2026-05-25 with 23 upvotes. A full breakdown with production viability rating, implementation notes, and honest limitations is being written.
:::


Authors	Yusong Lin et al.
Year	2026
HF Upvotes	23
arXiv	2605.26086

Abstract

Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, limiting context-sensitive reasoning and effective assistance. Existing benchmarks similarly provide only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over rich contextual environments while remaining robust to such noise. This expanded scope also enables the evaluation of proactive assistance, requiring agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially below prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, demonstrating its utility of scalable data infrastructure.

Engineering Breakdown

The Problem

Current agent systems operate over only narrow slices of a user's digital world, and existing benchmarks likewise provide only partial user state, so they fail to capture performance in a broad, always-on setting. On Claw-Anything, the authors report that GPT-5.5 achieves only 34.5% pass@1, substantially below its results on prior benchmarks, which points to a gap between current agent capabilities and the demands of always-on personal assistance.
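The pass@1 figure above can be made concrete with a small sketch. The abstract does not say how the authors compute pass@1, so the code below assumes the widely used unbiased pass@k estimator, which for k = 1 reduces to the fraction of attempts that pass. The function names and the toy results are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n attempts, c of them passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k sample contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average the per-task estimate over (n_attempts, n_passed) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical outcomes for four tasks, one attempt each: (attempts, passes).
toy_results = [(1, 1), (1, 0), (1, 0), (1, 0)]
score = benchmark_pass_at_k(toy_results, k=1)
print(f"pass@1 = {score:.1%}")
```

With a single attempt per task, pass@1 is simply the share of tasks solved on the first try, so a score of 34.5% would mean roughly one task in three succeeds without a retry.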

The Approach

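The authors introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, they simulate months of user activity through multi-round event injection, which produces complex world states with realistic noise such as irrelevant events and conflicting signals. The broader scope also lets the benchmark evaluate proactive assistance, where the agent must anticipate user needs and deliver timely recommendations.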
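The abstract describes building these environments through multi-round event injection that mixes task-relevant activity with irrelevant events and conflicting signals. The paper's actual pipeline is not described on this page, so the following is only a toy sketch of the general idea. Every name in it (`Event`, `WorldState`, `inject_round`, the service list, the payload text) is a hypothetical stand-in chosen for illustration.

```python
import random
from dataclasses import dataclass, field

SERVICES = ["email", "calendar", "chat", "files"]  # hypothetical backend names


@dataclass
class Event:
    day: int      # simulated day on which the event occurs
    service: str  # backend service that records it
    payload: str  # free-text content the agent would see
    kind: str     # "signal", "conflict", or "noise" (hidden from the agent)


@dataclass
class WorldState:
    events: list[Event] = field(default_factory=list)

    def agent_view(self) -> list[tuple[int, str, str]]:
        """Chronological log without the hidden ground-truth labels."""
        ordered = sorted(self.events, key=lambda e: e.day)
        return [(e.day, e.service, e.payload) for e in ordered]


def inject_round(state: WorldState, round_idx: int, rng: random.Random,
                 days_per_round: int = 30, n_noise: int = 3) -> None:
    """Append one round of events covering the next block of simulated days."""
    start = round_idx * days_per_round

    def pick_day() -> int:
        return start + rng.randrange(days_per_round)

    # One task-relevant signal and one event that contradicts it.
    state.events.append(Event(pick_day(), "calendar",
                              f"Round {round_idx}: review meeting at 15:00", "signal"))
    state.events.append(Event(pick_day(), "email",
                              f"Round {round_idx}: review meeting moved to 16:00", "conflict"))
    # Irrelevant events spread across services.
    for i in range(n_noise):
        state.events.append(Event(pick_day(), rng.choice(SERVICES),
                                  f"Round {round_idx}: newsletter #{i}", "noise"))


rng = random.Random(0)  # fixed seed so the simulated history is reproducible
state = WorldState()
for round_idx in range(3):  # three rounds of 30 days, roughly three months
    inject_round(state, round_idx, rng)

log = state.agent_view()
print(len(log), "events; first:", log[0])
```

Keeping the `kind` label out of `agent_view` mirrors the evaluation idea in the abstract: the agent sees one undifferentiated history and has to work out which entries matter and which contradict each other.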

Key Results

GPT-5.5 reaches only 34.5% pass@1 on the benchmark. Alongside the benchmark, the authors release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, which they present as evidence for the utility of scalable data infrastructure.

Research Areas

This paper contributes to the following areas of AI/ML engineering:

Machine learning
Deep learning
Neural networks
Model optimization
AI systems
Claw-Anything
