How does towards work in practice?

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence covers llavaonevision2, towards, nextgeneration from first principles with code examples. Free lesson at https://engineersofai.com/docs/research/paper-breakdowns/2026-05-25-llavaonevision2-towards-nextgeneration-perceptual-intelligence

What is the difference between llavaonevision2 and nextgeneration?

See the full breakdown at https://engineersofai.com/docs/research/paper-breakdowns/2026-05-25-llavaonevision2-towards-nextgeneration-perceptual-intelligence

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

:::info Stub — Full Engineering Breakdown Coming This paper was featured on Hugging Face Daily Papers on 2026-05-25 with 25 upvotes. A full breakdown with production viability rating, implementation notes, and honest limitations is being written. Subscribe to AI Letters → :::


Authors	Xiang An et al.
Year	2026
HF Upvotes	25
arXiv	2605.25979
PDF	Download
HF Page	View on Hugging Face

Abstract

We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, achieving superior performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and incorporates Windowed Attention for efficient local computation while maintaining native resolution. Its key advance is codec-stream tokenization: it treats compressed video as a continuous bit-cost stream, where bit-cost dynamics determine adaptive temporal groups, and motion-residual cues select salient spatial evidence into compact visual canvases. This allocation concentrates a limited token budget on event-bearing content, enabling more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE further places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. Furthermore, we build the LLaVA-OV-2 data and training stack around large-scale open supervision: approximately 8M re-captioned video samples for pretraining, a 4M-sample spatial corpus for fine-tuning. We also introduce JumpScore, a temporal-localization benchmark targeting fine-grained grounding in high-frequency, densely repeated motion, a regime underrepresented by existing video evaluations. A standout capability of LLaVA-OV-2 is its unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 JumpScore mAP, surpassing Qwen3-VL-8B (30.1) by +44.8 points; under matched visual-token budgets on the same benchmark, codec-stream inputs improve temporal grounding over frame sampling by +9.7 points. Across standard benchmarks, LLaVA-OneVision-2-8B further outperforms Qwen3-VL-8B by +4.3 average points on video tasks, +5.3 on spatial tasks, and +15.6 average J&F on tracking tasks.

Engineering Breakdown

The Problem

The Approach

Key Results

On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 JumpScore mAP, surpassing Qwen3-VL-8B (30.1) by +44.8 points; under matched visual-token budgets on the same benchmark, codec-stream inputs improve temporal grounding over frame sampling by +9.7 points.

Research Areas

This paper contributes to the following areas of AI/ML engineering:

Machine learning
Deep learning
Neural networks
Model optimization
AI systems
Llavaonevision2

:::tip Subscribe Get weekly breakdowns of papers like this in AI Letters - the newsletter for engineers building production AI systems. :::

Back to Research Lab → · Subscribe to AI Letters →

Abstract​

Engineering Breakdown​

The Problem​

The Approach​

Key Results​

Research Areas​