How does visual work in practice?

How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning covers imagine, visual, thinking from first principles with code examples. Free lesson at https://engineersofai.com/docs/research/paper-breakdowns/2026-05-26-how-and-what-to-imagine-visual-thinking-in-unified-multimodal-models-for-crossvi

What is the difference between imagine and thinking?

See the full breakdown at https://engineersofai.com/docs/research/paper-breakdowns/2026-05-26-how-and-what-to-imagine-visual-thinking-in-unified-multimodal-models-for-crossvi

How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning

:::info Stub — Full Engineering Breakdown Coming This paper was featured on Hugging Face Daily Papers on 2026-05-26 with 18 upvotes. A full breakdown with production viability rating, implementation notes, and honest limitations is being written. Subscribe to AI Letters → :::


Authors	Qian Yang et al.
Year	2026
HF Upvotes	18
arXiv	2605.27310
PDF	Download
HF Page	View on Hugging Face

Abstract

Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry needed for the task. Thinking with images aims to address this by generating an intermediate thinking image, but recent work shows that models often ignore the visual evidence in these traces. We therefore ask how to make visual thinking matter, and what kind of visual thinking works best. We study these questions in unified multimodal models (UMMs), which natively support interleaved image-text generation. For the first question, we propose View Dropout (VDrop), a training-time intervention that hides parts of one input view from the answer span while keeping them visible to the thinking-image tokens. This encourages the model to use the thinking image when answering, instead of relying only on the input views. Once the thinking image is used for answer prediction, we study which type of visual thinking is most effective. We frame this as a learnability-informativeness tradeoff and compare three thinking-image variants: top-down, panoramic, and point-matching renderings. Trained on synthetic scenes and evaluated on five real-world out-of-domain benchmarks, panoramic visual thinking with VDrop is the only configuration that is both informative and learnable, and it achieves the best out-of-domain generalization.

Engineering Breakdown

The Problem

Thinking with images aims to address this by generating an intermediate thinking image, but recent work shows that models often ignore the visual evidence in these traces.

The Approach

For the first question, we propose View Dropout (VDrop), a training-time intervention that hides parts of one input view from the answer span while keeping them visible to the thinking-image tokens.

Key Results

Trained on synthetic scenes and evaluated on five real-world out-of-domain benchmarks, panoramic visual thinking with VDrop is the only configuration that is both informative and learnable, and it achieves the best out-of-domain generalization.

Research Areas

This paper contributes to the following areas of AI/ML engineering:

Machine learning
Deep learning
Neural networks
Model optimization
AI systems
Multimodal

:::tip Subscribe Get weekly breakdowns of papers like this in AI Letters - the newsletter for engineers building production AI systems. :::

Back to Research Lab → · Subscribe to AI Letters →

Abstract​

Engineering Breakdown​

The Problem​

The Approach​

Key Results​

Research Areas​