iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

:::info Stub: Full Engineering Breakdown Coming
This paper has a linked code implementation and was featured on Hugging Face Papers with 1 upvote. A full breakdown with a production viability rating, implementation notes, and honest limitations is being written.
:::


Authors	Chang-Bin Zhang et al.
Year	2026
HF Upvotes	1
arXiv	2605.31096
Code	https://github.com/Visual-AI/iVGR

Abstract

While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language models (MLLMs), its efficacy during the inference phase remains underexplored. In this work, we empirically find that mandating explicit object boxes in visually grounded CoT during inference often degrades performance compared to standard textual CoT, which reasons without explicit visual grounding. We hypothesize that the visual localization capability can be internalized into the textual CoT and that the mandatory explicit grounding introduces unnecessary interference with the model's primary objective of answer prediction. To address this problem, we propose Internalizing Visually Grounded Reasoning (iVGR), a novel reinforcement learning framework that transfers localization capabilities into the textual reasoning process. We employ a dual-stream training strategy, where a textual stream is aligned with a high-quality visually grounded stream via a proposed consistency reward, enabling the model to localize accurately without explicit grounding during inference. Extensive experiments demonstrate that our method significantly outperforms existing baselines on fine-grained benchmarks, while maintaining the flexibility to support tool-assisted inference workflows.

Engineering Breakdown

The Problem

Visually grounded Chain-of-Thought (CoT) is meant to improve fine-grained perception in MLLMs, but how well it works at inference time remains underexplored. The authors hypothesize that mandatory explicit grounding interferes with the model's primary objective of answer prediction, and that localization capability can instead be internalized into textual CoT.

The Approach

Motivated by the empirical finding that mandating explicit object boxes in visually grounded CoT during inference often degrades performance compared to standard textual CoT, the authors propose Internalizing Visually Grounded Reasoning (iVGR), a reinforcement learning framework that transfers localization capabilities into the textual reasoning process. Training uses two streams: a textual stream is aligned with a high-quality visually grounded stream through a consistency reward, so that the model can localize accurately without emitting explicit boxes at inference.
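
The abstract names a consistency reward between the textual and visually grounded streams but does not define it. The sketch below shows one plausible shape for such a reward, not the authors' formulation. It assumes, hypothetically, that boxes can be elicited from the textual stream during training and compared by IoU against boxes from the grounded stream. The names `consistency_reward`, `total_reward`, and the `weight` parameter are illustrative only.

```python
from typing import Sequence, Tuple

# (x1, y1, x2, y2) in pixel coordinates; the box format is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def consistency_reward(text_boxes: Sequence[Box],
                       grounded_boxes: Sequence[Box]) -> float:
    """Hypothetical consistency term: for each grounded-stream box, take the
    best IoU against any textual-stream box, then average over grounded boxes.
    Returns 0.0 when either stream provides no boxes."""
    if not text_boxes or not grounded_boxes:
        return 0.0
    best = [max(iou(g, t) for t in text_boxes) for g in grounded_boxes]
    return sum(best) / len(best)


def total_reward(answer_correct: bool,
                 text_boxes: Sequence[Box],
                 grounded_boxes: Sequence[Box],
                 weight: float = 0.5) -> float:
    """Hypothetical scalar reward: answer correctness plus a weighted
    consistency term. The weighting scheme is an assumption."""
    return float(answer_correct) + weight * consistency_reward(text_boxes, grounded_boxes)
```

Under these assumptions, a rollout whose answer is correct and whose localization matches the grounded stream exactly scores `1.0 + weight`, while a correct answer with no overlap scores `1.0`. The actual reward design should be taken from the paper or the linked repository.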

Key Results

According to the abstract, iVGR significantly outperforms existing baselines on fine-grained benchmarks while remaining compatible with tool-assisted inference workflows. The abstract does not quote specific numbers.

Research Areas

This paper contributes to the following areas of AI/ML engineering:

Machine learning
Deep learning
Neural networks
Model optimization
AI systems
