V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
Abstract
Multimodal large language models (MLLMs) have achieved remarkable success in general perception, yet complex multi-step visual reasoning remains a persistent challenge. Although recent agentic approaches incorporate tool use, they often neglect critical execution feedback. Consequently, they suffer from the imagination-action-observer (IAO) bias, a misalignment between prior imagination and observer feedback that undermines reasoning stability and optimality. To bridge this gap, we introduce V-ABS, an action-observer driven beam search framework that enables deliberate reasoning through thinker-actor-observer iterations. We also propose an entropy-based adaptive weighting algorithm that mitigates the IAO bias by dynamically balancing confidence between the policy priors and the observational feedback. Moreover, we construct a large-scale supervised fine-tuning (SFT) dataset comprising over 80k samples to guide the model toward assigning higher prior confidence to correct action paths. Extensive experiments across eight diverse benchmarks show that V-ABS achieves state-of-the-art performance, delivering an average improvement of 19.7\% over the Qwen3-VL-8B baseline and consistent gains across both open-source and proprietary models.
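As a minimal illustration of the entropy-based adaptive weighting described above, the sketch below blends a policy-prior score with an observer-feedback score for each candidate action, with the blend coefficient tied to the normalized entropy of the prior, and then keeps the top-scoring candidates as in a single beam-expansion step. The function names, score ranges, and exact blending formula are illustrative assumptions for exposition, not the paper's published algorithm.

```python
import math

def entropy_weighted_scores(prior_probs, observer_scores):
    """Blend policy-prior confidence with observer feedback per candidate action.

    prior_probs: policy-prior probabilities over candidate actions (sum to 1).
    observer_scores: observer feedback scores in [0, 1], one per candidate,
        obtained after executing or simulating each action.

    A confident (low-entropy) prior keeps most of its own weight; a diffuse
    (high-entropy) prior defers to the observer feedback.
    NOTE: this formula is an assumed sketch, not the paper's exact method.
    """
    n = len(prior_probs)
    # Normalized entropy in [0, 1]: 0 = fully confident prior, 1 = uniform prior.
    h = -sum(p * math.log(p + 1e-12) for p in prior_probs) / math.log(n)
    alpha = 1.0 - h  # weight assigned to the policy prior
    return [alpha * p + (1.0 - alpha) * o
            for p, o in zip(prior_probs, observer_scores)]

def beam_step(candidates, prior_probs, observer_scores, beam_width=2):
    """Keep the top-`beam_width` candidate actions under the blended score."""
    scores = entropy_weighted_scores(prior_probs, observer_scores)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:beam_width]

# Example: a diffuse prior defers to observer feedback when ranking actions.
print(beam_step(["crop", "zoom", "ocr"],
                [0.4, 0.35, 0.25],
                [0.2, 0.9, 0.5]))
```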