POLIA: Policy Optimization with Visual-Object-Level Intrinsic Advantage for Multimodal Reasoning
Abstract
Recent advances in group-based reinforcement learning (RL) have greatly improved the text-reasoning ability of large language models (LLMs). However, these methods insufficiently model multimodal information, leading to severe reasoning hallucinations. In this work, we propose POLIA, a novel group-based RL method with a visual-object-level intrinsic advantage for multimodal reasoning. POLIA introduces two advantage-computation stages, over candidate answers and over visual objects, respectively. The answer-level extrinsic advantages are computed from the extrinsic rewards of a group of candidate answers. In addition, we compute an intrinsic advantage for each visual object based on its confidence score and its reference relations with the final answers. Intuitively, the intrinsic advantage of an object reflects its potential contribution to the correct answer. This two-stage advantage computation provides an accurate credit-assignment mechanism over multimodal reasoning sequences that involve multiple visual objects. Experimental results on diverse multimodal reasoning benchmarks show that POLIA significantly outperforms open MLLMs and strong baselines. Our code is available at https://anonymous.4open.science/r/POLIA-AB1D.
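The two-stage advantage computation described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the group normalization of extrinsic rewards, the confidence weighting, and the way reference relations link objects to answers are all assumptions for illustration, since the abstract does not give the exact formulas.

```python
import statistics

def extrinsic_advantages(rewards):
    """Stage 1 (assumed GRPO-style): normalize each candidate answer's
    extrinsic reward by the group mean and standard deviation."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]

def intrinsic_advantages(objects, answer_advantages, references):
    """Stage 2 (hypothetical weighting): score each visual object by its
    confidence times the mean extrinsic advantage of the candidate
    answers that reference it.

    objects: list of (object_id, confidence) pairs
    answer_advantages: output of extrinsic_advantages
    references[i]: set of object ids referenced by candidate answer i
    """
    adv = {}
    for obj_id, conf in objects:
        linked = [a for a, refs in zip(answer_advantages, references)
                  if obj_id in refs]
        # Objects never referenced by any answer get zero intrinsic advantage.
        adv[obj_id] = conf * (sum(linked) / len(linked)) if linked else 0.0
    return adv
```

Under this sketch, an object that is confidently detected and consistently referenced by high-reward answers receives a large intrinsic advantage, which is the credit-assignment intuition the abstract describes.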