CofactGVR: Counterfactual Intervention for Grounded Visual Reasoning
Abstract
Despite rapid progress in Grounded Visual Reasoning (GVR) with MLLMs and RL-style fine-tuning, existing approaches often lack effective learning signals for intermediate grounding decisions and are prone to shortcut solutions. In this work, we explicitly decompose GVR into Evidence Generation followed by Counterfactual Answer Reasoning, and formalize this structure as a Causal Grounding Graph (CGG) in which the generated evidence acts as a causal mediator. Building on this formulation, we propose CofactGVR, which estimates the mediator’s utility via a matched counterfactual intervention that perturbs the predicted region while keeping the original image–question context fixed. The factual–counterfactual reward gap yields a principled intermediate bonus, selectively assigned to high-quality factual rollouts to promote evidence-faithful reasoning. To further stabilize and efficiently exploit this causal training signal, we incorporate a Quantile-filtered Prioritized Advantage Sampling (Q-PAS) strategy that preferentially updates on trajectories with high-magnitude advantages while filtering out low-signal samples. Extensive experiments across GVR benchmarks show consistent improvements, indicating that CofactGVR strengthens reliance on informative visual evidence under controlled interventions. The source code will be made publicly available.
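The two training signals described above can be illustrated with a minimal sketch. All function names and the quality-gating rule are hypothetical simplifications, not the authors' implementation: `counterfactual_bonus` realizes the factual–counterfactual reward gap, granted only to high-quality factual rollouts, and `qpas_filter` realizes Q-PAS as a quantile threshold on advantage magnitudes.

```python
import numpy as np

def counterfactual_bonus(r_factual: float, r_counterfactual: float,
                         quality_ok: bool) -> float:
    """Hypothetical intermediate bonus from the factual-counterfactual
    reward gap, assigned only to high-quality factual rollouts."""
    gap = r_factual - r_counterfactual
    # A positive gap means the predicted region was genuinely useful
    # evidence: perturbing it hurt the answer reward.
    return gap if (quality_ok and gap > 0) else 0.0

def qpas_filter(advantages: np.ndarray, q: float = 0.5) -> np.ndarray:
    """Hypothetical Q-PAS: keep trajectories whose advantage magnitude
    meets or exceeds the q-th quantile of the batch, filtering out
    low-signal samples before the policy update."""
    mags = np.abs(advantages)
    threshold = np.quantile(mags, q)
    return mags >= threshold
```

In this sketch, the bonus is zero whenever the counterfactual rollout matches or beats the factual one, so the model only receives credit when its grounding decision causally contributed to the answer.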