Discriminative Visual Process Rewards for Scaling Thinking at Test-Time with Images
Abstract
The “thinking with images” paradigm has led multimodal large language models to generate intermediate visual steps, such as cropping, annotation, spatial localization, and sketching, to enhance high-resolution perception and complex reasoning. However, existing multimodal Process Reward Models (PRMs) evaluate only textual reasoning and cannot judge the correctness of these visual steps, leaving a key gap on tasks where visual reasoning is essential. We propose the Discriminative Visual Process Reward Model (DiscPRM), a multimodal PRM that jointly evaluates textual and visual intermediate steps by modeling visual reasoning trajectories, image operations, and text-image consistency. To support this, we build VTReward-100K, a dataset of step-by-step visual reasoning trajectories with step-level supervision. Experiments show that using DiscPRM for Best-of-N selection substantially improves multimodal reasoning performance on tasks that require visual intermediate steps, yielding gains of over 5% across benchmarks. We further introduce VABench, the first benchmark for evaluating PRMs on visual reasoning error detection. We hope this work provides foundational support for the emerging direction of visual-textual process reward modeling.
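To make the Best-of-N setup concrete, the sketch below shows PRM-based reranking over candidate trajectories of mixed text and image steps. It is a minimal illustration, not the paper's implementation: the `Step` structure, the `score_step` interface standing in for DiscPRM, and the min-over-steps aggregation rule are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Step:
    """One intermediate reasoning step: text plus an optional image operation."""
    text: str
    image_op: Optional[str] = None  # e.g. "crop" or "annotate"; None for text-only steps

def best_of_n(
    candidates: List[List[Step]],
    score_step: Callable[[List[Step], Step], float],
) -> List[Step]:
    """Return the candidate trajectory whose worst step scores highest.

    `score_step(prefix, step)` is a hypothetical stand-in for the PRM: it
    scores `step` given the preceding steps. Aggregating step rewards by
    their minimum is one common reranking rule; the paper's actual rule
    and scoring interface may differ.
    """
    def trajectory_score(traj: List[Step]) -> float:
        scores = [score_step(traj[:i], step) for i, step in enumerate(traj)]
        return min(scores) if scores else float("-inf")

    return max(candidates, key=trajectory_score)

# Toy usage with a dummy scorer that favors longer, more detailed steps.
if __name__ == "__main__":
    a = [Step("look at the sign", "crop"), Step("read it")]
    b = [Step("guess")]
    best = best_of_n([a, b], score_step=lambda prefix, step: len(step.text))
    print([s.text for s in best])  # -> ['look at the sign', 'read it']
```

In this scheme the policy model samples N trajectories, the PRM scores every textual and visual step, and the highest-scoring trajectory is kept; only the scoring model changes between text-only PRMs and DiscPRM.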