RefChess: Monte-Carlo Move Selection for Zero-Shot Referring Image Segmentation
Abstract
Recent advances in zero-shot referring image segmentation (RIS), driven by foundation models such as SAM and CLIP, have improved cross-modal alignment between visual regions and natural language expressions. Nevertheless, selecting the correct segmentation proposal remains challenging: existing methods typically score each proposal independently and lack contextual reasoning among visually similar candidates. To address this limitation, we propose RefChess, a training-free framework that reformulates proposal selection as a decision-making problem under contextual perturbations rather than a single-step ranking task. RefChess models each proposal as a candidate chess move and applies Monte-Carlo Tree Search (MCTS) to evaluate its robustness by simulating interactions with competing regions, guided by a stability-aware reward that integrates language decomposition, vision–language similarity, object-centric cues, and spatial guidance signals. Experimental results on standard RIS benchmarks show that this decision-centric formulation yields consistent improvements in both robustness and referring segmentation accuracy. Code is available at \url{anonymous URL}.
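The decision-centric selection idea above can be illustrated with a minimal sketch. This is not the paper's actual method: the `stability_reward` (a precomputed similarity minus an overlap penalty), the 0.5 penalty weight, the flat Monte-Carlo rollouts over random competitor subsets (a simplification of full MCTS), and the toy proposals are all illustrative assumptions.

```python
import random

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def stability_reward(proposal, context):
    """Toy stand-in for a stability-aware reward: a precomputed
    vision-language similarity score minus a penalty for overlapping
    with the competing regions currently in the sampled context."""
    penalty = sum(iou(proposal["box"], c["box"]) for c in context)
    return proposal["sim"] - 0.5 * penalty / max(len(context), 1)

def select_proposal(proposals, n_rollouts=200, rng=None):
    """Flat Monte-Carlo move selection: each rollout perturbs the
    context by sampling a random subset of competing proposals; the
    proposal with the best average reward under these perturbations
    is chosen, favoring candidates that stay strong when visually
    similar rivals are 'on the board'."""
    rng = rng or random.Random(0)
    scores = []
    for i, p in enumerate(proposals):
        competitors = [q for j, q in enumerate(proposals) if j != i]
        total = 0.0
        for _ in range(n_rollouts):
            k = rng.randint(0, len(competitors))
            context = rng.sample(competitors, k)
            total += stability_reward(p, context)
        scores.append(total / n_rollouts)
    return max(range(len(proposals)), key=scores.__getitem__)

# Hypothetical proposals: the highest-similarity candidate has a
# near-duplicate rival, so the isolated runner-up is more stable.
proposals = [
    {"sim": 0.82, "box": (0, 0, 10, 10)},    # best sim, but contested
    {"sim": 0.80, "box": (50, 50, 60, 60)},  # slightly lower sim, isolated
    {"sim": 0.81, "box": (1, 1, 11, 11)},    # near-duplicate of the first
]
best = select_proposal(proposals)
```

Under this toy reward, the isolated proposal (index 1) wins even though its raw similarity is lower, which is the intuition behind scoring robustness under perturbation rather than ranking proposals in a single step.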