Probabilistic Salient Object Ranking
Abstract
Salient Object Ranking (SOR) studies how humans visually explore complex scenes by predicting the ordered sequence of objects that attract attention. Existing SOR approaches typically model this ranking deterministically, assuming a single, fixed attention sequence. However, this deterministic formulation fails to capture the nature of human attention. We observe that human attention shifts are variable and stochastic: the next object of fixation is not a definitive choice but is better described by a probability distribution. Yet existing SOR methods and evaluation metrics do not account for this inherent randomness. To address this fundamental problem, we propose ProbSOR, a novel Probabilistic Salient Object Ranking model that explicitly learns the uncertainty of attention shifts by incorporating Group Relative Policy Optimization (GRPO). We use a Vision-Language Model (VLM) as the foundation for ProbSOR to identify salient objects and infer their ranked order, with a segmentation decoder for precise object extraction. Because existing SOR metrics support only deterministic rankings, we also propose a new metric tailored to ProbSOR. Further, we construct a ProbSOR dataset comprising 15,000 probabilistic SOR samples to support both model training and evaluation. Extensive experiments demonstrate that ProbSOR achieves strong performance in salient object ranking under both our proposed and traditional benchmarks.
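To make the two ideas above concrete, the following is a minimal illustrative sketch, not the ProbSOR implementation. It assumes (i) that a probabilistic ranking can be sampled by repeatedly drawing the next object in proportion to a per-object weight (a Plackett-Luce-style model chosen here only for illustration) and (ii) the commonly described GRPO step of standardizing rewards within a group of sampled outputs. The names `sample_ranking`, `toy_reward`, and `group_relative_advantages`, the example weights, and the reward definition are all hypothetical.

```python
import random
import statistics


def sample_ranking(weights, rng):
    """Sample an ordering of object indices without replacement.

    Each next object is drawn with probability proportional to its weight
    among the objects not yet ranked (assumed model, for illustration only).
    """
    remaining = list(range(len(weights)))
    ranking = []
    while remaining:
        chosen = rng.choices(
            remaining, weights=[weights[i] for i in remaining], k=1
        )[0]
        ranking.append(chosen)
        remaining.remove(chosen)
    return ranking


def toy_reward(ranking, reference):
    """Hypothetical reward: fraction of positions that match a reference."""
    return sum(a == b for a, b in zip(ranking, reference)) / len(reference)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


rng = random.Random(0)
weights = [0.6, 0.3, 0.1]  # hypothetical saliency weights for three objects
group = [sample_ranking(weights, rng) for _ in range(8)]
rewards = [toy_reward(r, reference=[0, 1, 2]) for r in group]
advantages = group_relative_advantages(rewards)
```

In this sketch, rankings that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline. The actual reward design and sampling procedure used by ProbSOR are not specified in this abstract.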