Hidden Sensitivity in Spatial Reasoning Evaluation: Diagnosis and Re-ranking with VSI-Bench
Abstract
We introduce FN-VSI, a factor-normalized score for VSI-Bench based on an information-theoretic diagnostic, which substantially re-ranks vision-language models (VLMs). Spatial reasoning benchmarks for VLMs are typically reported as aggregate task scores, but such scores are not determined by model ability alone. VSI-Bench performance is strongly affected by task-agnostic factors such as scene source, object category, ground-truth answer value, and low-level visual properties of the input video, none of which is an intended target of evaluation. Because these factors are imbalanced and mutually entangled, raw score differences conflate spatial reasoning ability with benchmark composition. To disentangle their effects, we introduce FST, a diagnostic that estimates each factor's contribution by comparing its marginal and conditional mutual information with model scores, and classifies each factor as a direct contributor, a surface correlate, a suppressed effect, or a negligible factor. Across VSI-Bench tasks and multiple VLMs, ground-truth answer value and queried object type emerge as strong direct contributors, while several apparent sensitivities disappear after adjustment. Building on this diagnosis, FN-VSI reweights benchmark instances to neutralize the direct contributors identified by FST. The resulting re-ranking indicates that reported spatial reasoning gains can depend on the factor mixture of the benchmark rather than on genuine ability.
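To make the marginal-versus-conditional comparison concrete, the following is a minimal sketch of one way such a diagnostic could be computed. It is not the FST implementation. It assumes that factors and scores have been discretized, that plug-in (count-based) estimators are acceptable, and that a single threshold separates "high" from "low" information. The function names, the choice of conditioning variable (`control`), the threshold `eps`, and the mapping from the two estimates to the four labels are all illustrative assumptions, and the toy data is invented.

```python
from collections import Counter, defaultdict
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in joint.items())


def conditional_mutual_information(xs, ys, zs):
    """Plug-in estimate of I(X; Y | Z): within-group MI averaged over the groups of Z."""
    groups = defaultdict(lambda: ([], []))
    for x, y, z in zip(xs, ys, zs):
        groups[z][0].append(x)
        groups[z][1].append(y)
    n = len(xs)
    return sum(len(gx) / n * mutual_information(gx, gy) for gx, gy in groups.values())


def classify_factor(marginal, conditional, eps=0.05):
    """Assumed labeling rule; eps is an arbitrary illustrative threshold in bits."""
    high_marginal, high_conditional = marginal > eps, conditional > eps
    if high_marginal and high_conditional:
        return "direct contributor"
    if high_marginal:
        return "surface correlate"   # association vanishes once the control is fixed
    if high_conditional:
        return "suppressed effect"   # association appears only once the control is fixed
    return "negligible factor"


def diagnose(factor, scores, control):
    """Return (marginal MI, conditional MI, label) for one factor."""
    m = mutual_information(factor, scores)
    c = conditional_mutual_information(factor, scores, control)
    return m, c, classify_factor(m, c)


# Toy data (invented): the score copies `answer`; `scene` merely co-occurs with it.
answer = [0, 0, 0, 0, 1, 1, 1, 1]
scene = [0, 0, 0, 1, 1, 1, 1, 0]
score = list(answer)
print(diagnose(scene, score, control=answer)[2])
print(diagnose(answer, score, control=scene)[2])
```

On this toy data the scene factor is labeled a surface correlate, because its association with the score vanishes once the answer value is held fixed, while the answer factor is labeled a direct contributor. Plug-in estimators are biased upward on small samples, so any practical use would need a bias correction or a permutation baseline in place of a fixed threshold.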