Seeing Without Understanding: Disentangling Perception, Reasoning, and Simulation in VLM Gameplay
Dingyang Jin ⋅ Jiawei He ⋅ Calvin Lo ⋅ Steven Hu ⋅ Ryan Rad
Abstract
While Vision-Language Models (VLMs) excel on static visual benchmarks, they consistently underperform in game-based reasoning environments. Existing evaluations conflate failures in perception, rule comprehension, and reasoning. We propose a two-stage diagnostic framework that decomposes VLM performance into testable components: controlled perception tests that isolate visual encoding, and a $2\times2$ diagnostic matrix with a six-level rule-complexity ladder evaluated in both explicit verification and predictive simulation modes. Experiments with six state-of-the-art VLMs reveal three systematic failure patterns: (1) coordinated spatial drift, where off-by-one localization errors among adjacent pieces share the same shift direction at $1.5$–$1.9\times$ the rate expected under spatial independence; (2) perception-reasoning dissociation, where models correctly verify board states but fail to apply rules: at complex constraint levels, perception remains relatively stable while reasoning accuracy plummets, with even the best-performing model capped at $75\%$ and the others ranging from $37\%$ to $64\%$; and (3) a simulation gap, with performance dropping by up to $27$ points when predicting future states versus verifying observed outcomes. These failure patterns persist across model scales, suggesting structural limitations in bridging visual encoding and logical simulation.
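To make the coordinated-drift statistic concrete, the sketch below estimates the agreement rate expected under spatial independence and compares it to an observed agreement rate. This is an illustrative reconstruction, not the paper's analysis code: the four-direction error model, the `independence_baseline` helper, and the placeholder `observed_rate` are assumptions introduced here for exposition.

```python
import random

# Assumed error model: each mislocalized piece drifts by one cell in one of
# four cardinal directions, chosen uniformly and independently.
DIRECTIONS = ["up", "down", "left", "right"]


def independence_baseline(n_pairs: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(two independent errors share a direction).

    Analytically this is 1/len(DIRECTIONS) = 0.25 under the uniform model;
    the simulation just makes the independence assumption explicit.
    """
    rng = random.Random(seed)
    shared = sum(
        rng.choice(DIRECTIONS) == rng.choice(DIRECTIONS)
        for _ in range(n_pairs)
    )
    return shared / n_pairs


baseline = independence_baseline()  # ~0.25 under the assumed model
observed_rate = 0.42                # hypothetical measured agreement rate
print(f"coordination ratio: {observed_rate / baseline:.2f}x")  # ~1.7x here
```

A ratio near $1$ would indicate that adjacent errors drift independently; the paper's reported $1.5$–$1.9\times$ range corresponds to adjacent errors agreeing in direction well above this baseline.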