Dismantling the Illusion of Vision-Language-Action Model Competence via Explicit Distributional Shifts
Abstract
Because simulation can never exhaustively enumerate reality, generalization determines whether Vision-Language-Action (VLA) models can translate benchmark success into real-world functionality. However, current evaluation protocols often reward mechanical memorization over robust policy learning, producing a paradoxical duality of failure: high-scoring models exhibit spurious invariance to semantic changes while displaying extreme brittleness to trivial environmental perturbations. To address this, we introduce LIBERO-Gen, a diagnostic benchmark systematically designed to shift evaluation from intuition-driven heuristics to explicit distributional assumptions. Through a hierarchical protocol spanning In-Distribution, Compositional, and Domain Generalization, LIBERO-Gen reveals performance stratifications previously masked by standard metrics. Our analysis identifies Pi0.5 as the top performer (64.0% on Spatial-CG; 21.2% on Task-CG). By identifying perceptual instability and action-binding collapse as primary failure modes, and by validating the efficacy of structured "Stair" sampling, LIBERO-Gen establishes a rigorous baseline for deployment reliability.