VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models
Abstract
While Vision-Language-Action models (VLAs) are rapidly advancing toward generalist robot policies, quantitatively characterizing their capability boundaries and failure modes remains challenging. To address this, we introduce VLA-Arena, a comprehensive benchmark built around a novel structured task-design framework that quantifies difficulty along three orthogonal axes: (1) Task Structure, (2) Language Command, and (3) Visual Observation. This framework allows us to systematically design tasks with fine-grained difficulty levels, enabling precise measurement of model capability frontiers. Along the task-structure axis, VLA-Arena comprises 11 task suites organized into four dimensions: Safety, Distractor, Extrapolation, and Long Horizon, totaling 170 tasks. Each suite spans three difficulty levels (L0–L2), with fine-tuning restricted to L0 to rigorously assess generalization. Orthogonal to this, language (W0–W4) and visual (V0–V4) perturbations can be applied to any task as diagnostic probes that distinguish robust grounding from superficial pattern matching. Our extensive evaluation of state-of-the-art VLAs reveals critical limitations: memorization over generalization, superficial visual perception, and neglect of safety constraints. Moreover, model rank reversals across L0–L2 confirm that each level provides non-redundant insight. To foster research addressing these limitations and to ensure reproducibility, we provide the complete VLA-Arena framework, including an end-to-end toolchain from task definition to automated evaluation, together with the VLA-Arena-S/M/L datasets for fine-tuning. Our benchmark, datasets, models, and leaderboard will be open-sourced.
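To make the composition of the three difficulty axes concrete, the sketch below shows how a task specification might look in a Python-based toolchain of this kind. All names here (TaskSpec, StructureLevel, the field names) are illustrative assumptions for exposition, not VLA-Arena's actual API.

```python
# Hypothetical sketch: composing VLA-Arena's three orthogonal difficulty
# axes into one evaluation configuration. Class and field names are
# illustrative assumptions, not the framework's actual API.
from dataclasses import dataclass
from enum import Enum


class StructureLevel(Enum):
    """Task-structure difficulty; fine-tuning data is restricted to L0."""
    L0 = 0  # seen during fine-tuning
    L1 = 1  # held-out, moderate structural shift
    L2 = 2  # held-out, large structural shift


@dataclass
class TaskSpec:
    suite: str                 # e.g. a suite from Safety, Distractor, ...
    structure: StructureLevel  # L0-L2 axis
    language_level: int        # W0-W4 language-command perturbation
    visual_level: int          # V0-V4 visual-observation perturbation

    def __post_init__(self) -> None:
        # The perturbation axes are orthogonal to task structure, so any
        # (W, V) pair can be applied to any task at any structure level.
        assert 0 <= self.language_level <= 4, "W-axis ranges over W0-W4"
        assert 0 <= self.visual_level <= 4, "V-axis ranges over V0-V4"


# Example: probe a held-out L2 Safety task with maximal language
# perturbation but clean visuals, isolating language grounding from
# visual robustness.
probe = TaskSpec(suite="Safety", structure=StructureLevel.L2,
                 language_level=4, visual_level=0)
```

Because the axes vary independently, sweeping one level while holding the other two fixed attributes a failure to a single source of difficulty.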