A Generalist Pair-wise Progress Critic Model for Vision-Language-Action Robots
Abstract
Recent advances in Vision-Language-Action (VLA) models have significantly improved robotic perception and manipulation, but these models still struggle to adapt to dynamic, open-ended real-world environments because they lack reliable task-progress feedback and improvement mechanisms. To address these challenges, we propose VLAC, a generalist Vision-Language-Action-Critic model that integrates both human and robot data and unifies the action policy and the task-progress critic within a single autoregressive architecture. Specifically, we propose a scalable and generalizable pair-wise progress-understanding approach that predicts the progress delta between two steps of a trajectory while also generating correct actions to complete the task. We then train the model on large-scale, multi-source human, robot, and general vision-language data to obtain a generalist. Furthermore, we deploy VLAC in reinforcement learning, where it autonomously evaluates task progress to provide intrinsic rewards. Extensive evaluations demonstrate that our model generalizes effectively across diverse tasks and environments, leveraging its pair-wise progress understanding to provide reliable dense rewards, robust action generation, and significant improvements in real-world reinforcement learning.
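To make the pair-wise idea concrete, the following is a minimal sketch (not the authors' implementation) of how a critic's predicted progress delta between two trajectory steps could be turned into a dense per-step intrinsic reward for reinforcement learning. The function names and the toy stand-in critic are hypothetical, introduced only for illustration.

```python
# Hedged sketch: pair-wise progress deltas as dense intrinsic rewards.
# `predict_progress_delta` stands in for a learned critic (hypothetical name);
# here any callable that maps an observation pair (o_t, o_{t+1}) to a float.
from typing import Callable, List, Sequence


def compute_intrinsic_rewards(
    observations: Sequence[object],
    predict_progress_delta: Callable[[object, object], float],
) -> List[float]:
    """For each consecutive observation pair in a trajectory, query the
    critic for the predicted change in task progress and use that delta
    as the dense reward for the corresponding step."""
    return [
        predict_progress_delta(observations[t], observations[t + 1])
        for t in range(len(observations) - 1)
    ]


# Toy stand-in critic: each observation is already a scalar progress value,
# so the "predicted" delta is just the difference between the two steps.
toy_delta = lambda prev, curr: curr - prev
rewards = compute_intrinsic_rewards([0.0, 0.2, 0.5, 0.5, 1.0], toy_delta)
# rewards ≈ [0.2, 0.3, 0.0, 0.5]: positive for progress, zero for stagnation.
```

In this framing, a step that advances the task yields a positive reward, a stalled step yields roughly zero, and a regressing step yields a negative reward, which is what makes the signal dense compared with a sparse task-completion reward.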