Learning to Move Before Learning to Do: Task-Agnostic Pretraining for VLAs
Abstract
Vision-Language-Action (VLA) models are bottlenecked by the scarcity of expert demonstrations: expensive triplets of observations, language instructions, and actions. We propose that learning ``how to move'' can be decoupled from learning ``what to do,'' and that the former requires no task labels at all. Our two-stage framework, Task-Agnostic Pretraining (TAP), first pretrains on abundant, cheap task-agnostic data (discarded off-task trajectories or autonomous robot play) using an inverse dynamics objective that predicts the intervening action from consecutive observations. This self-supervised phase instills physical affordances (grasping, contact dynamics, end-effector control) without human annotation. A lightweight second stage then aligns these physical priors with language instructions using minimal expert data. On the SIMPLER benchmark, our approach matches models trained on over 1M expert trajectories while using orders of magnitude less labeled data, achieving a 10\% absolute gain over standard behavior cloning. In real-world WidowX experiments, it surpasses internet-scale baselines under visual distribution shift, such as camera perturbations (25\% vs. 0\%), demonstrating that task-agnostic pretraining yields robust, transferable physical representations for embodied AI.
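For concreteness, the inverse dynamics objective described in the abstract admits a standard formalization; the following is a sketch under assumptions of our own (continuous actions and a squared-error regression loss), with notation $o_t$, $a_t$, $f_\theta$, and $\mathcal{D}_{\text{play}}$ introduced here rather than fixed by the paper:
\[
\mathcal{L}_{\mathrm{IDM}}(\theta) \;=\; \mathbb{E}_{(o_t,\, a_t,\, o_{t+1}) \sim \mathcal{D}_{\text{play}}} \Big[ \big\| f_\theta(o_t, o_{t+1}) - a_t \big\|_2^2 \Big],
\]
where $f_\theta$ predicts the action that carries observation $o_t$ to $o_{t+1}$, and $\mathcal{D}_{\text{play}}$ denotes the unlabeled task-agnostic data (off-task trajectories or robot play). Because the supervision signal $a_t$ comes from the robot's own logged commands, no human annotation enters this stage; for discrete action spaces, a cross-entropy loss would be the natural substitute.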