Vision-Language-Action Pretraining from Large-Scale Human Videos
Abstract
Existing Vision-Language-Action (VLA) models struggle with complex manipulation tasks that demand high dexterity and generalization, primarily because they rely on synthetic data with significant sim-to-real gaps or on limited teleoperated demonstrations. To address this bottleneck, we propose leveraging human hands as a manipulator template, capitalizing on the rich dexterity and scalability of human manipulation data available on the web. Our approach introduces physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, perspective spatial alignment for reasoning in a unified physical space, and post-training adaptation in physical environments. Additionally, we introduce a part-level motion tokenization method that achieves millimeter-level reconstruction accuracy, modeling precise hand trajectories that serve as scalable motion primitives. To support this paradigm, we develop a comprehensive data curation pipeline that integrates heterogeneous sources into a large-scale dataset with millions of motion-based instructional instances. Empirically, our model achieves superior performance in hand motion generation and instruction following, exhibiting favorable scaling behavior with respect to model and data size. Importantly, we demonstrate promising transfer to robotic dexterous manipulation, validating the effectiveness of bridging the human-robot embodiment gap.
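The part-level motion tokenization is only named at this level of detail in the abstract; as an illustrative sketch under assumptions, the snippet below shows one common way such a tokenizer could be realized, namely per-part vector quantization of hand pose trajectories via nearest-neighbor codebook lookup. The part split, pose dimensions, and codebook size here are hypothetical and not taken from the paper.

```python
# Minimal sketch of a part-level motion tokenizer (illustrative assumptions only):
# a hand pose trajectory is split into parts (wrist plus per-finger chains), and
# each part's pose vector is quantized against its own codebook by nearest-neighbor
# lookup. Codebook size and per-part dimensions are hypothetical placeholders.
import numpy as np

PARTS = {"wrist": 6, "thumb": 12, "index": 12, "middle": 12, "ring": 12, "pinky": 12}
CODEBOOK_SIZE = 512  # hypothetical number of codes per part

rng = np.random.default_rng(0)
# In practice these codebooks would be learned (e.g., via VQ training); random here.
codebooks = {p: rng.standard_normal((CODEBOOK_SIZE, d)) for p, d in PARTS.items()}


def tokenize(trajectory: np.ndarray) -> np.ndarray:
    """Map a (T, sum(dims)) pose trajectory to a (T, num_parts) grid of token ids."""
    tokens = []
    offset = 0
    for part, dim in PARTS.items():
        chunk = trajectory[:, offset:offset + dim]  # (T, dim) slice for this part
        dists = np.linalg.norm(chunk[:, None, :] - codebooks[part][None, :, :], axis=-1)
        tokens.append(dists.argmin(axis=1))  # nearest code id per frame
        offset += dim
    return np.stack(tokens, axis=1)


def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate pose trajectory from token ids."""
    parts = [codebooks[p][tokens[:, i]] for i, p in enumerate(PARTS)]
    return np.concatenate(parts, axis=1)


if __name__ == "__main__":
    traj = rng.standard_normal((100, sum(PARTS.values())))  # 100-frame synthetic trajectory
    ids = tokenize(traj)
    recon = detokenize(ids)
    print(ids.shape, np.abs(traj - recon).mean())  # token grid shape and reconstruction error
```

The resulting per-frame, per-part token ids can then be treated as discrete motion primitives and interleaved with text tokens for autoregressive training, which is the role the abstract assigns to the tokenizer.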