Learning Transferable Interaction Primitives from Game Videos for Humanoids
Abstract
Learning humanoid control from video offers a scalable alternative to scarce, high-fidelity robot data. Existing methods, however, often rely on curated datasets and treat videos as passive kinematic priors, failing to capture the dynamic humanoid-environment interactions essential for real-world deployment. To address this, we propose TRansferable Interaction Primitives (TRIP), a framework that extracts and grounds interactions from unstructured, unlabeled game videos for physical controllers. TRIP explicitly models the dependencies between motion dynamics and environmental context via a discrete library of interaction-based action primitives. To bridge the reality gap, we introduce a shared context latent space that aligns implicit video-domain features with functional target-domain observations, enabling seamless transfer of video-mined strategies to reinforcement learning policies. Our experiments on complex terrain navigation demonstrate that TRIP achieves significant improvements in task performance, sample efficiency, and robustness.
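The abstract does not specify an implementation of the two core components. As a rough illustration only, the sketch below shows one plausible reading: a VQ-style discrete codebook standing in for the primitive library, and two domain encoders pulled together by a cosine alignment loss standing in for the shared context latent space. All names (SharedContextEncoder, PrimitiveLibrary), dimensions, and the pairing of video features with robot observations are hypothetical assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedContextEncoder(nn.Module):
    """Hypothetical encoder mapping one domain into the shared context latent space."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, x):
        # L2-normalize so both domains live on the same unit sphere
        return F.normalize(self.net(x), dim=-1)

class PrimitiveLibrary(nn.Module):
    """Discrete library of K interaction primitives, selected VQ-style
    by nearest neighbor in the shared latent space (an assumption)."""
    def __init__(self, num_primitives: int, latent_dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_primitives, latent_dim)

    def forward(self, z):
        # distances from each context latent to every primitive embedding
        d = torch.cdist(z, self.codebook.weight)   # (B, K)
        idx = d.argmin(dim=-1)                     # chosen primitive ids
        z_q = self.codebook(idx)                   # quantized latents
        # straight-through estimator keeps the encoder differentiable
        z_q = z + (z_q - z).detach()
        return z_q, idx

# Toy usage with made-up sizes: align video features with robot observations.
video_enc = SharedContextEncoder(in_dim=512, latent_dim=64)  # video-domain features
robot_enc = SharedContextEncoder(in_dim=48, latent_dim=64)   # target-domain observations
library = PrimitiveLibrary(num_primitives=32, latent_dim=64)

video_feat = torch.randn(8, 512)  # e.g., frozen video-backbone features
robot_obs = torch.randn(8, 48)    # paired proprioceptive + terrain observations

z_v, z_r = video_enc(video_feat), robot_enc(robot_obs)
align_loss = 1 - F.cosine_similarity(z_v, z_r).mean()  # pull the domains together
z_q, prim_ids = library(z_r)  # discrete primitive conditioning an RL policy
```

Under these assumptions, the discrete bottleneck is what makes video-mined strategies transferable: the downstream reinforcement learning policy conditions only on the quantized primitive latent, so it never sees raw video features and the same codebook entry can be triggered from either domain.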