Cross-Embodiment Robot Foundation World Models with Latent Actions
Abstract
The diversity of robot embodiments and action spaces makes it challenging to build robot world models that generalize across embodiments. We introduce the Latent Action Conditioned Robot World Model (LAC-WM), which operates within a learned, unified latent action space shared across diverse embodiments. We show that this unified action space improves the world model’s performance when it is adapted to previously unseen robot embodiments. We compare LAC-WM to a baseline, the Explicit Action Conditioned World Model (EAC-WM), which is conditioned on explicit motion labels. Our results show that conditioning on explicit labels creates disjoint action spaces across embodiments, limiting downstream task performance when adapting to new robots. We evaluate both models on a dexterous manipulation task, where LAC-WM achieves up to a 46.7% improvement in performance over EAC-WM. Crucially, the unified latent action space allows LAC-WM’s downstream performance to scale positively with the number of embodiments used during pretraining. In contrast, the disjoint action space in EAC-WM leads to decreased performance as the number of pretraining embodiments increases. These results highlight the importance of a unified action space for efficient cross-embodiment learning, addressing a key challenge in robotics.