Poster

Cross-Embodiment Robot Foundation World Models with Latent Actions

Huang Huang ⋅ Sriram Yenamandra ⋅ Arjun Majumdar ⋅ Elie Aljalbout ⋅ Tushar Nagarajan ⋅ Jimmy Yang ⋅ Akshara Rai ⋅ Michael Rabbat ⋅ Li Fei-Fei ⋅ Jiajun Wu ⋅ Tingfan Wu ⋅ Franziska Meier

Abstract

The diversity of robot embodiments and action spaces makes it challenging to build robot world models that generalize across different embodiments. We introduce the Latent Action-Conditioned Robot World Model (LAC-WM), which operates within a learned unified latent action space shared across diverse embodiments. This unified action space improves the world model’s performance when adapted to previously unseen robot embodiments. We compare LAC-WM with an Explicit Action-Conditioned World Model (EAC-WM), which conditions on explicit motion labels. Our results show that explicit action conditioning leads to disjoint action representations across embodiments, limiting downstream performance when adapting to new robots. We evaluate both models on dexterous manipulation tasks and a modified LIBERO benchmark. LAC-WM improves downstream performance over EAC-WM by up to 46.7% on dexterous manipulation and 11.7% on LIBERO. Crucially, the unified latent action space allows LAC-WM’s downstream performance to scale positively with the number of embodiments used during pretraining. In contrast, the disjoint action space in EAC-WM leads to decreased performance as the number of pretraining embodiments increases. These results highlight the importance of a unified action space for efficient cross-embodiment learning, addressing a key challenge in robotics.

Lay Summary

Robots come in many forms. Some use two-finger grippers, while others have multi-finger hands that move more like human hands. This diversity makes it difficult to train a single “robot world model” that can learn from many robots and predict what will happen when a robot takes an action, because each robot represents its actions differently. We built LAC-WM, a robot world model that learns a shared “action language” from videos of humans and different robots manipulating objects. Rather than relying directly on each robot’s specific control signals, LAC-WM learns the underlying meaning of an action, such as reaching toward an object or lifting it, in a representation that can transfer across different robot bodies. This shared representation enables the model to learn from diverse robots and adapt to a new robot it has never seen before. When used to choose among possible robot actions, LAC-WM improves task success by up to 46.7% on dexterous manipulation and increases success from 82.4% to 92.0% on pick-and-place tasks. More importantly, LAC-WM improves as it learns from more robots, while a standard method based on robot-specific action signals gets worse. These results suggest a path toward robot foundation world models that can support many different robots, rather than being rebuilt for each new robot.