

Poster in Workshop: Multi-modal Foundation Model meets Embodied AI (MFM-EAI)

An Embodied Generalist Agent in 3D World

Jiangyong Huang · Silong Yong · Xiaojian Ma · Xiongkun Linghu · Puhao Li · Yan Wang · Qing Li · Song-Chun Zhu · Baoxiong Jia · Siyuan Huang


Abstract:

Leveraging massive knowledge from large language models (LLMs), recent machine learning models have shown notable success at general-purpose task solving in diverse domains such as computer vision and robotics. However, several significant challenges remain: (i) most of these models rely on 2D images and exhibit limited capacity to process 3D input; (ii) these models rarely explore tasks inherently defined in the 3D world. We introduce LEO, an embodied multi-modal generalist agent that excels at perceiving, grounding, reasoning, planning, and acting in the 3D world. LEO is trained with a unified task interface, model architecture, and objective in two stages: (i) 3D vision-language (VL) alignment and (ii) 3D vision-language-action (VLA) instruction tuning. We collect large-scale datasets comprising diverse object-level and scene-level tasks, which require considerable understanding of and interaction with the 3D world. Through extensive experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, navigation, and manipulation.
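For concreteness, below is a minimal, hypothetical sketch of the recipe the abstract describes: a single autoregressive objective over a unified sequence of 3D observations, instructions, and response/action tokens, applied in two data stages. All module names, dimensions, and datasets here are illustrative placeholders, not the authors' implementation.

```python
# Hypothetical sketch of a unified task interface with two-stage training
# (stage 1: 3D VL alignment, stage 2: 3D VLA instruction tuning).
# Placeholder modules stand in for the 3D encoder and LLM backbone.
import torch
import torch.nn as nn

VOCAB = 1000   # placeholder vocabulary covering text and action tokens
EMBED = 128

class UnifiedAgent(nn.Module):
    def __init__(self):
        super().__init__()
        self.scene_encoder = nn.Linear(6, EMBED)       # stand-in for a 3D object/point encoder
        self.token_embed = nn.Embedding(VOCAB, EMBED)
        self.backbone = nn.GRU(EMBED, EMBED, batch_first=True)  # stand-in for the LLM backbone
        self.head = nn.Linear(EMBED, VOCAB)             # one shared head for text and actions

    def forward(self, points, tokens):
        # Prefix the token sequence with encoded 3D observations, then decode
        # text/action tokens with a single shared head (the unified interface).
        scene = self.scene_encoder(points)               # (B, N_obj, EMBED)
        seq = torch.cat([scene, self.token_embed(tokens)], dim=1)
        hidden, _ = self.backbone(seq)
        return self.head(hidden[:, scene.size(1):])      # logits only over token positions

def train_stage(model, batches, lr):
    # Both stages share the same next-token objective; they differ only in the
    # data they consume (alignment pairs vs. instruction-tuning data).
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for points, tokens, targets in batches:
        logits = model(points, tokens)
        loss = loss_fn(logits.reshape(-1, VOCAB), targets.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()

if __name__ == "__main__":
    model = UnifiedAgent()
    toy = [(torch.randn(2, 8, 6),
            torch.randint(0, VOCAB, (2, 5)),
            torch.randint(0, VOCAB, (2, 5)))]
    train_stage(model, toy, lr=1e-4)   # stage 1: 3D VL alignment (placeholder data)
    train_stage(model, toy, lr=1e-5)   # stage 2: 3D VLA instruction tuning (placeholder data)
```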
