

Poster in Workshop: Multi-modal Foundation Model meets Embodied AI (MFM-EAI)

GROOT-1.5: Learning to Follow Multi-Modal Instructions from Weak Supervision

Shaofei Cai · Bowei Zhang · Zihao Wang · Xiaojian Ma · Anji Liu · Yitao Liang


Abstract: This paper studies the problem of learning an agent policy that can follow various forms of instructions. Specifically, we focus on multi-modal instructions: the policy is expected to accomplish tasks specified by 1) a reference video, a.k.a. a one-shot demonstration; 2) a textual instruction; or 3) an expected return. Canonical goal-conditioned imitation learning pipelines require strong supervision (labeled data) in the form of $\langle \tau, c\rangle$ pairs (where $\tau$ denotes a trajectory $(s_1, a_1, \dots)$ and $c$ denotes an instruction) from all modalities, which can be hard to obtain. To this end, we propose GROOT-1.5, which learns from mostly unlabeled trajectories $\tau$ plus a relatively small amount of strongly supervised data $\langle \tau, c\rangle$. The key idea is a novel algorithm that learns a shared intention space from the trajectories $\tau$ themselves and the labels $c$, i.e., semi-supervised learning. We evaluate GROOT-1.5 on various benchmarks, including open-world Minecraft, Atari games, and robotic manipulation, where it demonstrates strong steerability and performance.
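To make the semi-supervised idea concrete, here is a minimal sketch of how a shared intention space could be trained from mostly unlabeled trajectories plus a few labeled $\langle \tau, c\rangle$ pairs. All module names, dimensions, and the alignment loss below are illustrative assumptions, not the paper's actual architecture: a trajectory encoder is trained self-supervised on all data via behavior cloning, while instruction encoders are pulled toward the trajectory intentions on the labeled subset.

```python
# Hypothetical sketch of semi-supervised intention-space learning.
# Names, sizes, and losses are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT = 64  # dimensionality of the shared intention space (assumed)

class TrajectoryEncoder(nn.Module):
    """Maps a trajectory (s_1, a_1, ...) to an intention z; usable on unlabeled data."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.rnn = nn.GRU(obs_dim + act_dim, 128, batch_first=True)
        self.head = nn.Linear(128, LATENT)

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)   # (B, T, obs_dim + act_dim)
        _, h = self.rnn(x)                  # final hidden state: (1, B, 128)
        return self.head(h.squeeze(0))      # (B, LATENT)

class InstructionEncoder(nn.Module):
    """Maps an instruction embedding c (text / video / return) into the same space."""
    def __init__(self, instr_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(instr_dim, 128), nn.ReLU(),
                                 nn.Linear(128, LATENT))

    def forward(self, c):
        return self.net(c)

class Policy(nn.Module):
    """Intention-conditioned policy pi(a | s, z); continuous actions assumed."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + LATENT, 128), nn.ReLU(),
                                 nn.Linear(128, act_dim))

    def forward(self, obs, z):
        z = z.unsqueeze(1).expand(-1, obs.size(1), -1)   # broadcast z over time
        return self.net(torch.cat([obs, z], dim=-1))

def training_step(traj_enc, instr_enc, policy, unlabeled, labeled, alpha=1.0):
    """One step: behavior cloning on all trajectories + alignment on labeled pairs."""
    obs_u, act_u = unlabeled              # unlabeled trajectories tau
    obs_l, act_l, c_l = labeled           # labeled <tau, c> pairs
    obs = torch.cat([obs_u, obs_l])
    act = torch.cat([act_u, act_l])

    z = traj_enc(obs, act)                            # self-supervised intentions
    bc_loss = F.mse_loss(policy(obs, z), act)         # imitate actions given intention

    # Pull the instruction embedding toward its trajectory's intention so that,
    # at test time, any modality's encoder can steer the same policy.
    align_loss = F.mse_loss(instr_enc(c_l), traj_enc(obs_l, act_l).detach())
    return bc_loss + alpha * align_loss
```

Under these assumptions, the bulk of the training signal (the behavior-cloning term) comes from unlabeled trajectories alone, and the small labeled set is only needed to anchor each instruction modality into the shared intention space.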
