

Poster in Workshop: Multi-modal Foundation Model meets Embodied AI (MFM-EAI)

GROOT-1.5: Learning to Follow Multi-Modal Instructions from Weak Supervision

Shaofei Cai · Bowei Zhang · Zihao Wang · Xiaojian Ma · Anji Liu · Yitao Liang


Abstract: This paper studies the problem of learning an agent policy that can follow various forms of instructions. Specifically, we focus on multi-modal instructions: the policy is expected to accomplish tasks specified by 1) a reference video, a.k.a. a one-shot demonstration; 2) a textual instruction; or 3) an expected return. Canonical goal-conditioned imitation learning pipelines require strong supervision (labeled data) in the form of pairs (τ, c) — where τ denotes a trajectory (s1, a1, …) and c denotes an instruction — from all modalities, which can be hard to obtain. To this end, we propose GROOT-1.5, which learns from mostly unlabeled trajectories τ plus a relatively small amount of strongly supervised data (τ, c). The key idea is a novel algorithm that learns a shared intention space from both the trajectories τ themselves and the labels c, i.e., semi-supervised learning. We evaluate GROOT-1.5 on various benchmarks, including open-world Minecraft, Atari games, and robotic manipulation, where it demonstrates strong steerability and performance.
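The abstract's core idea — combining a self-supervised objective on plentiful unlabeled trajectories τ with an alignment objective on a few labeled (τ, c) pairs in a shared intention space — can be illustrated with a toy sketch. This is not the paper's actual algorithm or architecture; the linear "encoders", dimensions, and loss weighting below are all hypothetical stand-ins chosen only to show the semi-supervised structure of the objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not from the paper).
TRAJ_DIM, INSTR_DIM, INTENT_DIM = 8, 6, 4

# Linear maps standing in for the trajectory encoder, the instruction
# encoder, and a decoder back to trajectory space.
W_traj = rng.normal(size=(TRAJ_DIM, INTENT_DIM)) * 0.1
W_instr = rng.normal(size=(INSTR_DIM, INTENT_DIM)) * 0.1
W_dec = rng.normal(size=(INTENT_DIM, TRAJ_DIM)) * 0.1


def semi_supervised_loss(unlabeled, labeled):
    """Sum a self-supervised term on unlabeled trajectories and an
    alignment term on the small labeled set of (trajectory, instruction)
    pairs, both operating in the shared intention space."""
    # Self-supervised: reconstruct each unlabeled trajectory
    # from its inferred intention z.
    z_u = unlabeled @ W_traj
    recon = z_u @ W_dec
    loss_unsup = np.mean((recon - unlabeled) ** 2)

    # Supervised: pull the intentions of a labeled trajectory and its
    # instruction toward each other, so both modalities share one space.
    traj, instr = labeled
    z_t = traj @ W_traj
    z_c = instr @ W_instr
    loss_align = np.mean((z_t - z_c) ** 2)

    return loss_unsup + loss_align


# Mostly unlabeled data, plus a handful of strongly supervised pairs.
unlabeled = rng.normal(size=(32, TRAJ_DIM))
labeled = (rng.normal(size=(4, TRAJ_DIM)),
           rng.normal(size=(4, INSTR_DIM)))
loss = semi_supervised_loss(unlabeled, labeled)
```

At deployment, any conditioning modality (video, text, or return) would be mapped through its own encoder into the same intention space, which is what gives the policy its multi-modal steerability.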
