

Poster in Workshop: Multi-modal Foundation Model meets Embodied AI (MFM-EAI)

GROOT-1.5: Learning to Follow Multi-Modal Instructions from Weak Supervision

Shaofei Cai · Bowei Zhang · Zihao Wang · Xiaojian Ma · Anji Liu · Yitao Liang


Abstract: This paper studies the problem of learning an agent policy that can follow various forms of instructions. Specifically, we focus on multi-modal instructions: the policy is expected to accomplish tasks specified by 1) a reference video, a.k.a. a one-shot demonstration; 2) a textual instruction; or 3) an expected return. Canonical goal-conditioned imitation learning pipelines require strong supervision (labeled data) in the form of $\langle \tau, c\rangle$ pairs (where $\tau$ denotes a trajectory $(s_1, a_1, \dots)$ and $c$ denotes an instruction) from all modalities, which can be hard to obtain. To this end, we propose GROOT-1.5, which learns from mostly unlabeled trajectories $\tau$ plus a relatively small amount of strongly supervised data $\langle \tau, c\rangle$. The key idea is a novel algorithm that learns a shared intention space from the trajectories $\tau$ themselves and the labels $c$, i.e., semi-supervised learning. We evaluate GROOT-1.5 on various benchmarks, including open-world Minecraft, Atari games, and robotic manipulation, where it demonstrates strong steerability and performance.
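To make the semi-supervised idea concrete, here is a minimal sketch of how a shared intention space could be trained from mostly unlabeled trajectories plus a few labeled $\langle \tau, c\rangle$ pairs. All module names, dimensions, and the alignment loss below are illustrative assumptions, not the paper's actual architecture: a trajectory encoder is trained self-supervised on all data via behavior cloning, while instruction encoders are pulled toward the trajectory intentions on the labeled subset.

```python
# Hypothetical sketch of semi-supervised intention-space learning.
# Names, sizes, and losses are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT = 64  # dimensionality of the shared intention space (assumed)

class TrajectoryEncoder(nn.Module):
    """Maps a trajectory (s_1, a_1, ...) to an intention z; usable on unlabeled data."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.rnn = nn.GRU(obs_dim + act_dim, 128, batch_first=True)
        self.head = nn.Linear(128, LATENT)

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)   # (B, T, obs_dim + act_dim)
        _, h = self.rnn(x)                  # final hidden state: (1, B, 128)
        return self.head(h.squeeze(0))      # (B, LATENT)

class InstructionEncoder(nn.Module):
    """Maps an instruction embedding c (text / video / return) into the same space."""
    def __init__(self, instr_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(instr_dim, 128), nn.ReLU(),
                                 nn.Linear(128, LATENT))

    def forward(self, c):
        return self.net(c)

class Policy(nn.Module):
    """Intention-conditioned policy pi(a | s, z); continuous actions assumed."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + LATENT, 128), nn.ReLU(),
                                 nn.Linear(128, act_dim))

    def forward(self, obs, z):
        z = z.unsqueeze(1).expand(-1, obs.size(1), -1)   # broadcast z over time
        return self.net(torch.cat([obs, z], dim=-1))

def training_step(traj_enc, instr_enc, policy, unlabeled, labeled, alpha=1.0):
    """One step: behavior cloning on all trajectories + alignment on labeled pairs."""
    obs_u, act_u = unlabeled              # unlabeled trajectories tau
    obs_l, act_l, c_l = labeled           # labeled <tau, c> pairs
    obs = torch.cat([obs_u, obs_l])
    act = torch.cat([act_u, act_l])

    z = traj_enc(obs, act)                            # self-supervised intentions
    bc_loss = F.mse_loss(policy(obs, z), act)         # imitate actions given intention

    # Pull the instruction embedding toward its trajectory's intention so that,
    # at test time, any modality's encoder can steer the same policy.
    align_loss = F.mse_loss(instr_enc(c_l), traj_enc(obs_l, act_l).detach())
    return bc_loss + alpha * align_loss
```

Under these assumptions, the bulk of the training signal (the behavior-cloning term) comes from unlabeled trajectories alone, and the small labeled set is only needed to anchor each instruction modality into the shared intention space.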
