Poster in Workshop: Structured Probabilistic Inference and Generative Modeling
A Generative Model for Text Control in Minecraft
Shalev Lifshitz · Keiran Paster · Harris Chan · Jimmy Ba · Sheila McIlraith
Keywords: [ sequence models ] [ reinforcement learning ] [ foundation models ] [ Minecraft ] [ deep learning ] [ text-conditioned reinforcement learning ] [ sequential decision making ] [ instruction following ] [ transformers ] [ goal-conditioned reinforcement learning ]
Constructing AI models that respond to text instructions is challenging, especially for (multi-modal) sequential decision-making tasks. This study introduces an instruction-tuned Video Pretraining (VPT) model for Minecraft called STEVE-1, demonstrating that the unCLIP approach, used in DALL•E 2, is also effective for creating instruction-following sequential decision-making agents. STEVE-1 is trained in two steps: first adapting the pretrained VPT model to follow commands in MineCLIP's latent space, then training a prior to predict latent codes from text. This allows us to finetune VPT through self-supervised behavioral cloning and hindsight relabeling, bypassing the need for costly human text annotations. By leveraging pretrained models like VPT and MineCLIP and employing best practices from text-conditioned image generation, STEVE-1 costs just $60 to train and can perform nearly any short-horizon open-ended text or visual task in Minecraft. We provide experimental evidence highlighting key factors for downstream performance, including pretraining, classifier-free guidance, and data scaling. All resources, including our model weights, datasets, and evaluation tools, are made available for further research.
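One of the factors highlighted above, classifier-free guidance, can be illustrated with a minimal sketch. At inference time a guidance-style policy combines the logits of a goal-conditioned model with those of an unconditional model, extrapolating toward the conditional direction. The function and variable names below are illustrative assumptions, not STEVE-1's actual API:

```python
import numpy as np

def classifier_free_guidance(cond_logits, uncond_logits, scale):
    """Blend conditional and unconditional action logits.

    scale = 0 reduces to the unconditional policy, scale = 1 recovers
    the conditional policy, and larger values push the action
    distribution further toward the goal. (Hypothetical helper;
    the real agent operates on its own action-head logits.)
    """
    cond_logits = np.asarray(cond_logits, dtype=float)
    uncond_logits = np.asarray(uncond_logits, dtype=float)
    return uncond_logits + scale * (cond_logits - uncond_logits)

# Toy example with 4 discrete actions.
uncond = np.array([0.0, 0.0, 0.0, 0.0])
cond = np.array([0.5, -0.2, 1.0, 0.1])

guided = classifier_free_guidance(cond, uncond, scale=3.0)
# The goal-preferred action (index 2) is amplified: 0 + 3 * (1.0 - 0) = 3.0
```

In practice the unconditional logits would come from running the same policy with the goal embedding dropped (e.g. zeroed out), mirroring how guidance is applied in text-conditioned image generation.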