

Poster

Bridging Environments and Language with Rendering Functions and Vision-Language Models

Théo Cachet · Christopher Dance · Olivier Sigaud


Abstract:

Vision-language models (VLMs) have tremendous potential for grounding language, and thus enabling language-conditioned agents (LCAs) to perform diverse tasks specified with text. This has motivated the study of LCAs based on reinforcement learning (RL) with rewards given by rendering images of an environment and evaluating those images with VLMs. If single-task RL is employed, such approaches are limited by the huge cost of evaluating the VLM many times to train a policy for each new task. Multi-task RL (MTRL) is a natural alternative, but MTRL does not always generalize reliably to new tasks. Therefore, this paper compares an MTRL approach with a novel decomposition of the problem: first we find a configuration (e.g., the position rather than velocity components of the state) that has a high VLM score for text describing a task; then we use goal-conditioned reinforcement learning (GCRL) to reach that configuration. We also explore several enhancements to the speed and quality of VLM-based LCAs, notably, the use of distilled models and the evaluation of configurations from multiple viewpoints to resolve the ambiguities inherent in a single 2D view. We demonstrate our approach on the Humanoid environment, showing that it results in LCAs that act on text in real-time and excel at a wide range of previously unseen tasks, without requiring any textual task descriptions or other forms of environment-specific annotation during training.
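As a rough illustration of the decomposition described in the abstract, the sketch below scores candidate configurations with a VLM from several rendered viewpoints, picks the best-matching configuration as a goal, and hands it to a goal-conditioned policy. All function and parameter names (render_fn, vlm_score_fn, gc_policy, etc.) are hypothetical placeholders, not the paper's actual implementation.

```python
import numpy as np

def select_goal_configuration(task_text, candidate_configs,
                              render_fn, vlm_score_fn, viewpoints):
    """Pick the candidate configuration whose renderings best match the task text.

    Hypothetical interfaces (assumptions, not from the paper):
      render_fn(config, viewpoint) -> image      # environment rendering function
      vlm_score_fn(image, text)    -> float      # VLM image-text score, e.g. CLIP similarity
    """
    best_config, best_score = None, -np.inf
    for config in candidate_configs:
        # Average the VLM score over several camera viewpoints to reduce
        # the ambiguity of any single 2D view.
        score = float(np.mean([vlm_score_fn(render_fn(config, v), task_text)
                               for v in viewpoints]))
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def act(task_text, state, gc_policy, candidate_configs,
        render_fn, vlm_score_fn, viewpoints):
    """Act on a text command: select a goal configuration once, then let a
    pretrained goal-conditioned (GCRL) policy drive the agent toward it."""
    goal, _ = select_goal_configuration(task_text, candidate_configs,
                                        render_fn, vlm_score_fn, viewpoints)
    return gc_policy(state, goal)
```

Because the VLM is only queried to select the goal, not at every environment step, the agent can act on new text commands in real time once the goal-conditioned policy is trained.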
