

Poster in Workshop: 1st ICML Workshop on In-Context Learning (ICL @ ICML 2024)

In-Context Reinforcement Learning Without Optimal Action Labels

Juncheng Dong · Moyang Guo · Ethan Fang · Zhuoran Yang · Vahid Tarokh


Abstract:

Large language models (LLMs) have achieved remarkable empirical successes, largely due to their in-context learning capabilities. Inspired by this, we explore training an autoregressive transformer for in-context reinforcement learning (RL). In this setting, we first train a transformer on an offline dataset of trajectories collected from various RL instances, then freeze it and use it as an action policy for new RL instances. We consider the setting where the offline dataset contains trajectories sampled from suboptimal behavioral policies. In this case, standard autoregressive training corresponds to imitation learning and results in suboptimal performance. To address this, we propose the Decision Importance Transformer (DIT), which emulates the actor-critic algorithm in an in-context manner. DIT trains a transformer-based policy using a weighted maximum likelihood estimation (WMLE) loss, where the weights are based on the observed rewards and act as importance sampling ratios, guiding the suboptimal policy toward the optimal policy. We conduct extensive experiments to test the performance of DIT on both bandit and Markov Decision Process problems. Our results show that DIT achieves superior performance, particularly when the pretraining dataset contains suboptimal action labels.
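To make the WMLE idea concrete, below is a minimal sketch of a reward-weighted maximum-likelihood objective for a transformer policy, written in Python with PyTorch. The specific weight form (exponentiated reward advantage over a given baseline, with a temperature) and all function and argument names are illustrative assumptions, not the exact formulation from the paper.

```python
import torch
import torch.nn.functional as F


def wmle_loss(action_logits, action_labels, rewards, baseline, temperature=1.0):
    """Weighted MLE: log-likelihood of logged actions, reweighted by a
    reward-derived term so that high-reward actions dominate the update.

    action_logits: (batch, num_actions) policy outputs from the transformer
    action_labels: (batch,) actions taken by the (possibly suboptimal) behavior policy
    rewards:       (batch,) observed rewards/returns for those actions
    baseline:      (batch,) baseline value estimates (e.g., from a critic); assumed given
    """
    # Per-example negative log-likelihood of the logged action
    nll = F.cross_entropy(action_logits, action_labels, reduction="none")
    # Reward-based weights acting like importance ratios (hypothetical form);
    # detached so gradients flow only through the policy's log-likelihood
    weights = torch.exp((rewards - baseline) / temperature).detach()
    return (weights * nll).mean()
```

With weights fixed to 1 this reduces to plain imitation learning on the offline trajectories; the reward-based weighting is what pushes the learned policy away from the suboptimal behavior policy.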
