Oral
Offline RL Policies Should Be Trained to be Adaptive
Dibya Ghosh · Anurag Ajay · Pulkit Agrawal · Sergey Levine

Tue Jul 19 11:05 AM -- 11:25 AM (PDT) @ Room 309

Offline RL algorithms must learn policies that achieve high return, given only a fixed dataset of experience from an environment. Successfully doing so requires grappling with partial specification: the limited set of transitions leaves facets of the environment unknown. The most common way to approach this challenge is to employ pessimistic or conservative methods, which avoid behaviors that are too dissimilar from those in the training dataset. However, relying exclusively on conservatism has drawbacks: performance is sensitive to the exact degree of conservatism, and conservative objectives can recover highly suboptimal policies. In this work, we propose that offline RL methods should instead be adaptive in the presence of uncertainty. This requires learning policies that at test time condition their behavior on all the transitions seen so far during evaluation, not only the current state. We show that adaptive policies are optimal for offline RL in a Bayesian sense, and discuss exactly how policies must adapt to be optimal. We present a model-free algorithm for approximating this optimal adaptive policy, and demonstrate the efficacy of learning policies with this adaptation mechanism in several offline RL benchmarks.
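The abstract's central idea — a policy that conditions its test-time behavior on all transitions observed so far during evaluation, not only the current state — can be illustrated with a minimal sketch. This is not the paper's algorithm; the class, its parameters, and the two-armed-bandit setup below are illustrative assumptions chosen to show why history conditioning lets a policy recover from uncertainty that a purely Markovian policy cannot.

```python
class HistoryConditionedPolicy:
    """Toy sketch (not the paper's method): an adaptive policy whose
    action choice depends on the transitions seen so far at test time."""

    def __init__(self, n_actions, prior_mean=1.0, prior_count=1.0):
        # Optimistic per-action value estimates, initialised from a prior.
        # (prior_mean / prior_count are hypothetical parameters for this demo.)
        self.counts = [prior_count] * n_actions
        self.means = [prior_mean] * n_actions
        self.history = []  # every (state, action, reward) seen during evaluation

    def act(self, state):
        # Greedy w.r.t. the current estimates; since the estimates are
        # functions of the evaluation history, behavior adapts over time.
        return max(range(len(self.means)), key=lambda a: self.means[a])

    def observe(self, state, action, reward):
        # Fold the new transition into the history and update the estimate,
        # so future actions condition on everything seen so far.
        self.history.append((state, action, reward))
        self.counts[action] += 1
        self.means[action] += (reward - self.means[action]) / self.counts[action]


def evaluate(policy, arm_rewards, steps=10):
    """Deterministic two-armed bandit: arm a always pays arm_rewards[a]."""
    total = 0.0
    for _ in range(steps):
        a = policy.act(state=None)
        r = arm_rewards[a]
        policy.observe(None, a, r)
        total += r
    return total
```

In a run with `arm_rewards = [0.5, 1.0]`, the adaptive policy tries arm 0 once, sees its estimate drop below arm 1's optimistic prior, switches to arm 1, and stays there — whereas a state-only greedy policy fixed on arm 0 would never adapt. The point mirrors the abstract's argument: the history, not the (here trivial) state, carries the information needed to act well under uncertainty.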

Author Information

Dibya Ghosh (Google)
Anurag Ajay (MIT)
Pulkit Agrawal (MIT)
Sergey Levine (UC Berkeley)

Sergey Levine received a BS and MS in Computer Science from Stanford University in 2009, and a Ph.D. in Computer Science from Stanford University in 2014. He joined the faculty of the Department of Electrical Engineering and Computer Sciences at UC Berkeley in fall 2016. His work focuses on machine learning for decision making and control, with an emphasis on deep learning and reinforcement learning algorithms. Applications of his work include autonomous robots and vehicles, as well as computer vision and graphics. His research includes developing algorithms for end-to-end training of deep neural network policies that combine perception and control, scalable algorithms for inverse reinforcement learning, deep reinforcement learning algorithms, and more.
