Poster in Workshop: Interactive Learning with Implicit Human Feedback
Bayesian Inverse Transition Learning for Offline Settings
Leo Benac · Sonali Parbhoo · Finale Doshi-Velez
Offline reinforcement learning is commonly used for sequential decision-making in domains such as healthcare and education, where the rewards are known and the transition dynamics T must be estimated from batch data. A key challenge across these tasks is learning a reliable estimate of the transition dynamics T that produces near-optimal policies which are safe, in that they never take actions far from the best action with respect to their value functions, and informative, in that they communicate their uncertainties. Using an expert's feedback, we propose a new gradient-free, constraint-based approach that captures these desiderata for reliably learning a posterior distribution over the transition dynamics T. Our results demonstrate that by using our constraints, we learn a high-performing policy while considerably reducing the policy's variance across different datasets. We also show how combining uncertainty estimation with these constraints lets us infer a partial ranking of actions that produce higher returns, yielding safer and more informative policies for planning.
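As a concrete illustration of how a gradient-free, constraint-based posterior over T might be computed, the sketch below rejection-samples tabular transition dynamics from a Dirichlet posterior, keeps only draws under which the expert's actions are near-optimal, and uses the surviving samples to infer a partial ranking of actions. This is a minimal sketch, not the authors' implementation: the MDP sizes, the Dirichlet prior, the eps tolerance, the 95% dominance threshold, and the synthetic expert feedback are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, eps = 5, 3, 0.95, 0.05            # assumed small tabular MDP
R = rng.uniform(0.0, 1.0, size=(S, A))         # known rewards
counts = rng.integers(1, 10, size=(S, A, S)).astype(float)  # batch transition counts

def q_values(T):
    """Q-values of the optimal policy under transitions T, via value iteration."""
    V = np.zeros(S)
    for _ in range(500):
        Q = R + gamma * (T @ V)                # (S, A, S) @ (S,) -> (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < 1e-8:
            break
        V = V_new
    return Q

# Synthetic "expert feedback": at a few states, the expert demonstrates the
# action that is optimal under the posterior-mean dynamics (an assumption
# made so the constraints below are satisfiable in this toy example).
T_mean = (counts + 1.0) / (counts + 1.0).sum(axis=-1, keepdims=True)
Q_ref = q_values(T_mean)
expert = {s: int(Q_ref[s].argmax()) for s in (0, 2)}

def sample_constrained_posterior(n_samples=200, max_tries=20000):
    """Rejection-sample T from the Dirichlet posterior, keeping only draws
    under which every expert action is within eps of the best Q-value.
    No gradients are involved: the constraints act purely by rejection."""
    accepted = []
    for _ in range(max_tries):
        T = np.stack([[rng.dirichlet(counts[s, a] + 1.0) for a in range(A)]
                      for s in range(S)])      # shape (S, A, S)
        Q = q_values(T)
        if all(Q[s, a] >= Q[s].max() - eps for s, a in expert.items()):
            accepted.append(Q)
            if len(accepted) == n_samples:
                break
    return np.stack(accepted)

Qs = sample_constrained_posterior()
# Partial ranking of actions at each state: a is ranked above b only when
# a's Q-value beats b's in at least 95% of posterior samples; otherwise
# the pair is left incomparable, communicating the remaining uncertainty.
for s in range(S):
    for a in range(A):
        for b in range(A):
            if a != b and np.mean(Qs[:, s, a] > Qs[:, s, b]) >= 0.95:
                print(f"state {s}: action {a} > action {b}")

Rejection sampling is the simplest way to impose expert-feedback constraints without gradients; the paper's actual constraint formulation and inference procedure may differ from this toy construction.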