Oral
Safe Policy Improvement with Baseline Bootstrapping
Romain Laroche · Paul TRICHELAIR · Remi Tachet des Combes
This paper considers Safe Policy Improvement (SPI) in Batch Reinforcement Learning (Batch RL): from a fixed dataset and without direct access to the true environment, train a policy that is guaranteed to perform at least as well as the baseline policy used to collect the data.
Our approach, called SPI with Baseline Bootstrapping (SPIBB), is inspired by the knows-what-it-knows paradigm: it bootstraps the trained policy with the baseline when the uncertainty is high.
Our first algorithm, $\Pi_b$-SPIBB, comes with SPI theoretical guarantees.
We also implement a variant, $\Pi_{\leq b}$-SPIBB, that is even more efficient in practice.
We apply our algorithms to a motivational stochastic gridworld domain and further demonstrate on randomly generated MDPs the superiority of SPIBB with respect to existing algorithms, not only in safety but also in mean performance.
Finally, we implement a model-free version of SPIBB and show its benefits on a navigation task with a deep RL implementation called SPIBB-DQN. To the best of our knowledge, SPIBB-DQN is the first RL algorithm relying on a neural network representation that is able to train efficiently and reliably from batch data, without any interaction with the environment.
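The bootstrapping idea in the abstract can be illustrated for the tabular case with a minimal sketch. It is not the authors' code: the function name `spibb_greedy_step`, the count threshold `n_wedge`, and the array layout are assumptions made for illustration. The sketch assumes that uncertainty is measured by how often each state-action pair appears in the dataset, copies the baseline probability on rarely seen pairs, and moves the remaining probability mass to the best well-estimated action.

```python
import numpy as np

def spibb_greedy_step(q, pi_b, counts, n_wedge):
    """Hypothetical sketch of a Pi_b-SPIBB-style greedy policy update.

    q, pi_b, counts: arrays of shape (n_states, n_actions) holding the
    estimated action values, the baseline policy, and the dataset counts.
    n_wedge: assumed count threshold below which a pair is "uncertain".
    """
    q = np.asarray(q, dtype=float)
    pi_b = np.asarray(pi_b, dtype=float)
    uncertain = np.asarray(counts) < n_wedge

    # On uncertain pairs, keep exactly the baseline probability.
    pi = np.where(uncertain, pi_b, 0.0)

    # Mass the baseline placed on well-estimated actions is free to move.
    free_mass = 1.0 - pi.sum(axis=1)

    # Best action among the well-estimated ones (uncertain ones are masked).
    masked_q = np.where(uncertain, -np.inf, q)
    best = masked_q.argmax(axis=1)

    # States where every action is uncertain simply keep the baseline.
    rows = np.flatnonzero((~uncertain).any(axis=1))
    pi[rows, best[rows]] += free_mass[rows]
    return pi
```

In a state where every action is rarely observed, the returned policy equals the baseline; in a state where every action is well observed, it reduces to the usual greedy policy with respect to `q`.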
Author Information
Romain Laroche (Microsoft Research)
Paul TRICHELAIR (Mila - Quebec AI Institute/McGill University)
Remi Tachet des Combes (Microsoft Research Montreal)
Related Events (a corresponding poster, oral, or spotlight)
- 2019 Poster: Safe Policy Improvement with Baseline Bootstrapping
  Wed. Jun 12th 01:30 -- 04:00 AM Room Pacific Ballroom #101
More from the Same Authors
- 2023 Poster: On the Convergence of SARSA with Linear Function Approximation
  Shangtong Zhang · Remi Tachet des Combes · Romain Laroche
- 2023 Poster: On the Occupancy Measure of Non-Markovian Policies in Continuous MDPs
  Romain Laroche · Remi Tachet des Combes
- 2021 Poster: Decomposed Mutual Information Estimation for Contrastive Representation Learning
  Alessandro Sordoni · Nouha Dziri · Hannes Schulz · Geoff Gordon · Philip Bachman · Remi Tachet des Combes
- 2021 Spotlight: Decomposed Mutual Information Estimation for Contrastive Representation Learning
  Alessandro Sordoni · Nouha Dziri · Hannes Schulz · Geoff Gordon · Philip Bachman · Remi Tachet des Combes
- 2019: Poster discussion
  Roman Novak · Maxime Gabella · Frederic Dreyer · Siavash Golkar · Anh Tong · Irina Higgins · Mirco Milletari · Joe Antognini · Sebastian Goldt · Adín Ramírez Rivera · Roberto Bondesan · Ryo Karakida · Remi Tachet des Combes · Michael Mahoney · Nicholas Walker · Stanislav Fort · Samuel Smith · Rohan Ghosh · Aristide Baratin · Diego Granziol · Stephen Roberts · Dmitry Vetrov · Andrew Wilson · César Laurent · Valentin Thomas · Simon Lacoste-Julien · Dar Gilboa · Daniel Soudry · Anupam Gupta · Anirudh Goyal · Yoshua Bengio · Erich Elsen · Soham De · Stanislaw Jastrzebski · Charles H Martin · Samira Shabanian · Aaron Courville · Shorato Akaho · Lenka Zdeborova · Ethan Dyer · Maurice Weiler · Pim de Haan · Taco Cohen · Max Welling · Ping Luo · zhanglin peng · Nasim Rahaman · Loic Matthey · Danilo J. Rezende · Jaesik Choi · Kyle Cranmer · Lechao Xiao · Jaehoon Lee · Yasaman Bahri · Jeffrey Pennington · Greg Yang · Jiri Hron · Jascha Sohl-Dickstein · Guy Gur-Ari
- 2019: Convergence Properties of Neural Networks on Separable Data
  Remi Tachet des Combes