Bandit and reinforcement learning (RL) problems can often be framed as optimization problems where the goal is to maximize average performance while having access only to stochastic estimates of the true gradient. Traditionally, stochastic optimization theory predicts that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates. In this paper we demonstrate that the standard view is too limited for bandit and RL problems. To allow our analysis to be interpreted in light of multi-step MDPs, we focus on techniques derived from stochastic optimization principles (e.g., natural policy gradient and EXP3) and we show that some standard assumptions from optimization theory are violated in these problems. We present theoretical results showing that, at least for bandit problems, curvature and noise are not sufficient to explain the learning dynamics and that seemingly innocuous choices like the baseline can determine whether an algorithm converges. These theoretical findings match our empirical evaluation, which we extend to multi-state MDPs.
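To make the role of the baseline concrete, here is a minimal sketch of a stochastic policy gradient update on a small bandit with a softmax policy. It is an illustration under stated assumptions, not the authors' experimental setup: it uses the vanilla (not natural) policy gradient, and the arm rewards, step size, and function names (`softmax`, `expected_gradient`, `run_bandit`) are hypothetical. The sketch shows that a constant baseline leaves the expected gradient unchanged while still changing every sampled update, which is the kind of seemingly innocuous choice the abstract refers to.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def expected_gradient(theta, rewards, baseline):
    """Exact expectation of the sampled policy gradient for a given baseline."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for arm, p in enumerate(pi):
        advantage = rewards[arm] - baseline
        for k in range(len(theta)):
            indicator = 1.0 if k == arm else 0.0
            # d/d theta_k of log pi(arm) is 1[k == arm] - pi[k]
            grad[k] += p * advantage * (indicator - pi[k])
    return grad


def run_bandit(rewards, baseline, steps=2000, step_size=0.1, seed=0):
    """Run sampled policy gradient updates and return the final policy.

    All numeric settings here are illustrative assumptions.
    """
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)  # uniform initial policy
    for _ in range(steps):
        pi = softmax(theta)
        arm = rng.choices(range(len(rewards)), weights=pi)[0]
        advantage = rewards[arm] - baseline
        for k in range(len(theta)):
            indicator = 1.0 if k == arm else 0.0
            theta[k] += step_size * advantage * (indicator - pi[k])
    return softmax(theta)


if __name__ == "__main__":
    rewards = [1.0, 0.7, 0.0]  # hypothetical deterministic arm rewards
    theta = [0.2, -0.1, 0.5]
    # The expected gradient is the same for any constant baseline ...
    print(expected_gradient(theta, rewards, baseline=0.0))
    print(expected_gradient(theta, rewards, baseline=5.0))
    # ... but the sampled trajectories, and hence the final policies, can differ.
    for b in (-1.0, 0.5, 2.0):
        print(b, run_bandit(rewards, baseline=b))
```

Because both calls to `expected_gradient` agree, any difference between the runs comes only from how the baseline shapes the individual stochastic updates, not from a change in the objective being optimized.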
Author Information
Wesley Chung (Mila / McGill University)
I am a second-year PhD student co-supervised by Prof. David Meger and Prof. Doina Precup. My research interests lie in reinforcement learning and optimization.
Valentin Thomas (Mila)
Marlos C. Machado (DeepMind, University of Alberta)
Nicolas Le Roux (Google)
Related Events (a corresponding poster, oral, or spotlight)
- 2021 Spotlight: Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization
  Wed. Jul 21st 02:40 -- 02:45 AM
More from the Same Authors
- 2021: A functional mirror ascent view of policy gradient methods with function approximation
  Sharan Vaswani · Olivier Bachem · Simone Totaro · Matthieu Geist · Marlos C. Machado · Pablo Samuel Castro · Nicolas Le Roux
- 2021: Exploration-Driven Representation Learning in Reinforcement Learning
  Akram Erraqabi · Mingde Zhao · Marlos C. Machado · Yoshua Bengio · Sainbayar Sukhbaatar · Ludovic Denoyer · Alessandro Lazaric
- 2023: Decision-Aware Actor-Critic with Function Approximation and Theoretical Guarantees
  Sharan Vaswani · Amirreza Kazemi · Reza Babanezhad · Nicolas Le Roux
- 2023 Poster: Target-based Surrogates for Stochastic Optimization
  Jonathan Lavington · Sharan Vaswani · Reza Babanezhad · Mark Schmidt · Nicolas Le Roux
- 2019: Poster discussion
  Roman Novak · Maxime Gabella · Frederic Dreyer · Siavash Golkar · Anh Tong · Irina Higgins · Mirco Milletari · Joe Antognini · Sebastian Goldt · Adín Ramírez Rivera · Roberto Bondesan · Ryo Karakida · Remi Tachet des Combes · Michael Mahoney · Nicholas Walker · Stanislav Fort · Samuel Smith · Rohan Ghosh · Aristide Baratin · Diego Granziol · Stephen Roberts · Dmitry Vetrov · Andrew Wilson · César Laurent · Valentin Thomas · Simon Lacoste-Julien · Dar Gilboa · Daniel Soudry · Anupam Gupta · Anirudh Goyal · Yoshua Bengio · Erich Elsen · Soham De · Stanislaw Jastrzebski · Charles H Martin · Samira Shabanian · Aaron Courville · Shorato Akaho · Lenka Zdeborova · Ethan Dyer · Maurice Weiler · Pim de Haan · Taco Cohen · Max Welling · Ping Luo · zhanglin peng · Nasim Rahaman · Loic Matthey · Danilo J. Rezende · Jaesik Choi · Kyle Cranmer · Lechao Xiao · Jaehoon Lee · Yasaman Bahri · Jeffrey Pennington · Greg Yang · Jiri Hron · Jascha Sohl-Dickstein · Guy Gur-Ari
- 2019 Poster: Understanding the Impact of Entropy on Policy Optimization
  Zafarali Ahmed · Nicolas Le Roux · Mohammad Norouzi · Dale Schuurmans
- 2019 Oral: Understanding the Impact of Entropy on Policy Optimization
  Zafarali Ahmed · Nicolas Le Roux · Mohammad Norouzi · Dale Schuurmans
- 2019 Poster: The Value Function Polytope in Reinforcement Learning
  Robert Dadashi · Marc Bellemare · Adrien Ali Taiga · Nicolas Le Roux · Dale Schuurmans
- 2019 Oral: The Value Function Polytope in Reinforcement Learning
  Robert Dadashi · Marc Bellemare · Adrien Ali Taiga · Nicolas Le Roux · Dale Schuurmans