Oral
From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses
Daniil Tiapkin · Denis Belomestny · Eric Moulines · Alexey Naumov · Sergey Samsonov · Yunhao Tang · Michal Valko · Pierre MENARD
We propose the Bayes-UCBVI algorithm for reinforcement learning in tabular, stage-dependent, episodic Markov decision processes: a natural extension of the Bayes-UCB algorithm of Kaufmann et al. (2012) for multi-armed bandits. Our method uses the quantile of a Q-value function posterior as an upper confidence bound on the optimal Q-value function. For Bayes-UCBVI, we prove a regret bound of order $\tilde{\mathcal{O}}(\sqrt{H^3SAT})$, where $H$ is the length of one episode, $S$ the number of states, $A$ the number of actions, and $T$ the number of episodes. This matches the lower bound of $\Omega(\sqrt{H^3SAT})$ up to poly-$\log$ terms in $H,S,A,T$ for large enough $T$. To the best of our knowledge, this is the first algorithm that obtains an optimal dependence on the horizon $H$ (and $S$) without the need for an involved Bernstein-like bonus or noise. Crucial to our analysis is a new fine-grained anti-concentration bound for a weighted Dirichlet sum, which may be of independent interest. We then explain how Bayes-UCBVI can be easily extended beyond the tabular setting, exhibiting a strong link between our algorithm and the Bayesian bootstrap (Rubin, 1981).
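The central idea of the abstract, replacing an explicit exploration bonus with a posterior quantile of the next-stage value, can be illustrated with a short Python/NumPy sketch. This is a minimal sketch under stated assumptions, not the authors' implementation: the Dirichlet posterior with one optimistic pseudo-observation of value `v_max` and prior weight `n0`, the fixed quantile level `kappa`, the Monte Carlo estimate of the quantile, the assumption that mean rewards are known, and the function names `dirichlet_quantile_ucb` and `bayes_ucbvi_q` are all hypothetical choices made for illustration. Drawing Dirichlet weights over the observed next states is the Bayesian bootstrap of Rubin (1981).

```python
import numpy as np


def dirichlet_quantile_ucb(counts, next_values, v_max, kappa,
                           n0=1.0, n_samples=2000, rng=None):
    """Monte Carlo estimate of the kappa-quantile of sum_j w_j * x_j,
    where w ~ Dirichlet(n0, counts_1, ..., counts_S) and
    x = (v_max, next_values_1, ..., next_values_S).

    The first atom is a hypothetical optimistic pseudo-observation with
    prior weight n0 > 0 and value v_max (an assumption of this sketch).
    """
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.concatenate(([n0], np.asarray(counts, dtype=float)))
    values = np.concatenate(([v_max], np.asarray(next_values, dtype=float)))
    # Dirichlet parameters must be positive: drop next states never observed.
    mask = alpha > 0
    alpha, values = alpha[mask], values[mask]
    if alpha.size == 1:
        # A single atom carries all the mass: the weighted sum is deterministic.
        return float(values[0])
    # Each row of `weights` is one Bayesian-bootstrap reweighting of the atoms.
    weights = rng.dirichlet(alpha, size=n_samples)
    sums = weights @ values
    return float(np.quantile(sums, kappa))


def bayes_ucbvi_q(counts, rewards, kappa=0.9, n0=1.0, n_samples=2000, seed=0):
    """Optimistic Q-values by backward induction (illustrative sketch).

    counts[h, s, a, s2]: number of observed transitions s -> s2 under a at stage h.
    rewards[h, s, a]: mean reward in [0, 1], assumed known here for simplicity.
    """
    rng = np.random.default_rng(seed)
    H, S, A, _ = counts.shape
    Q = np.zeros((H, S, A))
    v_next = np.zeros(S)  # value after the last stage is zero
    for h in reversed(range(H)):
        v_max = H - h - 1  # largest achievable value from stage h + 1 onward
        for s in range(S):
            for a in range(A):
                ucb = dirichlet_quantile_ucb(counts[h, s, a], v_next, v_max,
                                             kappa, n0, n_samples, rng)
                # Clip at the largest achievable return from stage h onward.
                Q[h, s, a] = min(rewards[h, s, a] + ucb, H - h)
        v_next = Q[h].max(axis=1)  # greedy value at stage h
    return Q
```

With no data, all posterior mass sits on the optimistic pseudo-observation, so unvisited state-action pairs receive the maximal value. As counts grow, the Dirichlet posterior concentrates on the empirical transition frequencies and the quantile approaches the empirical expectation of the next-stage value, so optimism fades without any explicit bonus term.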
Author Information
Daniil Tiapkin (HSE University)
Denis Belomestny (Universitaet Duisburg-Essen)
Eric Moulines (Ecole Polytechnique)
Alexey Naumov (National Research University Higher School of Economics)
Sergey Samsonov (National Research University Higher School of Economics)
Yunhao Tang (DeepMind)
Michal Valko (DeepMind / Inria / ENS Paris-Saclay)
Pierre MENARD (OvGU)
Related Events (a corresponding poster, oral, or spotlight)
2022 Poster: From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses
More from the Same Authors
2021: Marginalized Operators for Off-Policy Reinforcement Learning
Yunhao Tang · Mark Rowland · Remi Munos · Michal Valko
2021: Model-Free Approach to Evaluate Reinforcement Learning Algorithms
Denis Belomestny · Ilya Levin · Eric Moulines · Alexey Naumov · Sergey Samsonov · Veronika Zorina
2021: Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret
Jean Tarbouriech · Simon Du · Matteo Pirotta · Michal Valko · Alessandro Lazaric
2022 Poster: Biased Gradient Estimate with Drastic Variance Reduction for Meta Reinforcement Learning
Yunhao Tang
2022 Spotlight: Biased Gradient Estimate with Drastic Variance Reduction for Meta Reinforcement Learning
Yunhao Tang
2022 Poster: Scaling Gaussian Process Optimization by Evaluating a Few Unique Candidates Multiple Times
Daniele Calandriello · Luigi Carratino · Alessandro Lazaric · Michal Valko · Lorenzo Rosasco
2022 Spotlight: Scaling Gaussian Process Optimization by Evaluating a Few Unique Candidates Multiple Times
Daniele Calandriello · Luigi Carratino · Alessandro Lazaric · Michal Valko · Lorenzo Rosasco
2022 Poster: Retrieval-Augmented Reinforcement Learning
Anirudh Goyal · Abe Friesen · Andrea Banino · Theophane Weber · Nan Rosemary Ke · Adrià Puigdomenech Badia · Arthur Guez · Mehdi Mirza · Peter Humphreys · Ksenia Konyushkova · Michal Valko · Simon Osindero · Timothy Lillicrap · Nicolas Heess · Charles Blundell
2022 Spotlight: Retrieval-Augmented Reinforcement Learning
Anirudh Goyal · Abe Friesen · Andrea Banino · Theophane Weber · Nan Rosemary Ke · Adrià Puigdomenech Badia · Arthur Guez · Mehdi Mirza · Peter Humphreys · Ksenia Konyushkova · Michal Valko · Simon Osindero · Timothy Lillicrap · Nicolas Heess · Charles Blundell
2022 Poster: Diffusion bridges vector quantized variational autoencoders
Max Cohen · Guillaume QUISPE · Sylvain Le Corff · Charles Ollion · Eric Moulines
2022 Spotlight: Diffusion bridges vector quantized variational autoencoders
Max Cohen · Guillaume QUISPE · Sylvain Le Corff · Charles Ollion · Eric Moulines
2021 Poster: Monte Carlo Variational Auto-Encoders
Achille Thin · Nikita Kotelevskii · Arnaud Doucet · Alain Durmus · Eric Moulines · Maxim Panov
2021 Poster: Problem Dependent View on Structured Thresholding Bandit Problems
James Cheshire · Pierre MENARD · Alexandra Carpentier
2021 Spotlight: Problem Dependent View on Structured Thresholding Bandit Problems
James Cheshire · Pierre MENARD · Alexandra Carpentier
2021 Spotlight: Monte Carlo Variational Auto-Encoders
Achille Thin · Nikita Kotelevskii · Arnaud Doucet · Alain Durmus · Eric Moulines · Maxim Panov
2021 Poster: DG-LMC: A Turn-key and Scalable Synchronous Distributed MCMC Algorithm via Langevin Monte Carlo within Gibbs
Vincent Plassier · Maxime Vono · Alain Durmus · Eric Moulines
2021 Oral: DG-LMC: A Turn-key and Scalable Synchronous Distributed MCMC Algorithm via Langevin Monte Carlo within Gibbs
Vincent Plassier · Maxime Vono · Alain Durmus · Eric Moulines
2021 Poster: Fast active learning for pure exploration in reinforcement learning
Pierre MENARD · Omar Darwiche Domingues · Anders Jonsson · Emilie Kaufmann · Edouard Leurent · Michal Valko
2021 Poster: UCB Momentum Q-learning: Correcting the bias without forgetting
Pierre MENARD · Omar Darwiche Domingues · Xuedong Shang · Michal Valko
2021 Spotlight: Fast active learning for pure exploration in reinforcement learning
Pierre MENARD · Omar Darwiche Domingues · Anders Jonsson · Emilie Kaufmann · Edouard Leurent · Michal Valko
2021 Oral: UCB Momentum Q-learning: Correcting the bias without forgetting
Pierre MENARD · Omar Darwiche Domingues · Xuedong Shang · Michal Valko
2021 Poster: Kernel-Based Reinforcement Learning: A Finite-Time Analysis
Omar Darwiche Domingues · Pierre Menard · Matteo Pirotta · Emilie Kaufmann · Michal Valko
2021 Poster: Online A-Optimal Design and Active Linear Regression
Xavier Fontaine · Pierre Perrault · Michal Valko · Vianney Perchet
2021 Spotlight: Kernel-Based Reinforcement Learning: A Finite-Time Analysis
Omar Darwiche Domingues · Pierre Menard · Matteo Pirotta · Emilie Kaufmann · Michal Valko
2021 Spotlight: Online A-Optimal Design and Active Linear Regression
Xavier Fontaine · Pierre Perrault · Michal Valko · Vianney Perchet
2021 Poster: Revisiting Peng's Q($\lambda$) for Modern Reinforcement Learning
Tadashi Kozuno · Yunhao Tang · Mark Rowland · Remi Munos · Steven Kapturowski · Will Dabney · Michal Valko · David Abel
2021 Poster: Taylor Expansion of Discount Factors
Yunhao Tang · Mark Rowland · Remi Munos · Michal Valko
2021 Spotlight: Taylor Expansion of Discount Factors
Yunhao Tang · Mark Rowland · Remi Munos · Michal Valko
2021 Spotlight: Revisiting Peng's Q($\lambda$) for Modern Reinforcement Learning
Tadashi Kozuno · Yunhao Tang · Mark Rowland · Remi Munos · Steven Kapturowski · Will Dabney · Michal Valko · David Abel
2021 Poster: Counterfactual Credit Assignment in Model-Free Reinforcement Learning
Thomas Mesnard · Theophane Weber · Fabio Viola · Shantanu Thakoor · Alaa Saade · Anna Harutyunyan · Will Dabney · Thomas Stepleton · Nicolas Heess · Arthur Guez · Eric Moulines · Marcus Hutter · Lars Buesing · Remi Munos
2021 Spotlight: Counterfactual Credit Assignment in Model-Free Reinforcement Learning
Thomas Mesnard · Theophane Weber · Fabio Viola · Shantanu Thakoor · Alaa Saade · Anna Harutyunyan · Will Dabney · Thomas Stepleton · Nicolas Heess · Arthur Guez · Eric Moulines · Marcus Hutter · Lars Buesing · Remi Munos
2020 Poster: Monte-Carlo Tree Search as Regularized Policy Optimization
Jean-Bastien Grill · Florent Altché · Yunhao Tang · Thomas Hubert · Michal Valko · Ioannis Antonoglou · Remi Munos
2020 Poster: Improved Sleeping Bandits with Stochastic Action Sets and Adversarial Rewards
Aadirupa Saha · Pierre Gaillard · Michal Valko
2020 Poster: Gamification of Pure Exploration for Linear Bandits
Rémy Degenne · Pierre Menard · Xuedong Shang · Michal Valko
2020 Poster: Stochastic bandits with arm-dependent delays
Anne Gael Manegueu · Claire Vernade · Alexandra Carpentier · Michal Valko
2020 Poster: Fast and Consistent Learning of Hidden Markov Models by Incorporating Non-Consecutive Correlations
Robert Mattila · Cristian R. Rojas · Eric Moulines · Vikram Krishnamurthy · Bo Wahlberg
2020 Poster: Learning to Score Behaviors for Guided Policy Optimization
Aldo Pacchiano · Jack Parker-Holder · Yunhao Tang · Krzysztof Choromanski · Anna Choromanska · Michael Jordan
2020 Poster: Budgeted Online Influence Maximization
Pierre Perrault · Jennifer Healey · Zheng Wen · Michal Valko
2020 Poster: Reinforcement Learning for Integer Programming: Learning to Cut
Yunhao Tang · Shipra Agrawal · Yuri Faenza
2020 Poster: Near-linear time Gaussian process optimization with adaptive batching and resparsification
Daniele Calandriello · Luigi Carratino · Alessandro Lazaric · Michal Valko · Lorenzo Rosasco
2020 Poster: Taylor Expansion Policy Optimization
Yunhao Tang · Michal Valko · Remi Munos
2019: poster session I
Nicholas Rhinehart · Yunhao Tang · Vinay Prabhu · Dian Ang Yap · Alexander Wang · Marc Finzi · Manoj Kumar · You Lu · Abhishek Kumar · Qi Lei · Michael Przystupa · Nicola De Cao · Polina Kirichenko · Pavel Izmailov · Andrew Wilson · Jakob Kruse · Diego Mesquita · Mario Lezcano Casado · Thomas Müller · Keir Simmons · Andrei Atanov