Talk
Minimax Regret Bounds for Reinforcement Learning
Mohammad Gheshlaghi Azar · Ian Osband · Remi Munos
C4.5
[
Abstract
]
Abstract:
We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs.
We show that an optimistic modification to value iteration achieves a regret bound of $\tilde {O}( \sqrt{HSAT} + H^2S^2A+H\sqrt{T})$ where $H$ is the time horizon, $S$ the number of states, $A$ the number of actions and $T$ the number of time-steps.
This result improves over the best previous known bound $\tilde {O}(HS \sqrt{AT})$ achieved by the UCRL2 algorithm.
The key significance of our new results is that when $T\geq H^3S^3A$ and $SA\geq H$, it leads to a regret of $\tilde{O}(\sqrt{HSAT})$ that matches the established lower bound of $\Omega(\sqrt{HSAT})$ up to a logarithmic factor.
Our analysis contain two key insights.
We use careful application of concentration inequalities to the optimal value function as a whole, rather than to the transitions probabilities (to improve scaling in $S$), and we use "exploration bonuses" built from Bernstein's inequality, together with using a recursive -Bellman-type- Law of Total Variance (to improve scaling in $H$).
Live content is unavailable. Log in and register to view live content