

Talk in Workshop: Theoretical Foundations of Reinforcement Learning

Exploration, Policy Gradient Methods, and the Deadly Triad - Sham Kakade

Abstract:

Practical reinforcement learning algorithms often face the "deadly triad" [Rich Sutton, 2015]: function approximation, data efficiency (e.g. by bootstrapping value function estimates), and exploration (e.g. by off-policy learning). Algorithms that address two of these without the third are often stable, while trying to address all three leads to highly unstable algorithms in practice. This talk considers a policy gradient approach to alleviate these issues. In particular, we introduce the Policy Cover-Policy Gradient (PC-PG) algorithm, which provably balances the exploration vs. exploitation tradeoff with polynomial sample complexity, using an ensemble of learned policies (the policy cover). We quantify the relevant notion of function approximation through an approximation error term measured under distribution shift. Furthermore, we will provide simple examples where a number of standard (and provable) RL approaches are less robust with respect to function approximation. Time permitting, we will discuss the implications this has for more effective data re-use.
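For readers who want a concrete picture of the PC-PG loop, the following Python sketch runs the idea on a toy hard-exploration chain. The environment, one-hot state features, the state-based elliptical bonus, the LAMBDA and THRESHOLD constants, and the plain-REINFORCE inner loop are all simplifying assumptions made here for illustration; the algorithm in the paper uses natural policy gradient updates, state-action features, and general function approximation, so treat this as a structural sketch rather than the authors' implementation.

```python
# Illustrative PC-PG-style loop on a toy chain MDP (not the paper's implementation).
import numpy as np

N_STATES, N_ACTIONS, HORIZON = 8, 2, 16   # toy problem sizes (assumed)
rng = np.random.default_rng(0)

def step(s, a):
    """Chain MDP: action 1 moves right, action 0 resets; reward only at the far end."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else 0
    return s_next, (1.0 if s_next == N_STATES - 1 else 0.0)

def phi(s):
    """One-hot state features (a stand-in for general linear features)."""
    v = np.zeros(N_STATES)
    v[s] = 1.0
    return v

def policy(theta, s):
    """Tabular softmax policy over actions at state s."""
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def rollout(theta, bonus=None):
    """One episode; optionally augment the reward with an exploration bonus."""
    s, traj = 0, []
    for _ in range(HORIZON):
        a = rng.choice(N_ACTIONS, p=policy(theta, s))
        s_next, r = step(s, a)
        if bonus is not None:
            r += bonus[s_next]
        traj.append((s, a, r))
        s = s_next
    return traj

def pg_subroutine(theta, bonus, iters=200, lr=0.5):
    """Crude REINFORCE update standing in for the paper's natural policy gradient."""
    theta = theta.copy()
    for _ in range(iters):
        traj = rollout(theta, bonus)
        ret = sum(r for _, _, r in traj)
        for s, a, _ in traj:
            grad = -policy(theta, s)   # grad of log softmax: e_a - pi(.|s)
            grad[a] += 1.0
            theta[s] += lr * ret * grad / len(traj)
    return theta

cover = [np.zeros((N_STATES, N_ACTIONS))]   # policy cover, seeded with a uniform policy
LAMBDA, THRESHOLD = 1.0, 0.5                # assumed regularizer and bonus threshold

for epoch in range(5):
    # 1) Estimate feature coverage of the uniform mixture over the current cover.
    sigma = LAMBDA * np.eye(N_STATES)
    for _ in range(50):
        theta = cover[rng.integers(len(cover))]
        for s, _, _ in rollout(theta):
            sigma += np.outer(phi(s), phi(s))
    sigma_inv = np.linalg.inv(sigma)
    # 2) Give a reward bonus to poorly covered states (elliptical-width test).
    bonus = np.array([1.0 if phi(s) @ sigma_inv @ phi(s) >= THRESHOLD else 0.0
                      for s in range(N_STATES)])
    # 3) Optimize reward + bonus with the policy-gradient subroutine; grow the cover.
    cover.append(pg_subroutine(cover[-1], bonus))

print("states visited by the final policy:",
      sorted({s for s, _, _ in rollout(cover[-1])}))
```

The structural point the sketch is meant to convey is that exploration comes from the cover itself: each epoch rewards the learner for reaching states that the mixture of previously learned policies covers poorly, and the newly trained policy is then added back into the ensemble.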

Joint work with: Alekh Agarwal, Mikael Henaff, and Wen Sun.
