Exploration, Policy Gradient Methods, and the Deadly Triad - Sham Kakade
Sham Kakade

Fri Jul 17 06:30 AM -- 07:15 AM (PDT)

Practical reinforcement learning algorithms often face the "deadly triad" [Rich Sutton, 2015]: function approximation, data efficiency (e.g. by bootstrapping value function estimates), and exploration (e.g. by off-policy learning). Algorithms which address two without the third are often ok, while trying to address all three leads to highly unstable algorithms in practice. This talk considers a policy gradient approach to alleviate these issues. In particular, we introduce the Policy Cover-Policy Gradient (PC-PG) algorithm, which provably balances the exploration vs. exploitation tradeoff, with polynomial sample complexity, using an ensemble of learned policies (the policy cover). We quantify how the relevant notion of function approximation is based on an approximation error term, under distribution shift. Furthermore, we will provide simple examples where a number of standard (and provable) RL approaches are less robust when it comes to function approximation. Time permitting, we will discuss the implications this has for more effective data re-use.

Joint work with: Alekh Agarwal, Mikael Henaff, and Wen Sun.

Author Information

Sham Kakade (University of Washington)

Sham Kakade is a Washington Research Foundation Data Science Chair, with a joint appointment in the Department of Computer Science and the Department of Statistics at the University of Washington, and is a co-director for the Algorithmic Foundations of Data Science Institute. He works on the mathematical foundations of machine learning and AI. Sham's thesis helped lay the foundations of the PAC-MDP framework for reinforcement learning. With his collaborators, his additional contributions include: one of the first provably efficient policy search methods, Conservative Policy Iteration, for reinforcement learning; developing the mathematical foundations for the widely used linear bandit models and the Gaussian process bandit models; the tensor and spectral methodologies for provable estimation of latent variable models (applicable to mixtures of Gaussians, HMMs, and LDA); and the first sharp analysis of the perturbed gradient descent algorithm, along with the design and analysis of numerous other convex and non-convex algorithms. He is the recipient of the IBM Goldberg best paper award (2007) for contributions to fast nearest neighbor search and of the INFORMS Revenue Management and Pricing Section Prize for best paper (2014). He was program chair for COLT 2011. Sham was an undergraduate at Caltech, where he studied physics and worked under the guidance of John Preskill in quantum computing. He then completed his Ph.D. in computational neuroscience at the Gatsby Unit at University College London, under the supervision of Peter Dayan. He was a postdoc at the Department of Computer Science, University of Pennsylvania, where he broadened his studies to include computational game theory and economics under the guidance of Michael Kearns. Sham has been a Principal Research Scientist at Microsoft Research, New England, an associate professor at the Department of Statistics, Wharton, UPenn, and an assistant professor at the Toyota Technological Institute at Chicago.

More from the Same Authors