Poster
The Mirage of Action-Dependent Baselines in Reinforcement Learning
George Tucker · Surya Bhupatiraju · Shixiang Gu · Richard E Turner · Zoubin Ghahramani · Sergey Levine

Thu Jul 12th 06:15 -- 09:00 PM @ Hall B #30

Policy gradient methods are a widely used class of model-free reinforcement learning algorithms where a state-dependent baseline is used to reduce gradient estimator variance. Several recent papers extend the baseline to depend on both the state and action and suggest that this significantly reduces variance and improves sample efficiency without introducing bias into the gradient estimates. To better understand this development, we decompose the variance of the policy gradient estimator and numerically show that learned state-action-dependent baselines do not in fact reduce variance over a state-dependent baseline in commonly tested benchmark domains. We confirm this unexpected result by reviewing the open-source code accompanying these prior papers, and show that subtle implementation decisions cause deviations from the methods presented in the papers and explain the source of the previously observed empirical gains. Furthermore, the variance decomposition highlights areas for improvement, which we demonstrate by illustrating a simple change to the typical value function parameterization that can significantly improve performance.

Author Information

George Tucker (Google Brain)
Surya Bhupatiraju (Google Brain)
Shixiang Gu (Cambridge)
Richard E Turner (University of Cambridge)

Richard Turner holds a Lectureship (equivalent to US Assistant Professor) in Computer Vision and Machine Learning in the Computational and Biological Learning Lab, Department of Engineering, University of Cambridge, UK. He is a Fellow of Christ's College Cambridge. Previously, he held an EPSRC Postdoctoral research fellowship which he spent at both the University of Cambridge and the Laboratory for Computational Vision, NYU, USA. He has a PhD degree in Computational Neuroscience and Machine Learning from the Gatsby Computational Neuroscience Unit, UCL, UK and a M.Sci. degree in Natural Sciences (specialism Physics) from the University of Cambridge, UK. His research interests include machine learning, signal processing and developing probabilistic models of perception.

Zoubin Ghahramani (University of Cambridge & Uber)

Zoubin Ghahramani is a Professor at the University of Cambridge, and Chief Scientist at Uber. He is also Deputy Director of the Leverhulme Centre for the Future of Intelligence, was a founding Director of the Alan Turing Institute and co-founder of Geometric Intelligence (now Uber AI Labs). His research focuses on probabilistic approaches to machine learning and AI. In 2015 he was elected a Fellow of the Royal Society.

Sergey Levine (Berkeley)
Sergey Levine

Sergey Levine received a BS and MS in Computer Science from Stanford University in 2009, and a Ph.D. in Computer Science from Stanford University in 2014. He joined the faculty of the Department of Electrical Engineering and Computer Sciences at UC Berkeley in fall 2016. His work focuses on machine learning for decision making and control, with an emphasis on deep learning and reinforcement learning algorithms. Applications of his work include autonomous robots and vehicles, as well as computer vision and graphics. His research includes developing algorithms for end-to-end training of deep neural network policies that combine perception and control, scalable algorithms for inverse reinforcement learning, deep reinforcement learning algorithms, and more.

Related Events (a corresponding poster, oral, or spotlight)

More from the Same Authors