Actor-critic (AC) methods are widely used in reinforcement learning (RL) and benefit from the flexibility of pairing any policy gradient method as the actor with any value-based method as the critic. The critic is usually trained by minimizing the TD error, an objective that is potentially decorrelated from the true goal of achieving a high reward with the actor. We address this mismatch by designing a joint objective for training the actor and critic in a decision-aware fashion. We use the proposed objective to design a generic AC algorithm that can easily handle any function approximation. We explicitly characterize the conditions under which the resulting algorithm guarantees monotonic policy improvement, regardless of the choice of the policy and critic parameterization. Instantiating the generic algorithm results in an actor that involves maximizing a sequence of surrogate functions (similar to TRPO and PPO), and a critic that involves minimizing a closely connected objective. Using simple bandit examples, we provably establish the benefit of the proposed critic objective over the standard squared error. Finally, we empirically demonstrate the benefit of our decision-aware actor-critic framework on simple RL problems.
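For context, the sketch below (in PyTorch, with placeholder names such as `policy`, `value_fn`, and the two optimizers) shows the standard actor-critic update that the abstract contrasts against: the critic is fit by minimizing the squared TD error while the actor follows the policy gradient, so the two components optimize different objectives. This is not the decision-aware objective proposed in the paper, only the conventional baseline it argues against.

```python
# Minimal sketch (NOT the paper's method): a standard one-step actor-critic
# update where the critic minimizes the squared TD error and the actor follows
# the policy gradient. All names are placeholders for illustration.
import torch

def standard_ac_step(policy, value_fn, actor_opt, critic_opt,
                     s, a, r, s_next, gamma=0.99):
    # Critic step: minimize the squared TD error (a decision-agnostic objective).
    with torch.no_grad():
        td_target = r + gamma * value_fn(s_next).squeeze(-1)
    critic_loss = (td_target - value_fn(s).squeeze(-1)).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor step: policy gradient, using the TD error as an advantage estimate.
    with torch.no_grad():
        advantage = td_target - value_fn(s).squeeze(-1)
    log_prob = policy(s).log_prob(a)  # policy(s) is assumed to return a torch distribution
    actor_loss = -(log_prob * advantage).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```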
Author Information
Sharan Vaswani (Simon Fraser University)
Amirreza Kazemi (Simon Fraser University)
Reza Babanezhad (Samsung)
Nicolas Le Roux (Microsoft)
More from the Same Authors
- 2021: A functional mirror ascent view of policy gradient methods with function approximation
  Sharan Vaswani · Olivier Bachem · Simone Totaro · Matthieu Geist · Marlos C. Machado · Pablo Samuel Castro · Nicolas Le Roux
- 2023 Poster: Fast Online Node Labeling for Very Large Graphs
  Baojian Zhou · Yifan Sun · Reza Babanezhad
- 2023 Poster: Target-based Surrogates for Stochastic Optimization
  Jonathan Lavington · Sharan Vaswani · Reza Babanezhad · Mark Schmidt · Nicolas Le Roux
- 2022 Poster: Towards Noise-adaptive, Problem-adaptive (Accelerated) Stochastic Gradient Descent
  Sharan Vaswani · Benjamin Dubois-Taine · Reza Babanezhad
- 2022 Oral: Towards Noise-adaptive, Problem-adaptive (Accelerated) Stochastic Gradient Descent
  Sharan Vaswani · Benjamin Dubois-Taine · Reza Babanezhad
- 2021 Poster: Infinite-Dimensional Optimization for Zero-Sum Games via Variational Transport
  Lewis Liu · Yufeng Zhang · Zhuoran Yang · Reza Babanezhad · Zhaoran Wang
- 2021 Spotlight: Infinite-Dimensional Optimization for Zero-Sum Games via Variational Transport
  Lewis Liu · Yufeng Zhang · Zhuoran Yang · Reza Babanezhad · Zhaoran Wang
- 2021 Poster: Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization
  Wesley Chung · Valentin Thomas · Marlos C. Machado · Nicolas Le Roux
- 2021 Spotlight: Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization
  Wesley Chung · Valentin Thomas · Marlos C. Machado · Nicolas Le Roux
- 2019 Poster: Understanding the Impact of Entropy on Policy Optimization
  Zafarali Ahmed · Nicolas Le Roux · Mohammad Norouzi · Dale Schuurmans
- 2019 Oral: Understanding the Impact of Entropy on Policy Optimization
  Zafarali Ahmed · Nicolas Le Roux · Mohammad Norouzi · Dale Schuurmans
- 2019 Poster: The Value Function Polytope in Reinforcement Learning
  Robert Dadashi · Marc Bellemare · Adrien Ali Taiga · Nicolas Le Roux · Dale Schuurmans
- 2019 Oral: The Value Function Polytope in Reinforcement Learning
  Robert Dadashi · Marc Bellemare · Adrien Ali Taiga · Nicolas Le Roux · Dale Schuurmans