Timezone: »
SARSA, a classical on-policy control algorithm for reinforcement learning, is known to chatter when combined with linear function approximation: SARSA does not diverge but oscillates in a bounded region. However, little is known about how fast SARSA converges to that region and how large the region is. In this paper, we make progress towards this open problem by showing the convergence rate of projected SARSA to a bounded region. Importantly, the region is much smaller than the region that we project into, provided that the the magnitude of the reward is not too large. Existing works regarding the convergence of linear SARSA to a fixed point all require the Lipschitz constant of SARSA's policy improvement operator to be sufficiently small; our analysis instead applies to arbitrary Lipschitz constants and thus characterizes the behavior of linear SARSA for a new regime.
Author Information
Shangtong Zhang (University of Virginia)
Remi Tachet des Combes (AlpacaML)
Romain Laroche (None)
More from the Same Authors
-
2023 Poster: Principled Offline RL in the Presence of Rich Exogenous Information »
Riashat Islam · Manan Tomar · Alex Lamb · Yonathan Efroni · Hongyu Zang · Aniket Didolkar · Dipendra Misra · Xin Li · Harm Seijen · Remi Tachet des Combes · John Langford -
2023 Poster: On the Occupancy Measure of Non-Markovian Policies in Continuous MDPs »
Romain Laroche · Remi Tachet des Combes -
2021 Poster: Average-Reward Off-Policy Policy Evaluation with Function Approximation »
Shangtong Zhang · Yi Wan · Richard Sutton · Shimon Whiteson -
2021 Spotlight: Average-Reward Off-Policy Policy Evaluation with Function Approximation »
Shangtong Zhang · Yi Wan · Richard Sutton · Shimon Whiteson -
2021 Spotlight: Breaking the Deadly Triad with a Target Network »
Shangtong Zhang · Hengshuai Yao · Shimon Whiteson -
2021 Poster: Breaking the Deadly Triad with a Target Network »
Shangtong Zhang · Hengshuai Yao · Shimon Whiteson -
2020 Poster: Provably Convergent Two-Timescale Off-Policy Actor-Critic with Function Approximation »
Shangtong Zhang · Bo Liu · Hengshuai Yao · Shimon Whiteson -
2020 Poster: GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values »
Shangtong Zhang · Bo Liu · Shimon Whiteson -
2019 Poster: Safe Policy Improvement with Baseline Bootstrapping »
Romain Laroche · Paul TRICHELAIR · Remi Tachet des Combes -
2019 Oral: Safe Policy Improvement with Baseline Bootstrapping »
Romain Laroche · Paul TRICHELAIR · Remi Tachet des Combes