ICML Poster Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP

Liyu Chen · Rahul Jain · Haipeng Luo

Hall E #812

Keywords: [ RL: Total Cost/Reward ] [ T: Reinforcement Learning and Planning ] [ T: Online Learning and Bandits ] [ RL: Function Approximation ] [ RL: Online ] [ Reinforcement Learning ]

Abstract: We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP that significantly improve over the only existing results of (Vial et al., 2021). Our first algorithm is computationally efficient and achieves a regret bound $O(\sqrt{d^3B_{\star}^2T_{\star} K})$, where $d$ is the dimension of the feature space, $B_{\star}$ and $T_{\star}$ are upper bounds of the expected costs and hitting time of the optimal policy respectively, and $K$ is the number of episodes. The same algorithm with a slight modification also achieves logarithmic regret of order $O(\frac{d^3B_{\star}^4}{c_{\min}^2\text{gap}_{\min}}\ln^5\frac{dB_{\star} K}{c_{\min}})$, where $\text{gap}_{\min}$ is the minimum sub-optimality gap and $c_{\min}$ is the minimum cost over all state-action pairs. Our result is obtained by developing a simpler and improved analysis for the finite-horizon approximation of (Cohen et al., 2021) with a smaller approximation error, which might be of independent interest. On the other hand, using variance-aware confidence sets in a global optimization problem, our second algorithm is computationally inefficient but achieves the first "horizon-free" regret bound $O(d^{3.5}B_{\star}\sqrt{K})$ with no polynomial dependency on $T_{\star}$ or $1/c_{\min}$, almost matching the $\Omega(dB_{\star}\sqrt{K})$ lower bound from (Min et al., 2021).
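
As a rough reading aid (a back-of-the-envelope comparison that is not taken from the paper, and that ignores constants and any logarithmic factors hidden in the $O(\cdot)$ notation), the two $\sqrt{K}$-type bounds quoted in the abstract can be compared directly:

$$
\sqrt{d^3B_{\star}^2T_{\star}K} = d^{1.5}B_{\star}\sqrt{T_{\star}K},
\qquad
\frac{d^{1.5}B_{\star}\sqrt{T_{\star}K}}{d^{3.5}B_{\star}\sqrt{K}} = \frac{\sqrt{T_{\star}}}{d^{2}},
\qquad
\frac{d^{3.5}B_{\star}\sqrt{K}}{dB_{\star}\sqrt{K}} = d^{2.5}.
$$

Under these simplifications, the horizon-free bound is the smaller of the two roughly when $T_{\star} > d^{4}$, and it sits a factor of $d^{2.5}$ above the $\Omega(dB_{\star}\sqrt{K})$ lower bound, which is the sense in which the match is only approximate in $d$.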
