ICML 2022 Learning Stochastic Shortest Path with Linear Function Approximation Spotlight

Spotlight

Learning Stochastic Shortest Path with Linear Function Approximation

Yifei Min · Jiafan He · Tianhao Wang · Quanquan Gu

Room 310

[ Abstract ] [ Visit Optimization/Reinforcement Learning ]

[ Slides] [ Paper PDF]

Abstract: We study the stochastic shortest path (SSP) problem in reinforcement learning with linear function approximation, where the transition kernel is represented as a linear mixture of unknown models. We call this class of SSP problems as linear mixture SSPs. We propose a novel algorithm with Hoeffding-type confidence sets for learning the linear mixture SSP, which can attain an

\tilde{O} (d B_{⋆}^{1.5} \sqrt{K / c_{min}})

$\tilde{\mathcal{O}}(d B_{\star}^{1.5}\sqrt{K/c_{\min}})$ regret. Here

K

$K$ is the number of episodes,

d

$d$ is the dimension of the feature mapping in the mixture model,

B_{⋆}

$B_{\star}$ bounds the expected cumulative cost of the optimal policy, and

c_{min} > 0

$c_{\min}>0$ is the lower bound of the cost function.Our algorithm also applies to the case when

c_{min} = 0

$c_{\min} = 0$ , and an

\tilde{O} (K^{2 / 3})

$\tilde{\mathcal{O}}(K^{2/3})$ regret is guaranteed. To the best of our knowledge, this is the first algorithm with a sublinear regret guarantee for learning linear mixture SSP. Moreover, we design a refined Bernstein-type confidence set and propose an improved algorithm, which provably achieves an

\tilde{O} (d B_{⋆} \sqrt{K / c_{min}})

$\tilde{\mathcal{O}}(d B_{\star}\sqrt{K/c_{\min}})$ regret.In complement to the regret upper bounds, we also prove a lower bound of

Ω (d B_{⋆} \sqrt{K})

$\Omega(dB_{\star} \sqrt{K})$ . Hence, our improved algorithm matches the lower bound up to a

1 / \sqrt{c_{min}}

$1/\sqrt{c_{\min}}$ factor and poly-logarithmic factors, achieving a near-optimal regret guarantee.

Chat is not available.