Efficiently Solving Discounted MDPs via Predictions with Unknown Prediction Errors
Lixing Lyu ⋅ Jiashuo Jiang ⋅ Wang Chi Cheung
Abstract
We study infinite-horizon discounted Markov decision processes (DMDPs) under a generative model. Motivated by the Algorithms with Advice framework (Mitzenmacher and Vassilvitskii, 2022), we propose a novel framework for investigating how black-box predictions of the transition matrix can enhance sample efficiency in solving DMDPs and improve sample complexity bounds. We focus on DMDPs with $N$ state–action pairs and discount factor $\gamma$. We first provide an impossibility result showing that, in the presence of predictions with unknown accuracy, no sampling policy can compute an $\epsilon$-optimal policy with a sample complexity better than $\tilde{O}((1-\gamma)^{-3} N \epsilon^{-2})$, which matches the state-of-the-art minimax sample complexity bound without prediction. Complementing this impossibility result, we design an algorithm based on minimax optimization techniques that leverages predictions of the transition matrix without requiring knowledge of the prediction error. Our algorithm achieves a sample complexity bound that depends on the prediction error and is uniformly better than $\tilde{O}((1-\gamma)^{-4} N \epsilon^{-2})$, the previous best result derived from convex optimization methods. In some cases, our bound even improves upon the state-of-the-art $\tilde{O}((1-\gamma)^{-3} N \epsilon^{-2})$, despite not having access to the prediction quality.
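To give a sense of the gap between the two rates quoted above, the following minimal sketch evaluates the $\tilde{O}((1-\gamma)^{-3} N \epsilon^{-2})$ and $\tilde{O}((1-\gamma)^{-4} N \epsilon^{-2})$ bounds for illustrative parameter values (the function name `bound` and the specific numbers $N$, $\gamma$, $\epsilon$ are our choices, not from the paper, and logarithmic factors and absolute constants are ignored, so only the scaling is meaningful):

```python
def bound(N, gamma, eps, horizon_power):
    """Sample complexity of order N * eps^{-2} * (1 - gamma)^{-horizon_power},
    with log factors and constants dropped (illustrative only)."""
    return N * eps ** -2 * (1.0 - gamma) ** -horizon_power

# Example values (our choice): 1000 state-action pairs, effective
# horizon 1/(1 - gamma) = 100, target accuracy 0.1.
N, gamma, eps = 1000, 0.99, 0.1

minimax = bound(N, gamma, eps, 3)     # (1-gamma)^{-3}: minimax-optimal rate
convex_opt = bound(N, gamma, eps, 4)  # (1-gamma)^{-4}: prior convex-opt rate

# The extra (1-gamma)^{-1} factor inflates the bound by 1/(1-gamma),
# i.e. a factor of roughly 100 when gamma = 0.99.
print(convex_opt / minimax)
```

The ratio of the two bounds is exactly the effective horizon $1/(1-\gamma)$, which is why shaving one factor of $(1-\gamma)^{-1}$ matters most in the long-horizon regime $\gamma \to 1$.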