Efficiently Solving MDPs with Stochastic Mirror Descent

Yujia Jin · Aaron Sidford

Keywords: [ Computational Learning Theory ] [ Convex Optimization ] [ Reinforcement Learning Theory ] [ Optimization - Convex ]

Wed 15 Jul 10 a.m. PDT — 10:45 a.m. PDT
Wed 15 Jul 9 p.m. PDT — 9:45 p.m. PDT

Abstract: In this paper we present a unified framework based on primal-dual stochastic mirror descent for approximately solving infinite-horizon Markov decision processes (MDPs) given a generative model. When applied to an average-reward MDP with $A_{tot}$ total actions and mixing time bound $t_{mix}$, our method computes an $\epsilon$-optimal policy using an expected $\widetilde{O}(t_{mix}^2 A_{tot} \epsilon^{-2})$ samples from the state-transition matrix, removing the ergodicity dependence of prior art. When applied to a $\gamma$-discounted MDP with $A_{tot}$ total actions, our method computes an $\epsilon$-optimal policy using an expected $\widetilde{O}((1-\gamma)^{-4} A_{tot} \epsilon^{-2})$ samples, improving over the best-known primal-dual methods while matching the state-of-the-art up to a $(1-\gamma)^{-1}$ factor. Both methods are model-free, update state values and policy simultaneously, and run in time linear in the number of samples taken.
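To make the primal-dual structure concrete, here is a minimal illustrative sketch of one such scheme on a toy $\gamma$-discounted MDP: the dual variable (a distribution over state-action pairs) takes entropic mirror-descent (multiplicative) steps, while the value estimates take Euclidean gradient steps, with all gradients estimated from single generative-model samples. The toy MDP data, step sizes, and the simplified Lagrangian are assumptions for illustration only, not the paper's exact algorithm or its analyzed parameter choices.

```python
import math
import random

# Hypothetical toy MDP: 2 states, 2 actions per state (illustration only).
gamma = 0.9
P = {  # P[(s, a)] = probabilities of next state 0 and 1
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9],
}
r = {(0, 0): 0.1, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 0.2}
states = [0, 1]
actions = [(s, a) for s in states for a in (0, 1)]

def sample_next(s, a):
    # Generative-model oracle: one sample s' ~ P(. | s, a).
    return 0 if random.random() < P[(s, a)][0] else 1

def solve(iters=20000, eta_mu=0.05, eta_v=0.05, v_max=10.0):
    random.seed(0)
    v = [0.0 for _ in states]                        # primal: value estimates
    mu = {sa: 1.0 / len(actions) for sa in actions}  # dual: distribution on (s, a)
    mu_avg = {sa: 0.0 for sa in actions}
    for _ in range(iters):
        # Sample a state-action pair from the current dual iterate.
        u, sa = random.random(), actions[-1]
        acc = 0.0
        for cand in actions:
            acc += mu[cand]
            if u <= acc:
                sa = cand
                break
        s, a = sa
        s2 = sample_next(s, a)
        # Stochastic Bellman residual at the sampled pair: r + gamma*v[s'] - v[s].
        g = r[sa] + gamma * v[s2] - v[s]
        # Dual ascent: entropic mirror-descent step, then renormalize onto the simplex.
        mu[sa] *= math.exp(eta_mu * g)
        z = sum(mu.values())
        for k in mu:
            mu[k] /= z
        # Primal descent on v, clipped to a box (stochastic gradient of mu^T (r + gamma*P*v - v)).
        v[s] = min(v_max, max(-v_max, v[s] + eta_v))
        v[s2] = min(v_max, max(-v_max, v[s2] - eta_v * gamma))
        for k in mu_avg:
            mu_avg[k] += mu[k] / iters
    # Extract a deterministic policy from the averaged dual iterate.
    policy = {s: max((0, 1), key=lambda a: mu_avg[(s, a)]) for s in states}
    return v, policy
```

Note how both the value estimates and the policy (through the dual iterate) are updated simultaneously, one sample per iteration, matching the "model-free, linear in the number of samples" property stated in the abstract.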
