A Sharp Analysis of Model-based Reinforcement Learning with Self-Play
Qinghua Liu · Tiancheng Yu · Yu Bai · Chi Jin

Wed Jul 21 06:25 PM -- 06:30 PM (PDT) @ None
Model-based algorithms, which explore the environment by building and using an estimated model, are widely used in reinforcement learning practice and are theoretically shown to achieve optimal sample efficiency for single-agent reinforcement learning in Markov Decision Processes (MDPs). However, for multi-agent reinforcement learning in Markov games, the current best known sample complexity for model-based algorithms is rather suboptimal and compares unfavorably against recent model-free approaches. In this paper, we present a sharp analysis of model-based self-play algorithms for multi-agent Markov games. We design an algorithm, Optimistic Nash Value Iteration (Nash-VI), for two-player zero-sum Markov games that is able to output an $\epsilon$-approximate Nash policy in $\tilde{\mathcal{O}}(H^3SAB/\epsilon^2)$ episodes of game playing, where $S$ is the number of states, $A$ and $B$ are the numbers of actions for the two players respectively, and $H$ is the horizon length. This significantly improves over the best known model-based guarantee of $\tilde{\mathcal{O}}(H^4S^2AB/\epsilon^2)$, and is the first that matches the information-theoretic lower bound $\Omega(H^3S(A+B)/\epsilon^2)$ up to a $\min\{A,B\}$ factor. In addition, our guarantee compares favorably against the best known model-free algorithm if $\min\{A,B\}=o(H^3)$, and our algorithm outputs a single Markov policy, whereas existing sample-efficient model-free algorithms output a nested mixture of Markov policies that is in general non-Markov and rather inconvenient to store and execute. We further adapt our analysis to design a provably efficient task-agnostic algorithm for zero-sum Markov games, and to design the first line of provably sample-efficient algorithms for multi-player general-sum Markov games.
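
To make the comparison between the three bounds concrete, the following is a minimal sketch that evaluates only their leading terms. It assumes all constants and logarithmic factors are dropped, so the numbers are illustrative orders of magnitude, not actual episode counts. The function names and the example values of $H$, $S$, $A$, $B$, and $\epsilon$ are hypothetical and do not come from the paper.

```python
def nash_vi_episodes(H, S, A, B, eps):
    """Leading term H^3 * S * A * B / eps^2 (constants and log factors dropped)."""
    return H**3 * S * A * B / eps**2


def prior_model_based_episodes(H, S, A, B, eps):
    """Leading term H^4 * S^2 * A * B / eps^2 of the earlier model-based guarantee."""
    return H**4 * S**2 * A * B / eps**2


def lower_bound_episodes(H, S, A, B, eps):
    """Leading term H^3 * S * (A + B) / eps^2 of the information-theoretic lower bound."""
    return H**3 * S * (A + B) / eps**2


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    H, S, A, B, eps = 10, 100, 5, 20, 0.1

    new = nash_vi_episodes(H, S, A, B, eps)
    old = prior_model_based_episodes(H, S, A, B, eps)
    low = lower_bound_episodes(H, S, A, B, eps)

    # Improvement over the earlier model-based bound: a factor of H * S.
    print("improvement factor:", old / new)

    # Remaining gap to the lower bound: A * B / (A + B), which is at most min(A, B).
    print("gap to lower bound:", new / low, "vs min(A, B) =", min(A, B))
```

Under these assumptions the improvement over the earlier guarantee is the factor $HS$, and the remaining gap to the lower bound is $AB/(A+B)$, which lies between $\min\{A,B\}/2$ and $\min\{A,B\}$.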

Author Information

Qinghua Liu (Princeton University)
Tiancheng Yu (MIT)

I am a second-year PhD student in MIT EECS, advised by Prof. Suvrit Sra. Before MIT, I was an undergraduate student at Tsinghua University, majoring in Electronic Engineering and minoring in Statistics.

Yu Bai (Salesforce Research)
Chi Jin (Princeton University)
