Spotlight
A Sharp Analysis of Model-based Reinforcement Learning with Self-Play
Qinghua Liu · Tiancheng Yu · Yu Bai · Chi Jin
Abstract:
Model-based algorithms---algorithms that explore the environment by building and utilizing an estimated model---are widely used in reinforcement learning practice and have been theoretically shown to achieve optimal sample efficiency for single-agent reinforcement learning in Markov Decision Processes (MDPs). However, for multi-agent reinforcement learning in Markov games, the current best known sample complexity for model-based algorithms is rather suboptimal and compares unfavorably against recent model-free approaches. In this paper, we present a sharp analysis of model-based self-play algorithms for multi-agent Markov games. We design an algorithm, \emph{Optimistic Nash Value Iteration} (Nash-VI), for two-player zero-sum Markov games that outputs an $\epsilon$-approximate Nash policy in $\tilde{\mathcal{O}}(H^3SAB/\epsilon^2)$ episodes of game playing, where $S$ is the number of states, $A,B$ are the numbers of actions for the two players respectively, and $H$ is the horizon length. This significantly improves over the best known model-based guarantee of $\tilde{\mathcal{O}}(H^4S^2AB/\epsilon^2)$, and is the first that matches the information-theoretic lower bound $\Omega(H^3S(A+B)/\epsilon^2)$ except for a $\min\{A,B\}$ factor. In addition, our guarantee compares favorably against the best known model-free algorithm if $\min\{A,B\}=o(H^3)$, and it outputs a single Markov policy, while existing sample-efficient model-free algorithms output a nested mixture of Markov policies that is in general non-Markov and rather inconvenient to store and execute. We further adapt our analysis to design a provably efficient task-agnostic algorithm for zero-sum Markov games, and to design the first line of provably sample-efficient algorithms for multi-player general-sum Markov games.
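To make the model-based, optimistic backup concrete, here is a minimal illustrative sketch of one backward-induction step for a two-player zero-sum Markov game: form an optimistic Q-matrix at each state from an empirical model plus an exploration bonus, then solve the resulting zero-sum matrix game by linear programming. This is not the paper's exact Nash-VI pseudocode (which also maintains lower confidence bounds and extracts a certified output policy); the names `P_hat`, `r`, `bonus`, `V_next`, and `matrix_game_nash` are assumptions made for illustration.

```python
# Illustrative sketch only (not the paper's exact Nash-VI): one optimistic
# value-iteration step for a two-player zero-sum Markov game.
import numpy as np
from scipy.optimize import linprog


def matrix_game_nash(Q):
    """Value and max-player policy of the zero-sum matrix game Q (A x B) via LP."""
    A, B = Q.shape
    # Variables: [x_1, ..., x_A, v]; maximize v  <=>  minimize -v.
    c = np.zeros(A + 1)
    c[-1] = -1.0
    # For every min-player action b: v - x^T Q[:, b] <= 0.
    A_ub = np.hstack([-Q.T, np.ones((B, 1))])
    b_ub = np.zeros(B)
    # Simplex constraint: sum_a x_a = 1, x >= 0.
    A_eq = np.hstack([np.ones((1, A)), np.zeros((1, 1))])
    b_eq = np.ones(1)
    bounds = [(0, None)] * A + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:A]


def optimistic_step(P_hat, r, bonus, V_next, H):
    """One backward step: optimistic Q, then solve each state's matrix game.

    P_hat : (S, A, B, S) empirical transitions, r : (S, A, B) rewards,
    bonus : (S, A, B) exploration bonus, V_next : (S,) next-step value estimate.
    """
    S = r.shape[0]
    V, pi = np.zeros(S), []
    for s in range(S):
        # Optimistic Q: empirical Bellman backup plus bonus, clipped to [0, H].
        Q_bar = np.clip(r[s] + P_hat[s] @ V_next + bonus[s], 0.0, H)
        V[s], x = matrix_game_nash(Q_bar)
        pi.append(x)
    return V, pi
```

Running `optimistic_step` backward over the $H$ steps of the horizon, with bonuses shrinking as visit counts grow, is the basic template that the sharper analysis in the paper builds on.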