ICML 2021 Spotlight


A Sharp Analysis of Model-based Reinforcement Learning with Self-Play

Qinghua Liu · Tiancheng Yu · Yu Bai · Chi Jin

Session: Reinforcement Learning Theory 4


Abstract: Model-based algorithms (algorithms that explore the environment through building and utilizing an estimated model) are widely used in reinforcement learning practice and theoretically shown to achieve optimal sample efficiency for single-agent reinforcement learning in Markov Decision Processes (MDPs). However, for multi-agent reinforcement learning in Markov games, the current best known sample complexity for model-based algorithms is rather suboptimal and compares unfavorably against recent model-free approaches. In this paper, we present a sharp analysis of model-based self-play algorithms for multi-agent Markov games. We design an algorithm Optimistic Nash Value Iteration (Nash-VI) for two-player zero-sum Markov games that is able to output an $\epsilon$-approximate Nash policy in $\tilde{\mathcal{O}}(H^3SAB/\epsilon^2)$ episodes of game playing, where $S$ is the number of states, $A,B$ are the number of actions for the two players respectively, and $H$ is the horizon length. This significantly improves over the best known model-based guarantee of $\tilde{\mathcal{O}}(H^4S^2AB/\epsilon^2)$, and is the first that matches the information-theoretic lower bound $\Omega(H^3S(A+B)/\epsilon^2)$ except for a $\min\{A,B\}$ factor. In addition, our guarantee compares favorably against the best known model-free algorithm if $\min\{A,B\}=o(H^3)$, and outputs a single Markov policy while existing sample-efficient model-free algorithms output a nested mixture of Markov policies that is in general non-Markov and rather inconvenient to store and execute. We further adapt our analysis to designing a provably efficient task-agnostic algorithm for zero-sum Markov games, and designing the first line of provably sample-efficient algorithms for multi-player general-sum Markov games.
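To make the comparison between the three rates in the abstract concrete, the following is a minimal sketch that evaluates them numerically. It drops all constants and logarithmic factors hidden by $\tilde{\mathcal{O}}$ and $\Omega$, so the outputs are only order-of-magnitude scalings, not episode counts from the paper. The function names and the example values of $H$, $S$, $A$, $B$, $\epsilon$ are illustrative assumptions, not taken from the paper.

```python
import math

# Scalings from the abstract, with constants and log factors dropped.
# Function names are hypothetical, chosen only for this illustration.

def nash_vi_rate(H, S, A, B, eps):
    """Nash-VI upper bound scaling: H^3 * S * A * B / eps^2."""
    return H**3 * S * A * B / eps**2

def prior_model_based_rate(H, S, A, B, eps):
    """Previous best model-based scaling: H^4 * S^2 * A * B / eps^2."""
    return H**4 * S**2 * A * B / eps**2

def lower_bound_rate(H, S, A, B, eps):
    """Information-theoretic lower bound scaling: H^3 * S * (A + B) / eps^2."""
    return H**3 * S * (A + B) / eps**2

# Arbitrary example problem size (an assumption for illustration).
H, S, A, B, eps = 10, 20, 4, 6, 0.1

nash_vi = nash_vi_rate(H, S, A, B, eps)
prior = prior_model_based_rate(H, S, A, B, eps)
lower = lower_bound_rate(H, S, A, B, eps)

# Improvement over the prior model-based rate is exactly a factor of H * S.
improvement = prior / nash_vi

# Gap to the lower bound is A*B / (A+B), which always lies between
# min(A, B) / 2 and min(A, B): the "min{A, B} factor" in the abstract.
gap = nash_vi / lower

print(f"improvement over prior model-based rate: {improvement:.1f}")
print(f"gap to lower bound: {gap:.2f} (min(A, B) = {min(A, B)})")
```

In this sketch the improvement factor equals $HS$ and the remaining gap equals $AB/(A+B)$, which is why the upper bound matches the lower bound up to a $\min\{A,B\}$ factor.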
