## Efficient Learning for AlphaZero via Path Consistency

### Dengwei Zhao · Shikui Tu · Lei Xu

##### Hall E #834

Keywords: [ APP: Everything Else ] [ DL: Everything Else ] [ RL: Deep RL ]

Wed 20 Jul 3:30 p.m. PDT — 5:30 p.m. PDT

Spotlight presentation: Reinforcement Learning: Deep RL
Wed 20 Jul 7:30 a.m. PDT — 9 a.m. PDT

Abstract: In recent years, deep reinforcement learning has made great breakthroughs on board games. Still, most works require huge computational resources for a large number of environmental interactions or self-play games. This paper aims at building powerful models under a limited amount of self-play, which a human could study throughout a lifetime. We propose a learning algorithm built on AlphaZero, with its path searching regularised by a path consistency (PC) optimality, i.e., values on one optimal search path should be identical. Thus, the algorithm is named PCZero for short. In implementation, the historical trajectory and the search paths scouted by MCTS strike a good balance between exploration and exploitation, which effectively enhances the generalization ability. PCZero obtains a $94.1\%$ winning rate on $13\times 13$ Hex against the champion of the 2015 Hex Computer Olympiad, much higher than the $84.3\%$ achieved by AlphaZero. The models consume only $900K$ self-play games, about the amount a human could study in a lifetime. The improvements by PCZero also generalize to Othello and Gomoku. Experiments further demonstrate the efficiency of PCZero in the offline learning setting.
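The abstract's PC optimality condition states that values along one optimal search path should be identical. One natural way to turn this into a training regularizer is to penalize how much the predicted values along a path deviate from their common (mean) value. The sketch below is only an illustration of that idea; the function name and the exact mean-squared formulation are assumptions, not the authors' published loss:

```python
def path_consistency_loss(path_values):
    """Mean squared deviation of each value from the path mean.

    `path_values` holds the value estimates for the states along one
    search path. The loss is zero exactly when all values on the path
    agree, which is the PC optimality condition being regularized toward.
    This is a hypothetical sketch, not PCZero's exact implementation.
    """
    mean_v = sum(path_values) / len(path_values)
    return sum((v - mean_v) ** 2 for v in path_values) / len(path_values)
```

In practice such a term would be added, with some weight, to AlphaZero's usual policy and value losses, so that gradient updates pull the value estimates along each searched path toward agreement.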
