
🎤 SRL: Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores
Zhiyu Mei · Wei Fu · Guangju Wang · Huanchen Zhang · Yi Wu

Sat Jul 29 01:25 PM -- 01:40 PM (PDT)
Event URL: https://openreview.net/forum?id=cPmIdf5Wg8

The ever-growing complexity of reinforcement learning (RL) tasks demands a distributed system that trains intelligent agents by efficiently producing and processing massive amounts of data. In this paper, we propose a more comprehensive computational abstraction for RL training tasks and introduce a general, scalable, and efficient RL system called Really Scalable RL (SRL), featuring a novel architecture that separates the three major computation components in RL training. Our evaluation demonstrates that SRL outperforms RLlib (Liang et al., 2017), a popular open-source RL system, in training throughput. Moreover, to assess the learning performance of SRL, we conducted a benchmark on a large-scale cluster with 32 Nvidia A100 GPUs, 64 Nvidia RTX 3090 GPUs, and more than 10,000 CPU cores, reproducing the results of OpenAI's industrial production system Rapid (Berner et al., 2019) in the hide-and-seek environment (Baker et al., 2019). The results show that SRL achieves up to a 5x training speedup over the published results in Baker et al. (2019).
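The separation of major computation components can be illustrated with a minimal sketch. This is an assumption-laden toy, not SRL's actual API: we assume the three decoupled roles are environment simulation (actors), batched policy inference (policy workers), and gradient updates (trainers), communicating through queues; all names and the trivial dynamics are illustrative.

```python
import queue

# Hypothetical sketch of an SRL-style decoupled architecture.
# All names are illustrative; in a real deployment these roles run as
# independent workers on separate machines, connected by data streams.
obs_q = queue.Queue()      # actor -> policy worker: observations awaiting inference
act_q = queue.Queue()      # policy worker -> actor: computed actions
sample_q = queue.Queue()   # actor -> trainer: completed transitions

def policy_worker_step():
    """Policy inference: consume an observation, emit an action (trivial policy)."""
    _obs = obs_q.get()
    act_q.put(1)

def actor_step(env_state):
    """Environment simulation: request an action, step the env, emit a sample."""
    obs = env_state
    obs_q.put(obs)
    policy_worker_step()             # inlined here; distributed in a real system
    action = act_q.get()
    next_state = env_state + action  # toy dynamics
    sample_q.put((obs, action, next_state))
    return next_state

def trainer_step(batch_size):
    """Trainer: drain a batch of samples (stands in for a gradient update)."""
    batch = [sample_q.get() for _ in range(batch_size)]
    return len(batch)

state = 0
for _ in range(4):
    state = actor_step(state)
print(trainer_step(4))  # prints 4: one sample per actor step
```

Because each role only touches its queues, each can be scaled out independently, which is the property that lets CPU-heavy simulation and GPU-heavy inference/training use different hardware pools.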

Author Information

Zhiyu Mei (Tsinghua University)
Wei Fu (Tsinghua University)
Guangju Wang
Huanchen Zhang
Yi Wu (Tsinghua University & Shanghai Qi Zhi Institute)
