Workshop: ES-FoMo: Efficient Systems for Foundation Models

🎤 SRL: Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores

Zhiyu Mei · Wei Fu · Guangju Wang · Huanchen Zhang · Yi Wu

[ Abstract ] [ Project Page ]
[ OpenReview ]
Sat 29 Jul 1:25 p.m. PDT — 1:40 p.m. PDT


The ever-growing complexity of reinforcement learning (RL) tasks demands a distributed system that can train intelligent agents by efficiently producing and processing massive amounts of data. In this paper, we propose a more comprehensive computational abstraction for RL training tasks and introduce a general, scalable, and efficient RL system called Really Scalable RL (SRL), featuring a novel architecture that separates three major computation components in RL training. Our evaluation demonstrates that SRL outperforms RLlib (Liang et al., 2017), a popular open-source RL system, in training throughput. Moreover, to assess the learning performance of SRL, we conducted a benchmark on a large-scale cluster with 32 Nvidia A100 GPUs, 64 Nvidia RTX 3090 GPUs, and more than 10,000 CPU cores, reproducing the results of Rapid (Berner et al., 2019), OpenAI's industrial production system, in the hide-and-seek environment (Baker et al., 2019). The results show that SRL achieves up to a 5 times training speedup compared to the published results in Baker et al. (2019).
