ScaleMoE: Mixture-of-Experts for Scalable Continuous Control in Actor-Critic Reinforcement Learning
Abstract
Scaling networks remains a bottleneck in deep reinforcement learning (RL): simply enlarging actor–critic networks destabilizes training and quickly saturates performance. Although recent monolithic architectures such as SimBa and BRC have shown that carefully designed inductive biases can enable positive scaling up to a certain size, their improvements plateau as model parameters grow further. This work introduces ScaleMoE, a scalable RL architecture that integrates Mixture-of-Experts (MoE) modules into both the actor and the critic of modern continuous control algorithms. Two complementary gating schemes are studied: output-level aggregation of per-expert policies and Q-functions, and feature-level fusion of expert representations before a shared head. We instantiate ScaleMoE on two representative monolithic RL baselines: the single-task method SimBa and the multi-task method BRC. Experiments across the DeepMind Control Suite, MetaWorld, and HumanoidBench show that progressively increasing the number of experts (up to 64) yields substantial improvements in returns, significantly outperforming monolithic networks of comparable or even larger parameter counts. These results demonstrate that ScaleMoE provides an efficient and effective scaling axis for deep RL in continuous control.
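The two gating schemes named in the abstract can be illustrated with a minimal sketch. Everything concrete below is an illustrative assumption, not the paper's implementation: experts are reduced to single linear maps, the gate is a softmax over a linear projection of the input, and all dimensions are arbitrary. The point is only the structural difference: output-level gating mixes per-expert outputs (e.g. per-expert actions or Q-values), while feature-level gating fuses expert representations first and then applies one shared head.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

class MoELayer:
    """Illustrative sketch of the two gating schemes; all sizes and the
    linear-expert form are assumptions made for this example only."""

    def __init__(self, in_dim, out_dim, n_experts):
        self.W = rng.standard_normal((n_experts, out_dim, in_dim)) * 0.1  # per-expert weights
        self.G = rng.standard_normal((n_experts, in_dim)) * 0.1           # gating network
        self.head = rng.standard_normal((out_dim, out_dim)) * 0.1         # shared head (feature-level only)

    def output_level(self, x):
        # Output-level aggregation: each expert produces its own output,
        # and the gate mixes the outputs directly.
        g = softmax(self.G @ x)                   # (n_experts,) gate weights, sum to 1
        outs = np.einsum('eoi,i->eo', self.W, x)  # (n_experts, out_dim) per-expert outputs
        return g @ outs                           # gated mixture of outputs

    def feature_level(self, x):
        # Feature-level fusion: expert representations are mixed first,
        # then a single shared head maps the fused feature to the output.
        g = softmax(self.G @ x)
        feats = np.einsum('eoi,i->eo', self.W, x)  # per-expert features
        fused = g @ feats                          # gated mixture of features
        return self.head @ fused                   # one shared head on top
```

In this simplification the two schemes differ only in where the shared head sits; in practice the choice affects how much computation is shared across experts and how diverse the per-expert policies or Q-functions can be.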