Bridging Scaling Laws to On-Policy Reinforcement Learning via Adaptive Batch Scaling
Abstract
While the "Scaling Laws" have driven massive success in Computer Vision and NLP through large-scale training with massive batch sizes, Reinforcement Learning (RL) has largely failed to benefit from this paradigm. In RL, increasing batch sizes beyond a modest threshold often leads to diminishing returns or performance degradation due to the inherent non-stationarity of the data distribution. In this paper, we challenge the prevailing static view of batch sizes in RL by observing that the degree of non-stationarity is not constant: early training involves rapid behavioral shifts requiring small batches for plasticity, whereas late training approaches a quasi-stationary regime where large batches are essential for high-precision convergence. To leverage this insight, we propose Adaptive Batch Scaling (ABS), a simple yet effective framework that dynamically adjusts the effective batch size based on the stability of the learning process. We introduce Behavioral Divergence, a novel metric that quantifies non-stationarity by measuring action-level shifts between policy updates, and use it to scale the batch size inversely to the policy's volatility. By integrating ABS with the Parallelised Q-Network (PQN) algorithm, we demonstrate on the ALE benchmark that our method synergizes early-stage model plasticity with late-stage accurate and stable convergence. Our empirical results show that ABS not only yields substantial performance improvements over static baselines but also successfully scales to larger network architectures, offering a foundational step toward bridging the scaling gap between RL and supervised learning.