CompleteP for RL: Maintaining Feature Learning When Scaling Deep Reinforcement Learning
M Ganesh Kumar ⋅ Adam Lee ⋅ Blake Bordelon ⋅ Cengiz Pehlevan
Abstract
The maximal update parameterization ($\mu P$) has been influential in supervised and unsupervised learning settings with fixed data distributions, owing to its ability to maintain feature learning as models are scaled up. This parameterization yields more consistent learning dynamics and learned features across model sizes. Moreover, optimal hyperparameters such as the learning rate approximately transfer from small to large models, minimizing the computational overhead of hyperparameter sweeps. However, it remains unclear whether these benefits carry over to the reinforcement learning setting, where the model's learning dynamics are coupled to a shifting data distribution: reinforcement learning agents must continually adapt to non-stationary data throughout training. We empirically study how two regimes, the "rich" CompleteP and the "lazy" Neural Tangent Kernel (NTK) parameterizations, affect hyperparameter transfer and feature and policy consistency as we scale reinforcement learning agents. Ultimately, we show that agents trained with CompleteP achieve better compute and reward efficiency than agents trained with the NTK parameterization across 16 continuous control tasks and variants (e.g., normalization and sparse rewards). We therefore argue that adopting the CompleteP parameterization minimizes learning inconsistencies across model sizes and improves compute efficiency when scaling up.
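To make the contrast between the two regimes concrete, below is a minimal sketch, not the authors' code, of how initialization scale and per-layer Adam learning rates are typically set under a $\mu P$/CompleteP-style "rich" parameterization versus the "lazy" NTK parameterization, following standard $\mu P$ prescriptions. The layer widths, `base_width`, and the `make_param_groups` helper are illustrative assumptions, and CompleteP's additional depth-wise scaling rules are omitted for brevity.

```python
# Minimal sketch (not the authors' code): width-scaling rules for a "rich"
# muP/CompleteP-style regime vs. the "lazy" NTK regime, following standard
# muP prescriptions for Adam. `base_width` and `make_param_groups` are
# illustrative assumptions; CompleteP's depth-wise scaling is omitted.
import math
import torch
import torch.nn as nn


def init_mlp(widths, mode="muP"):
    """Build a linear MLP (activations omitted for brevity) whose weight
    initialization follows the chosen parameterization."""
    layers = [nn.Linear(widths[i], widths[i + 1]) for i in range(len(widths) - 1)]
    for i, layer in enumerate(layers):
        fan_in = layer.in_features
        if mode == "muP":
            # Hidden layers: std 1/sqrt(fan_in); output layer: std 1/fan_in,
            # which keeps outputs O(1) while features keep moving as width grows.
            std = 1.0 / fan_in if i == len(layers) - 1 else 1.0 / math.sqrt(fan_in)
        else:  # "NTK"
            # Lazy regime. The textbook NTK parameterization keeps O(1) weights
            # and multiplies activations by 1/sqrt(fan_in) in the forward pass;
            # it is folded into the init here for simplicity.
            std = 1.0 / math.sqrt(fan_in)
        nn.init.normal_(layer.weight, std=std)
        nn.init.zeros_(layer.bias)
    return nn.Sequential(*layers)


def make_param_groups(model, base_lr, base_width, mode="muP"):
    """Per-layer Adam learning rates. Under muP, matrix-like layers get an LR
    shrinking as base_width / fan_in, so per-step feature updates stay O(1)
    at any width and the LR tuned at base_width transfers to larger models."""
    groups = []
    for layer in model:
        lr = base_lr * base_width / layer.in_features if mode == "muP" else base_lr
        groups.append({"params": layer.parameters(), "lr": lr})
    return groups


# Example: tune base_lr at width 128, then reuse it at width 1024.
width = 1024
model = init_mlp([32, width, width, 4], mode="muP")
opt = torch.optim.Adam(make_param_groups(model, base_lr=1e-3, base_width=128))
```

With this grouping, a learning rate tuned at `base_width` can be reused at larger widths in the rich regime, which is the hyperparameter transfer the abstract describes; under the NTK branch, the same width-independent learning rate leaves features nearly frozen as width grows, which is the lazy behavior the abstract contrasts against.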