MRPO: Magnitude-Regularized Policy Optimization via L1 Constraints
Wei Han ⋅ Yuanxing Liu ⋅ Mingda Li ⋅ Ruiyu Xiao ⋅ Weinan Zhang ⋅ Ting Liu
Abstract
Reinforcement learning (RL) for large language models (LLMs) relies on imperfect reward supervision, necessitating constraints on policy updates to prevent overfitting. However, the widely adopted KL constraint over-penalizes actions with low reference probabilities and lacks the sparsity needed to discard marginal policy shifts. In contrast, the L1-norm offers a distinct mechanism: it is more tolerant of low-probability actions yet strictly suppresses minor probability perturbations. Motivated by this, we propose $\textbf{M}$agnitude-$\textbf{R}$egularized $\textbf{P}$olicy $\textbf{O}$ptimization (MRPO), which enforces an L1-norm constraint on policy updates. We show that MRPO permits substantial probability boosts for low-probability actions and induces sparse updates, making the policy invariant to noise that preserves the top-ranking order. Furthermore, MRPO guarantees convergence in general RL settings and comes closer to optimality than KL-based methods in single-step scenarios. Empirically, MRPO delivers strong results across diverse scenarios, notably doubling the performance gains of GRPO in preference alignment, outperforming DAPO in mathematical reasoning, and surpassing DPO in offline settings using only binary rewards.
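As a concrete illustration of the contrast the abstract draws (this sketch is not from the paper), the snippet below compares a standard per-token KL penalty with an L1-style penalty on the probability shift. The function name `policy_penalties`, the use of the k3 KL estimator, and the per-sampled-token approximation of the L1 norm are illustrative assumptions, not the authors' implementation.

```python
import torch


def policy_penalties(logp_new: torch.Tensor, logp_ref: torch.Tensor):
    """Contrast a KL penalty with an L1 penalty on a policy update.

    logp_new, logp_ref: log-probabilities of the sampled tokens under the
    current and reference policies, shape (batch, seq_len). Both penalties
    are per-sample surrogates; the exact norms would sum over the full
    vocabulary, which is rarely done in practice.
    """
    # k3 KL estimator, exp(r) - r - 1 with r = logp_ref - logp_new.
    # Boosting a token whose reference probability is tiny makes r strongly
    # negative, so the penalty grows roughly as logp_new - logp_ref:
    # low-reference-probability actions are heavily penalized.
    r = logp_ref.detach() - logp_new
    kl_penalty = (torch.exp(r) - r - 1.0).mean()

    # L1 surrogate: |p_new - p_ref| on the sampled tokens. The shift for a
    # low-probability token is bounded by max(p_new, p_ref) <= 1, and the
    # subgradient of |.| zeroes out marginal perturbations, which is the
    # sparsity property the abstract describes.
    l1_penalty = (torch.exp(logp_new) - torch.exp(logp_ref).detach()).abs().mean()

    return kl_penalty, l1_penalty
```

A training loop would add one of these terms, scaled by a coefficient, to the policy-gradient loss. The behavioral difference is that the KL term grows without bound as a low-reference-probability token is boosted, while the L1 term stays bounded per token regardless of how small the reference probability is.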