Harmonized Dual Policy Improvement for Model-based Reinforcement Learning
Guojian Zhan ⋅ Likun Wang ⋅ Feihong Zhang ⋅ Yang Guan ⋅ Shengbo Li
Abstract
Policy-planner bootstrapping has emerged as a powerful paradigm in model-based reinforcement learning (MBRL). We formalize this process as a dual policy improvement mechanism that synergizes (i) exploitative improvement via off-policy $Q$-maximization and (ii) lookahead improvement via planner alignment. While we prove theoretically that these two improvements anchor to the same optimum, the practical training process inevitably encounters gradient disagreement. Exacerbated by approximation inaccuracies and non-stationary data, this disagreement induces destructive interference in policy updates, destabilizing the bootstrapping loop and leading to suboptimal convergence. To address this, we propose harmonized dual policy improvement (HDPI), a gradient-level framework that reconciles exploitative and lookahead improvements through a harmonic optimization scheme. This scheme maximizes the worst-case inner product between the harmonized update and the original gradients, ensuring directional consistency and stabilizing policy evolution. Extensive empirical evaluations on 14 challenging tasks from the DeepMind Control Suite and the Humanoid-Bench demonstrate that HDPI significantly enhances training stability and asymptotic performance, outperforming a wide range of strong baselines.
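To make the worst-case inner-product idea concrete, the sketch below shows one standard way such an objective can be realized for two gradients: the harmonized update is taken along the minimum-norm convex combination of the exploitative gradient and the lookahead gradient, whose direction maximizes the smaller of the two inner products. This is a minimal illustration under assumed conventions (flattened gradient vectors, illustrative names `g_q` and `g_p`), not the paper's released implementation.

```python
import numpy as np

def harmonize_two_gradients(g_q: np.ndarray, g_p: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Sketch of a two-gradient harmonization step.

    Returns the minimum-norm convex combination d = a*g_q + (1-a)*g_p;
    by duality, the direction of this d maximizes the worst-case inner
    product min(<d, g_q>, <d, g_p>) over unit-norm update directions.
    """
    diff = g_q - g_p
    denom = float(diff @ diff) + eps
    # Closed-form minimizer of ||a*g_q + (1-a)*g_p||^2 over a in [0, 1]
    a = float(np.clip((g_p - g_q) @ g_p / denom, 0.0, 1.0))
    return a * g_q + (1.0 - a) * g_p

# Example with illustrative gradients: the harmonized update keeps a
# non-negative inner product with both original gradients.
g_q = np.array([1.0, 0.2])   # hypothetical exploitative (Q-maximization) gradient
g_p = np.array([0.1, 1.0])   # hypothetical lookahead (planner-alignment) gradient
d = harmonize_two_gradients(g_q, g_p)
print(d, d @ g_q, d @ g_p)
```

In this two-gradient case the closed form balances the two inner products whenever the gradients conflict, which is the directional-consistency property the abstract refers to; a general scheme over more objectives would require solving the corresponding min-norm problem numerically.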