ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation
Hongru Hou ⋅ Tiehua Mei ⋅ Denghui Geng ⋅ Jinhui Huang ⋅ Ao Xu ⋅ Hengrui Chen ⋅ Jiaqing Liang ⋅ Deqing Yang
Abstract
Proactive Recommender Systems (PRSs) aim to guide users' preference shift toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks: by operating on path rewards, it can jointly optimize short-term acceptance and long-term guidance effectiveness. However, naively applying policy gradients to PRSs yields deficient gradient estimation. Specifically, a length-dependent bias causes gradients to favor path extension over deeper exploration, while weighting every step by the path-level reward leads to high gradient variance. To rectify these two deficiencies, we propose $\textbf{ProRL}$, an effective RL framework with two novel mechanisms for proactive recommendation. First, Stepwise Reward Centering subtracts expected rewards to neutralize the length-dependent bias, ensuring that path extension yields zero expected gradient signal. Second, Position-Specific Advantage Estimation leverages the reward decomposition structure to compute step-dependent baselines, reducing gradient variance. Together, these mechanisms yield policy gradients that precisely target path quality. Experiments on three real-world datasets demonstrate that ProRL significantly outperforms state-of-the-art PRSs. Our code is available at https://anonymous.4open.science/r/ProRL-D56DHM.
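To make the two mechanisms concrete, below is a minimal sketch (not the authors' implementation) of how stepwise reward centering and position-specific baselines could enter a REINFORCE-style loss. It assumes per-step rewards decomposed from the path reward and estimates the position-wise baselines from the sampled batch; the function and variable names (`rectified_policy_gradient_loss`, `step_rewards`, etc.) are hypothetical, and ProRL's exact formulation is given in the paper.

```python
import numpy as np

def rectified_policy_gradient_loss(log_probs, step_rewards, lengths):
    """Hedged sketch of a policy-gradient loss with stepwise reward centering
    and position-specific baselines (illustrative, not ProRL's exact code).

    log_probs:    (B, T) log pi(a_t | s_t) for each sampled path, zero-padded
    step_rewards: (B, T) per-step rewards decomposed from the path reward
    lengths:      (B,)   actual path lengths
    """
    B, T = step_rewards.shape
    # Mask out padded positions beyond each path's length.
    mask = (np.arange(T)[None, :] < lengths[:, None]).astype(float)  # (B, T)

    # Stepwise reward centering: subtract the batch-estimated expected reward
    # at each position, so merely extending a path contributes zero expected
    # gradient signal.
    pos_counts = mask.sum(axis=0).clip(min=1.0)                # paths alive at t
    pos_mean = (step_rewards * mask).sum(axis=0) / pos_counts  # (T,)
    centered = (step_rewards - pos_mean[None, :]) * mask

    # Position-specific advantages: return-to-go of the centered rewards,
    # i.e. a step-dependent quantity instead of one path-level weight.
    advantages = np.flip(np.cumsum(np.flip(centered, axis=1), axis=1), axis=1)

    # REINFORCE-style loss: negative log-likelihood weighted by advantages.
    loss = -(log_probs * advantages * mask).sum() / mask.sum()
    return loss
```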