Multi-timescale Reinforcement Learning by Value Reconstruction
Abstract
Most reinforcement learning (RL) baselines maximize future cumulative rewards under a single fixed discount factor, which limits their performance in complex sequential decision-making tasks because they fail to balance short-term objectives with long-term planning. To address this issue, this paper focuses on a multi-timescale critic framework, in which each component corresponds to a Q-value with a distinct discount factor. Two key improvements are proposed: (1) a Neural Reward Decoder that reconstructs the reward sequence from the multi-timescale Q-values, with value- and reward-reconstruction losses enforcing consistency among the Q-value estimates; (2) a cross-attention-based Q-weight predictor that adaptively adjusts the Q-value weights conditioned on the current observation to produce the final Q-value for policy optimization. Extensive experiments on the DMControl and CARLA benchmarks demonstrate that our method significantly outperforms state-of-the-art (SOTA) baselines. Furthermore, we validate the framework's generalizability by integrating it with both off-policy (SAC, DrQ-v2) and on-policy (PPO) algorithms, achieving consistent performance gains. The code is available in the supplementary material.
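For concreteness, the following is a minimal PyTorch-style sketch of the components named in the abstract: a critic with one Q-value head per discount factor, a Neural Reward Decoder that maps those Q-values to a short predicted reward sequence, and a cross-attention Q-weight predictor that mixes the Q-values into a single value for policy optimization. This is not the authors' released implementation (the abstract points to the supplementary material for that); all class names, hidden sizes, the prediction horizon, and the default discount factors are illustrative assumptions, and the training losses are omitted.

```python
# Illustrative sketch only; hyperparameters and module structure are assumptions,
# not the paper's reference implementation.
import torch
import torch.nn as nn


class MultiTimescaleCritic(nn.Module):
    """Predicts one Q-value per discount factor from a (state, action) pair."""

    def __init__(self, obs_dim, act_dim, gammas=(0.9, 0.99, 0.999), hidden=256):
        super().__init__()
        self.gammas = gammas
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, len(gammas)),          # one head per timescale
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))  # (batch, n_gammas)


class NeuralRewardDecoder(nn.Module):
    """Maps the vector of multi-timescale Q-values to a short predicted reward
    sequence; a reconstruction loss against observed rewards (not shown) would
    then regularize the critic."""

    def __init__(self, n_gammas, horizon=5, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_gammas, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon),
        )

    def forward(self, q_multi):
        return self.net(q_multi)  # (batch, horizon) predicted rewards


class QWeightPredictor(nn.Module):
    """Cross-attention between the observation (query) and learned per-timescale
    embeddings (keys/values); the attended feature is mapped to softmax weights,
    one per Q-value."""

    def __init__(self, obs_dim, n_gammas, d_model=64):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, d_model)
        self.q_embed = nn.Parameter(torch.randn(n_gammas, d_model))  # one token per timescale
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.out = nn.Linear(d_model, n_gammas)

    def forward(self, obs):
        query = self.obs_proj(obs).unsqueeze(1)                   # (batch, 1, d_model)
        keys = self.q_embed.unsqueeze(0).expand(obs.size(0), -1, -1)
        attended, _ = self.attn(query, keys, keys)                # (batch, 1, d_model)
        return torch.softmax(self.out(attended.squeeze(1)), -1)  # (batch, n_gammas)


if __name__ == "__main__":
    obs, act = torch.randn(8, 17), torch.randn(8, 6)
    critic, decoder, weighter = MultiTimescaleCritic(17, 6), NeuralRewardDecoder(3), QWeightPredictor(17, 3)
    q_multi = critic(obs, act)                            # per-timescale Q-values
    reward_seq = decoder(q_multi)                         # reconstructed reward sequence
    weights = weighter(obs)                               # observation-dependent mixing weights
    q_final = (weights * q_multi).sum(-1, keepdim=True)   # single Q-value for policy optimization
    print(q_final.shape, reward_seq.shape)                # torch.Size([8, 1]) torch.Size([8, 5])
```

Under these assumptions, q_final is the weighted combination fed to whichever base algorithm (e.g., SAC, DrQ-v2, or PPO) the framework is attached to, while the reward and value reconstruction losses would be added to the critic's training objective.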