Mind Your Entropy: From Maximum Entropy to Trajectory Entropy-Constrained RL
Abstract
Maximum entropy reinforcement learning (RL) has become a mainstream off-policy framework for balancing exploitation and exploration. However, two bottlenecks still limit further performance gains: (1) non-stationary Q-value estimation, caused by injecting entropy into the value target while its temperature parameter is concurrently updated; and (2) short-sighted local entropy tuning, which adjusts the temperature based solely on the current single-step entropy without accounting for cumulative entropy over the trajectory. In this paper, we broaden the maximum entropy framework by proposing trajectory entropy-constrained reinforcement learning (TECRL) to address these limitations. We first introduce reward-entropy separation (RES) to decouple the value targets, keeping them stable and unaffected by temperature fluctuations. The resulting entropy Q-function then explicitly quantifies the expected cumulative entropy, enabling a trajectory entropy constraint (TEC) that governs long-term policy stochasticity. We instantiate this framework as DSAC-E, a practical off-policy algorithm built upon the latest distributional soft actor-critic (DSAC). Extensive evaluations on 10 challenging tasks spanning locomotion, robotic manipulation, and vision-based driving demonstrate that DSAC-E consistently outperforms baselines in both cumulative return and training stability.
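For concreteness, one plausible formalization of the trajectory entropy constraint described above is sketched below; the entropy Q-function $Q^{\mathcal{H}}_{\pi}$, discount factor $\gamma$, reward $r$, and entropy target $\bar{\mathcal{H}}$ are illustrative notation inferred from this abstract, not the paper's own definitions.
\begin{align*}
  % Entropy Q-function: expected discounted policy entropy accumulated
  % along the trajectory after taking action a in state s.
  Q^{\mathcal{H}}_{\pi}(s,a) &= \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{\infty}\gamma^{t}\,\mathcal{H}\big(\pi(\cdot\mid s_t)\big)\;\middle|\;s_0=s,\,a_0=a\right],\\[2pt]
  % Constrained objective: maximize return subject to the expected
  % trajectory entropy (current-step entropy plus future entropy via
  % the entropy Q-function) meeting the target \bar{\mathcal{H}}.
  \max_{\pi}\;\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\,r(s_t,a_t)\right]
  \;\;&\text{s.t.}\;\;
  \mathbb{E}_{s_0}\!\left[\mathcal{H}\big(\pi(\cdot\mid s_0)\big)+\mathbb{E}_{a_0\sim\pi}\,Q^{\mathcal{H}}_{\pi}(s_0,a_0)\right]\;\ge\;\bar{\mathcal{H}}.
\end{align*}
Under these assumptions, the constraint bounds the expected cumulative (trajectory) entropy rather than the per-step entropy, which is the distinction the abstract draws against local temperature tuning.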