Taming the Aleatoric Impulse in Off-Policy Reinforcement Learning
Abstract
Off-policy reinforcement learning is vulnerable to overestimation bias, which is rooted in the total uncertainty of value estimates. Existing methods, however, typically target only the epistemic component of this uncertainty while neglecting the aleatoric component. We identify, for the first time, that this oversight fails to contain a massive transient bias surge, which we term the Aleatoric Impulse. Although short-lived, this impulse fundamentally derails the learning trajectory, permanently locking the agent into suboptimal policies. To counteract it, we propose Aleatoric Impulse Damping (AID), the first mechanism that models total value uncertainty by disentangling the return variance into epistemic and aleatoric components and then adaptively reweighting and recombining them. Leveraging this derived uncertainty, the critic constructs a pessimistic lower confidence bound to surgically suppress the impulse. Complementing this, the actor exploits a symmetric upper confidence bound to drive optimistic exploration, ensuring that the necessary pessimism does not compromise exploration efficiency. We integrate this mechanism into the Distributional Soft Actor-Critic algorithm to obtain DSAC-AID. Extensive experiments on the high-dimensional Gym-MuJoCo and DeepMind Control Suite benchmarks demonstrate that DSAC-AID achieves state-of-the-art final performance.
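To make the mechanism concrete, one plausible reading of AID's recombination and confidence bounds is sketched below; the weights $w_e, w_a$, the standard deviations $\sigma_e, \sigma_a$, the mean value estimate $\bar{Q}$, and the coefficient $\beta$ are illustrative placeholders, not the paper's exact formulation:

\[
\sigma_{\mathrm{total}}^{2} = w_e\,\sigma_e^{2} + w_a\,\sigma_a^{2},
\qquad
\underbrace{Q_{\mathrm{LCB}} = \bar{Q} - \beta\,\sigma_{\mathrm{total}}}_{\text{pessimistic critic target}},
\qquad
\underbrace{Q_{\mathrm{UCB}} = \bar{Q} + \beta\,\sigma_{\mathrm{total}}}_{\text{optimistic actor objective}}.
\]

Under this reading, the same total uncertainty is used symmetrically: subtracted in the critic to damp the impulse, and added in the actor to preserve exploration.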