Clipped Q-Learning: Your Value Clipping Is Secretly A Robust Operator
Zhishuai Liu ⋅ Pan Xu
Abstract
We study a simple yet principled modification of classical Q-learning that clips the value estimate in the Bellman backup at a threshold $\lambda$. The resulting algorithm, clipped Q-learning, is motivated by a key theoretical insight: the clipped Bellman backup is an unbiased one-sample estimate of a robust Bellman operator that arises naturally from a transition-regularized MDP framework. This formulation corresponds to optimizing performance against a specific class of adversarial dynamics perturbations at test time, one that reallocates transition probability mass away from high-value states and thereby induces conservative but stable decision making. Under this interpretation, clipped Q-learning can be viewed as tracking the fixed point of the robust Bellman equation and thus as learning policies that hedge against adversarial dynamics shifts. We analyze two clipped Q-learning variants equipped with an optimistic exploration bonus and establish polynomial regret guarantees, demonstrating their statistical efficiency. Beyond the tabular setting, our framework suggests that value clipping is a modular mechanism that can be incorporated into general value-based RL algorithms with function approximation. As a proof of concept, we evaluate a clipped Double DQN algorithm on a control task and observe robustness improvements consistent with our theoretical predictions.
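For concreteness, a minimal sketch of the clipped backup in the tabular setting, under the assumption that clipping truncates the next-state value from above at $\lambda$ (the exact operator is defined in the paper body), is
$$
Q_{t+1}(s,a) \leftarrow (1-\alpha_t)\,Q_t(s,a) + \alpha_t\Big(r(s,a) + \gamma\,\min\big\{\max_{a'} Q_t(s',a'),\ \lambda\big\}\Big),
$$
where $s'$ is the sampled next state and $\alpha_t$ is the step size; the clipped target $\min\{\max_{a'} Q_t(s',a'),\ \lambda\}$ is the one-sample estimate of the robust Bellman operator described above.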