Poster in Workshop: Reinforcement Learning for Real Life
Value-Based Deep Reinforcement Learning Requires Explicit Regularization
Aviral Kumar · Rishabh Agarwal · Aaron Courville · Tengyu Ma · George Tucker · Sergey Levine
While deep reinforcement learning (RL) methods present an appealing approach to sequential decision-making, such methods are often unstable in practice. What accounts for this instability? Recent theoretical analysis of overparameterized supervised learning with stochastic gradient descent shows that learning is driven by an implicit regularizer, which results in simpler functions that generalize despite overparameterization. However, in this paper, we show that in the case of deep RL, this very same implicit regularization can instead lead to degenerate features. Specifically, features learned by the value function at state-action tuples appearing on both sides of the Bellman update "co-adapt" to each other, giving rise to poor solutions. We characterize the nature of this implicit regularizer in temporal difference learning algorithms and show that this regularizer recapitulates recent empirical findings regarding the rank collapse of learned features and provides an understanding of its cause. To address the adverse impact of this implicit regularization, we propose a simple and effective explicit regularizer, DR3. DR3 minimizes the similarity between the learned features of the Q-network at consecutive state-action tuples in the TD update. Empirically, when combined with existing offline RL methods, DR3 substantially improves both performance and stability on Atari 2600 games, D4RL domains, and robotic manipulation from images.
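As a rough illustration only (not the authors' implementation), the DR3 idea described in the abstract could be attached to a standard TD loss as sketched below. The network architecture, the regularization coefficient `dr3_coef`, the batch format, and the use of the penultimate-layer output as the feature are assumptions made for this sketch.

```python
# Minimal sketch of a DR3-style explicit regularizer on top of a TD loss.
# Everything here (architecture, coefficient, batch layout) is illustrative.
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    # Q-network whose penultimate-layer output is treated as the feature phi(s).
    def __init__(self, obs_dim, num_actions, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs):
        features = self.encoder(obs)      # learned features used by the regularizer
        q_values = self.head(features)    # Q(s, a) for every discrete action
        return q_values, features


def td_loss_with_dr3(q_net, target_net, batch, gamma=0.99, dr3_coef=0.03):
    """TD loss plus a DR3-style feature dot-product penalty (illustrative)."""
    obs, actions, rewards, next_obs, dones = batch

    q_values, feats = q_net(obs)
    q_sa = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        next_q, _ = target_net(next_obs)
        targets = rewards + gamma * (1.0 - dones) * next_q.max(dim=1).values

    td_loss = ((q_sa - targets) ** 2).mean()

    # DR3-style term: penalize the dot-product similarity between features of
    # the state-action tuples on the two sides of the same Bellman update,
    # discouraging them from "co-adapting" to each other.
    _, next_feats = q_net(next_obs)
    dr3_term = (feats * next_feats).sum(dim=1).mean()

    return td_loss + dr3_coef * dr3_term
```

In this sketch the shared encoder's per-state output stands in for the feature of a state-action tuple; with a multi-headed discrete-action Q-network the feature does not depend on the chosen action, which keeps the example simple but is a simplification, not a claim about the paper's exact construction.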