Skip to yearly menu bar Skip to main content

Workshop: Over-parameterization: Pitfalls and Opportunities

Value-Based Deep Reinforcement Learning Requires Explicit Regularization

Aviral Kumar · Rishabh Agarwal · Aaron Courville · Tengyu Ma · George Tucker · Sergey Levine


While deep reinforcement learning (RL) methods present an appealing approach to sequential decision-making, such methods are often unstable in practice. What accounts for this instability? Recent theoretical analysis of overparameterized supervised learning with stochastic gradient descent shows that learning is driven by an implicit regularizer once it reaches the zero training error regime, which results in “simpler” functions that generalize. However, in this paper, we show that in the case of deep RL, implicit regularization can instead lead to degenerate features by characterizing the induced implicit regularizer in the semi-gradient TD learning setting in RL with a fixed-dataset (i.e., offline RL). The regularize recapitulates recent empirical findings regarding the rank collapse of learned features and provides an understanding for its cause. To address the adverse impacts of this implicit regularization, we propose a simple and effective explicit regularizer for TD learning, DR3. DR3 minimizes the similarity of learned features of the Q-network at consecutive state-action tuples in the TD update. Empirically, when combined with existing offline RL methods, DR3 substantially improves both performance and stability on Atari 2600 games, D4RL domains, and robotic manipulation from images.