REG: In-Sample RL via Regularizing the Evaluation Gap
Abstract
Distribution shift poses a fundamental challenge in offline reinforcement learning, often leading to value overestimation when out-of-distribution actions are queried. We introduce Regularized Evaluation Gap (REG) as a bridge between implicit methods such as IQL and explicitly conservative methods. We formulate policy evaluation as a robust optimization problem over an ambiguity set of critics and show that IQL’s objective can be viewed as an approximate dual solution to this problem. To extract a policy from the learned value function, we propose a practical Orthogonal Policy Gradient (OPG) update. This update regularizes an aggressive, mode-seeking policy gradient by projecting it onto the subspace orthogonal to a stable, in-sample behavior cloning gradient. Extensive experiments on the D4RL benchmark demonstrate that REG matches the state-of-the-art performance of both Gaussian-policy and diffusion-based approaches, without the computational burden of the latter.
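For concreteness, the projection step described above admits a simple closed form; the following is a sketch in our own notation (the symbols $g_\pi$ and $g_{\mathrm{BC}}$ are not defined in the abstract). Writing $g_\pi$ for the aggressive, mode-seeking policy gradient and $g_{\mathrm{BC}}$ for the in-sample behavior cloning gradient, projecting $g_\pi$ onto the subspace orthogonal to $g_{\mathrm{BC}}$ yields the update direction
\[
g_{\mathrm{OPG}} \;=\; g_\pi \;-\; \frac{\langle g_\pi,\, g_{\mathrm{BC}}\rangle}{\lVert g_{\mathrm{BC}}\rVert^2}\, g_{\mathrm{BC}},
\]
i.e., the component of $g_\pi$ that carries no component along $g_{\mathrm{BC}}$.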