Revisiting Distribution Correction Estimation for Offline Imitation Learning with Suboptimal Datasets
Abstract
Imitation Learning (IL) has demonstrated strong capabilities in learning high-quality policies from expert demonstrations for sequential decision-making tasks. Nonetheless, its effectiveness is significantly constrained in low-expert-data regimes. To mitigate this issue, prior work introduces ``offline IL with supplementary data,'' which augments expert demonstrations with additional, low-cost data generated by suboptimal policies. A prominent framework for this setting is Distribution Correction Estimation (DICE), which estimates the optimal density ratio by solving the dual of a divergence-minimization problem between the visitation distributions of the learned policy and the expert. Despite their theoretical appeal, existing DICE-based methods often require adding a dataset regularizer to the divergence objective or rely on a strict coverage assumption; both requirements narrow the settings in which these methods remain effective. In this paper, we introduce ReDICE, a new method that addresses these limitations. ReDICE is derived by reformulating the KL divergence between the expert and learned-policy occupancy measures as an objective under a mixture distribution. We formally prove that the dual of this formulation is equivalent to a stable Gumbel regression objective. Furthermore, we introduce a novel policy extraction mechanism that significantly improves performance in practice. Experiments across diverse benchmarks show that ReDICE achieves state-of-the-art results compared with prior offline IL baselines.