Provably Efficient Policy-Reward Co-Pretraining for Adversarial Imitation Learning
Abstract
Adversarial imitation learning (AIL) achieves superior expert sample efficiency compared to behavioral cloning (BC), yet it requires substantial online interaction with the environment. While recent empirical work has explored initializing AIL algorithms with BC-pretrained policies to address this limitation, a rigorous theoretical understanding of the role of pretraining in AIL is still lacking. This paper provides a systematic theoretical analysis and develops principled pretraining algorithms for accelerating AIL. We first analyze AIL with policy pretraining alone, identify reward error as the dominant error source, and thereby uncover a critical yet previously unexplored gap: the omission of reward pretraining. Leveraging this insight, we introduce a principled policy-reward co-pretraining mechanism derived from a reward-shaping analysis. This analysis reveals a fundamental connection between expert policies and shaping rewards, which naturally motivates CoPT-AIL, an approach that jointly pretrains the policy and the reward through a single BC procedure. We prove that CoPT-AIL enjoys an improved imitation gap bound over standard AIL, establishing the first theoretical guarantee for the benefit of pretraining in AIL. Experimental results confirm that CoPT-AIL outperforms existing AIL methods.