Unifying Value Alignment and Assignment in Cross-Domain Offline Reinforcement Learning with Heterogeneous Datasets
Abstract
Cross-domain offline reinforcement learning (RL) aims to train an agent that performs well in the target domain using a limited target-domain dataset together with a source-domain dataset that exhibits a dynamics shift. Training directly on the raw source dataset typically leads to performance collapse. Recent studies perform data filtering from the perspective of dynamics alignment or value alignment to enable efficient policy transfer. However, these studies are typically validated on source datasets collected from a single domain or by a single behavior policy. In this work, we explore a more general setting in which the source datasets may be collected from multiple source domains by diverse behavior policies, which we term heterogeneous cross-domain offline RL. We first uncover a critical yet overlooked issue in this setting: \textit{value misassignment}. Both empirically and theoretically, we demonstrate that value misassignment can undermine value alignment, mislead data filtering toward selecting suboptimal samples, and loosen the suboptimality bound, thereby degrading the agent's performance. To address this issue, we propose V2A, a simple yet effective framework that unifies dynamics alignment, value alignment, and value assignment. V2A first employs temporally consistent modality representation learning to extract dynamics modalities from the source dataset, followed by modality-aware advantage learning to rectify value alignment. Finally, it adopts a data-filtering paradigm to selectively share source data for policy learning. Empirical results show that, under general heterogeneous cross-domain offline RL settings, V2A significantly outperforms strong baselines and performs consistently well across multiple tasks and datasets.
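To make the three-stage pipeline concrete, below is a minimal, hypothetical Python sketch. Every function name is our own, and the k-means-style clustering stand-in for temporally consistent modality representation learning and the per-modality advantage normalization are illustrative assumptions, not the paper's actual learned components.

```python
# Illustrative sketch only: all names below are hypothetical, and the
# clustering / normalization choices are stand-ins for the learned
# components described in the abstract.
import numpy as np

def modality_representations(transitions, n_modalities=4, seed=0):
    """Stand-in for temporally consistent modality representation learning.

    Here we simply cluster (s, a, s') features with a few k-means-style
    updates; the paper instead learns modality representations from data.
    """
    rng = np.random.default_rng(seed)
    feats = np.concatenate(
        [transitions["s"], transitions["a"], transitions["s2"]], axis=1)
    centers = feats[rng.choice(len(feats), n_modalities, replace=False)]
    labels = np.zeros(len(feats), dtype=int)
    for _ in range(10):
        labels = ((feats[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for k in range(n_modalities):
            if (labels == k).any():
                centers[k] = feats[labels == k].mean(axis=0)
    return labels

def modality_aware_advantage(rewards, values, labels):
    # Stand-in for modality-aware advantage learning: advantages are
    # normalized within each dynamics modality rather than globally, so
    # one modality's value scale cannot distort the ranking of samples
    # from another (the "value misassignment" failure mode).
    adv = rewards - values
    out = np.empty_like(adv)
    for k in np.unique(labels):
        m = labels == k
        out[m] = (adv[m] - adv[m].mean()) / (adv[m].std() + 1e-8)
    return out

def filter_source_data(transitions, values, keep_ratio=0.5):
    # Data-filtering stage: keep the source samples whose
    # modality-corrected advantage is highest.
    labels = modality_representations(transitions)
    adv = modality_aware_advantage(transitions["r"], values, labels)
    n_keep = max(1, int(keep_ratio * len(adv)))
    keep = np.argsort(adv)[-n_keep:]
    return {key: v[keep] for key, v in transitions.items()}

# Toy usage with random arrays (shapes only; no real environment).
N, ds, da = 256, 3, 2
rng = np.random.default_rng(1)
batch = {"s": rng.normal(size=(N, ds)), "a": rng.normal(size=(N, da)),
         "s2": rng.normal(size=(N, ds)), "r": rng.normal(size=N)}
shared = filter_source_data(batch, values=np.zeros(N), keep_ratio=0.25)
```

The key design point the sketch tries to convey is the ordering: samples are first assigned to a dynamics modality, and only then are their advantages compared, so filtering never ranks samples across modalities whose values are not directly comparable.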