Behavior-Invariant Task Representation Learning with Transformer-based World Models for Offline Meta-Reinforcement Learning
Abstract
Offline Meta-Reinforcement Learning leverages static datasets to enable agents to generalize to unseen environments, combining the data efficiency of offline RL with the adaptability of meta-learning, yet it faces fundamental challenges from context and policy distribution shifts. These shifts prevent agents trained on offline datasets from adapting to online environments and are further exacerbated in sparse-reward settings. As a result, agents often remain confined to the behavior patterns inherent in the offline data, failing to achieve robust generalization. In this work, we propose a novel framework that integrates information-theoretic task representation learning with a Transformer-based stochastic world model. Our approach extracts task-defining latent variables that are invariant to the behavior policy, thereby mitigating context distribution shift. To further address policy shift and model exploitation, we incorporate conservative value regularization into imagination-based rollouts, fully leveraging task representations that are sufficient for reliable adaptation. We evaluate our method on multiple offline benchmarks, where it consistently outperforms state-of-the-art approaches, achieving superior stability and generalization under severe out-of-distribution and sparse-reward conditions.