Hybrid Reinforcement Learning in Adversarial Markov Decision Processes
Abstract
We study hybrid reinforcement learning (RL) in adversarial Markov Decision Processes (MDPs), where the learner simultaneously receives on-policy feedback from the executed policy and off-policy feedback from a fixed behavior policy, and the loss functions can change arbitrarily over time. On-policy feedback enables exploration and ensures a worst-case guarantee against any comparator policy, while off-policy feedback provides a coverage-dependent guarantee that scales with the "mismatch" between the behavior and comparator policies (called the coverage ratio) and can be sharper than on-policy results whenever this ratio is small. We propose a new hybrid RL framework that accommodates adversarial losses and unknown transitions, preserving off-policy guarantees while ensuring non-trivial worst-case performance.
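For concreteness, one standard way such a mismatch is quantified in the offline and hybrid RL literature is through an occupancy-measure density ratio; the sketch below uses common notation ($d^{\pi}$ for the state-action occupancy measure of policy $\pi$, $\mu$ for the behavior policy, $\pi^{*}$ for the comparator) that is assumed here for illustration rather than taken from the abstract itself:

% One common density-ratio notion of coverage (an illustrative convention,
% not necessarily the paper's exact definition):
\[
    C(\pi^{*}; \mu) \;=\; \max_{(s,a) \in \mathcal{S} \times \mathcal{A}}
    \frac{d^{\pi^{*}}(s,a)}{d^{\mu}(s,a)}.
\]

Under a convention of this kind, off-policy guarantees degrade as $C(\pi^{*}; \mu)$ grows, and they can be sharper than on-policy bounds when the behavior policy covers the comparator well, i.e., when the ratio is small.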