Model-Free Robust Average-Reward Reinforcement Learning with Sample Complexity Analysis
Zachary Roch ⋅ George Atia ⋅ Yue Wang
Abstract
Robust reinforcement learning (RL) under the average-reward criterion is essential for long-term decision making, particularly when the deployment environment may differ from the training dynamics. However, most existing studies focus on model-based settings and provide only asymptotic guarantees, hindering principled understanding and practical deployment, especially in data-limited scenarios. We aim to close this gap by proposing a model-free algorithm, Robust Halpern Iteration (RHI). We first design our algorithm around a black-box sampling oracle that accurately estimates the worst-case performance. We then derive the finite sample complexity of RHI under the generative model setting, assuming access to this oracle. To concretely design such an oracle, we propose a $K$-order multi-level Monte-Carlo estimator, which is shown to have lower bias than prior methods. We further instantiate our design for multiple uncertainty models, including KL and $\chi^2$ divergence sets, and show that RHI achieves an $\varepsilon$-optimal robust policy with a sample complexity of $\tilde{\mathcal{O}}\left( \frac{SA\mathcal{H}^2}{\varepsilon^{2+o(1)}}\right)$, where $S$ and $A$ are the numbers of states and actions, and $\mathcal{H}$ is the robust optimal span. Our result asymptotically matches the best known complexity in robust average-reward RL.
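To give a sense of the mechanism behind RHI, the sketch below illustrates a generic (deterministic) Halpern iteration, $v_{k+1} = \beta_k v_0 + (1-\beta_k) T(v_k)$, with the common anchor schedule $\beta_k = 1/(k+2)$. This is a minimal illustration under stated assumptions, not the paper's algorithm: the operator `T`, the schedule, and the toy centered operator in the usage example are all placeholders; in RHI, $T$ would be a stochastic estimate of a robust Bellman-type operator built from the sampling oracle.

```python
import numpy as np

def halpern_iteration(T, v0, num_iters=1000):
    """Anchored (Halpern) fixed-point iteration:
        v_{k+1} = beta_k * v0 + (1 - beta_k) * T(v_k).

    T         : callable mapping a value vector to a value vector (a stand-in
                for a robust Bellman-type operator; in RHI this would be
                estimated from samples via the oracle).
    v0        : anchor point, also used as the initial iterate.
    beta_k    : 1 / (k + 2), a standard Halpern anchor schedule (an assumption
                here, not necessarily the paper's exact schedule).
    """
    v = v0.copy()
    for k in range(num_iters):
        beta = 1.0 / (k + 2)
        v = beta * v0 + (1.0 - beta) * T(v)
    return v

# Toy usage: an affine map T(v) = r + P v, mean-centered in the spirit of
# relative value iteration for average-reward problems (illustrative only).
rng = np.random.default_rng(0)
P = rng.random((4, 4))
P /= P.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
r = rng.random(4)                  # reward vector
T = lambda v: r + P @ v - np.mean(r + P @ v)
v_approx = halpern_iteration(T, np.zeros(4))
```

The anchoring toward $v_0$ is what distinguishes Halpern iteration from plain fixed-point iteration: it yields convergence guarantees for merely nonexpansive operators, which is the relevant regime in average-reward RL, where the Bellman operator is not a contraction.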