Minimax-Optimal Policy Regret in Partially Observable Markov Games
Raman Arora
Abstract
We study sequential decision-making in partially observable environments against strategic, adaptive opponents, modeled as partially observable Markov games (POMGs). The central challenge is to learn latent dynamics from partial observations while facing an adversary whose behavior depends on the learner's strategy, making standard regret notions inadequate. We develop an epoch-based optimistic, model-based framework and show it achieves policy regret $\tilde{O}\!\left(H(m+\sqrt{\beta})\sqrt{d_E T}\right)$, where $H$ is the horizon, $m$ the adversary's memory bound, $d_E$ the Eluder dimension of the joint model class, and $T$ the number of episodes. Our algorithm selects one policy per geometrically growing epoch using confidence sets built cumulatively from past data. We also prove a matching $\Omega(\sqrt{d_E T})$ lower bound (optimal up to log factors), and extend the framework to fading-memory adversaries and horizon-adaptive variants. These results give the first tight characterization of learnability in POMGs against adaptive opponents.
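The epoch-based scheme described above can be sketched in a few lines. This is a minimal, hypothetical skeleton, not the paper's algorithm: the POMG model fitting, optimistic planning, and confidence-set construction are abstracted into placeholder callables (`fit_confidence_set`, `optimistic_policy` are illustrative names), leaving only the structure the abstract specifies, one policy per geometrically growing epoch, with confidence sets built cumulatively from all past data.

```python
def epoch_schedule(T, growth=2):
    """Yield (start, length) pairs of geometrically growing epochs covering T episodes."""
    start, length = 0, 1
    while start < T:
        length = min(length, T - start)  # clip the final epoch to the episode budget
        yield start, length
        start += length
        length *= growth  # epochs grow geometrically

def run_epochs(T, collect_episode, fit_confidence_set, optimistic_policy):
    """Epoch-based loop: refit a confidence set from ALL past data at each
    epoch boundary, commit to one optimistic policy for the whole epoch."""
    data = []
    for start, length in epoch_schedule(T):
        conf_set = fit_confidence_set(data)   # cumulative: uses every past episode
        policy = optimistic_policy(conf_set)  # one policy per epoch
        for t in range(start, start + length):
            data.append(collect_episode(policy, t))
    return data
```

Committing to a single policy per epoch is what makes the guarantee a *policy* regret bound: an $m$-memory adversary's reaction to the learner stabilizes within each epoch, and the geometric growth keeps the number of policy switches to $O(\log T)$.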