Bilevel Optimization over Saddle Points of Zero-Sum Markov Games
Zihao Zheng ⋅ Irwin King ⋅ Songtao Lu
Abstract
Reinforcement learning (RL) is often hierarchical: an upper-level (UL) learner selects model parameters while a lower-level (LL) decision-making process responds, forming a nested two-level optimization structure captured by bilevel optimization. Most existing bilevel RL methods assume a single-policy LL Markov decision process (MDP), and thus miss the competitive structure in applications such as incentive design, where multiple policies interact. We study a class of bilevel optimization problems whose LL is a regularized min–max zero-sum Markov game, and whose UL optimizes through the saddle-point equilibrium induced by the LL game. We propose a penalty-based first-order policy-gradient method built on the Nikaido–Isoda function, termed penalty-augmented Nikaido–Isoda descent–ascent (PANDA), which avoids UL hypergradients and requires no second-order information by exploiting the min–max game structure. We prove that PANDA converges to stationary points of this bilevel problem without restrictive assumptions such as convexity of either the UL or LL objective. Moreover, PANDA reaches an $\epsilon$-stationary point in $\tilde{\mathcal{O}}(\epsilon^{-1})$ iterations with sample complexity $\tilde{\mathcal{O}}(\epsilon^{-3})$, matching the best-known rates for bilevel RL with single-policy LL MDPs. Experiments further demonstrate superior performance over closely related baselines.
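To fix ideas, here is a minimal sketch of the problem class described above, in illustrative notation (the UL variable $x$, LL policies $\pi$ and $\mu$, objectives $f$ and $g$, and penalty weight $\gamma$ are placeholders, not necessarily the paper's symbols). The UL objective is evaluated at the saddle point of a regularized zero-sum game at the lower level:

$$\min_{x}\; f\big(x, \pi^\star(x), \mu^\star(x)\big) \quad \text{s.t.} \quad \big(\pi^\star(x), \mu^\star(x)\big) \text{ is a saddle point of } \min_{\pi}\max_{\mu}\; g(x, \pi, \mu).$$

For a zero-sum game, the Nikaido–Isoda function

$$\psi(x, \pi, \mu) \;=\; \max_{\mu'}\, g(x, \pi, \mu') \;-\; \min_{\pi'}\, g(x, \pi', \mu)$$

is nonnegative and vanishes exactly at saddle points, so it can serve as a penalty that replaces the nested constraint with a single-level problem:

$$\min_{x,\, \pi,\, \mu}\; f(x, \pi, \mu) \;+\; \gamma\, \psi(x, \pi, \mu).$$

Stationary points of this penalized objective can be approached with policy gradients alone, consistent with the hypergradient-free, first-order character of PANDA claimed in the abstract.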