Time-Consistent Robust Multi-Objective Reinforcement Learning via a Bellman–Isaacs Weight-Adversary Recursion
Mingxi Hu ⋅ Meiling Yu
Abstract
Most multi-objective reinforcement learning (MORL) methods either condition on a fixed preference weight $w$ or consider episodic robustness, where an adversary selects a single $w$ per episode. We study a time-consistent robustness model with reactive preferences: after each transition, an opponent observes $s_{t+1}$, chooses the next weight $w_{t+1}$, and incurs a switching cost $\lambda D_\Phi(w_{t+1}\,\|\,w_t)$ based on a Bregman divergence. This yields a Bellman–Isaacs recursion with an inner weight minimization at every backup. We prove that the induced operator is a contraction and derive a Bellman-residual certificate that turns approximation error into a uniform bound on robust performance. We develop practical solvers in both tabular and deep settings using Bregman-prox inner updates and a stabilized fixed-point iteration. To evaluate robustness without optimistic critic reuse, we introduce BR-$K$, which tests policies against $K$ independently trained best-response preference adversaries. Across MO-Gymnasium benchmarks, our approach consistently improves WRR over preference-conditioned baselines under strong step-wise opponents, while keeping DRIFT smoothly controllable via $\lambda$.
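As a minimal sketch of the recursion described above (the notation here is our own assumption, not taken from the paper: a vector reward $r(s,a)$, the weight simplex $\Delta$ as the adversary's feasible set, and the convention that the step-wise adversary pays the switching cost), the robust action-value would satisfy

$$Q^*(s,a,w) \;=\; w^\top r(s,a) \;+\; \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\Big[\min_{w'\in\Delta}\Big(\max_{a'} Q^*(s',a',w') \;+\; \lambda\, D_\Phi(w'\,\|\,w)\Big)\Big],$$

so that each backup contains the inner weight minimization referred to in the abstract, with the Bregman switching cost penalizing large preference jumps.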