Near-Minimax Multi-Objective RL under Predictable Adversarial Preferences and Preference-Free Exploration in Linear MDPs
Mingxi Hu ⋅ Meiling Yu
Abstract
Multi-objective reinforcement learning (MORL) must often support preferences that change online or are specified only after data collection. We study finite-horizon MORL with vector feedback in linear MDPs under two protocols: (i) predictable adversarial preferences revealed before each episode, and (ii) reward-free, preference-free exploration (PFE), where exploration observes only transitions and must later answer arbitrary preference queries. Standard reductions are protocol-unsafe: re-scalarizing past stochastic rewards with future weights breaks the martingale structure needed for self-normalized confidence bounds, and hypervolume evaluation must account for episode-start randomization, which yields a deployable convex hull of return vectors. We propose a protocol-safe reward interface that estimates each reward coordinate via regression and performs scalarization only at query time, and we formalize deployable hypervolume semantics with a stability chain from support-function error to hypervolume error. As a result, we obtain filtration-safe regret bounds for any predictable preference sequence without discretizing the simplex (only $\log m$ dependence), matching near-minimax rates in linear MDPs, as well as sharp reward-free PFE guarantees: a (near-)minimax decision-optimal query-answering rate of $\tilde{O}(d^2 U_{\mathrm{ret}}^2/\varepsilon^2)$ and a tight separation from explicit transition-model recovery, $\Theta(d(|\mathcal{S}|-1)/\varepsilon_P^2)$. These results connect online learning, preference-free deployment, and hypervolume-aware evaluation through a single protocol-aligned theory.
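To make the query-time scalarization concrete, the following is a minimal sketch rather than the paper's implementation: the class name, the ridge regularizer, and the data layout are illustrative assumptions. Each reward coordinate is fit by ridge regression on transition features, and a preference weight $w$ is applied only when a query arrives, so no past reward is ever re-scalarized with a future weight.

```python
# Minimal sketch (illustrative, not the paper's algorithm): per-coordinate ridge
# regression of vector rewards in a linear MDP, with scalarization deferred to
# query time. Names (RewardInterface, lam) are hypothetical.
import numpy as np

class RewardInterface:
    def __init__(self, d, m, lam=1.0):
        self.A = lam * np.eye(d)       # regularized Gram matrix, shared across coordinates
        self.B = np.zeros((d, m))      # feature-weighted reward sums, one column per objective
        self.theta = np.zeros((d, m))  # per-coordinate ridge estimates

    def update(self, phi, r_vec):
        """Record one transition: feature phi in R^d, vector reward r_vec in R^m."""
        self.A += np.outer(phi, phi)
        self.B += np.outer(phi, r_vec)
        self.theta = np.linalg.solve(self.A, self.B)

    def scalarize(self, phi, w):
        """Answer a preference query w (a point on the simplex) only at query time."""
        return float(w @ (self.theta.T @ phi))
```

In this sketch the stored statistics depend only on observed features and rewards, never on preference weights, which is the sense in which scalarization happens purely at query time.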