Rethinking the Hardness of PbRL: A Provable General Regret Bound
Chenjie Mao ⋅ Yi Fan ⋅ Ning Zhang ⋅ Chongjie Zhang
Abstract
This paper studies \emph{preference-based reinforcement learning} (PbRL), where agents learn from comparative, trajectory-level feedback rather than numeric rewards. While PbRL has seen rapid empirical and theoretical progress, existing analyses are largely confined to restricted settings and fail to jointly capture the outcome-based and comparison-based nature of preference feedback. We prove that under a broad \emph{general function approximation} framework, PbRL admits a $\sqrt{T}$ regret guarantee. In particular, we introduce a simple and provably efficient algorithm, \emph{Recursive Trajectory-based Preference Q-Learning} (RTPQ), and establish its regret bound while explicitly accounting for the trajectory-level and comparative structure of preferences. Central to our analysis is a new complexity measure, the \emph{Dual Episodic Eluder Dimension} (DEED), which quantifies the intrinsic difficulty of PbRL. We show that for linear MDPs, the DEED scales as $\mathcal{O}(dH)$, yielding a regret bound of $\tilde{\mathcal{O}}\bigl(dH\sqrt{T}\max(H^{3/2},\,1/\kappa)\bigr)$, where $\kappa$ is a problem-dependent constant. This bound is near-optimal up to horizon- and problem-dependent factors when compared with standard reward-based linear MDPs. In addition, our framework recovers the best-known regret bounds in the special cases of dueling bandits and standard outcome-based reinforcement learning. Overall, our results provide a general regret guarantee for PbRL with outcome-based preference feedback and broad function approximation.