

Poster
in
Workshop: Interactive Learning with Implicit Human Feedback

Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism

Zihao Li · Zhuoran Yang · Mengdi Wang


Abstract:

In this paper, we study offline Reinforcement Learning with Human Feedback (RLHF), where the goal is to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices. We focus on the Dynamic Discrete Choice (DDC) model for modeling and understanding human choices, a model widely used to capture forward-looking, boundedly rational human decision-making. We propose a Dynamic-Choice-Pessimistic-Policy-Optimization (DCPPO) method and prove that the suboptimality of DCPPO almost matches that of classical pessimistic offline RL algorithms in terms of its dependence on distribution shift and dimension. To the best of our knowledge, this paper presents the first theoretical guarantees for off-policy offline RLHF with the dynamic discrete choice model.
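For readers unfamiliar with the DDC setting referenced above, the standard formulation (assuming i.i.d. Gumbel utility shocks and a discount factor $\gamma$, which is the usual textbook setup rather than necessarily the paper's exact notation) yields multinomial-logit choice probabilities over the agent's state-action value function:

% Standard dynamic discrete choice model (Rust-style), shown as background only;
% the notation here is illustrative and may differ from the paper's.
\[
  Q(s,a) \;=\; r(s,a) \;+\; \gamma\,\mathbb{E}_{s' \sim P(\cdot \mid s,a)}
      \Big[\log \sum_{a'} \exp\big(Q(s',a')\big)\Big],
  \qquad
  \Pr(a \mid s) \;=\; \frac{\exp\big(Q(s,a)\big)}{\sum_{a'} \exp\big(Q(s,a')\big)}.
\]

Under this model, observed human choices are soft (logit) best responses to a forward-looking value function, which is what makes the reward identifiable from choice data and is the sense in which the human is boundedly rational.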
