Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism
Zihao Li · Zhuoran Yang · Mengdi Wang
Event URL: https://openreview.net/forum?id=gxM2AUFMsK

In this paper, we study offline Reinforcement Learning with Human Feedback (RLHF), where we aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices. We focus on the Dynamic Discrete Choice (DDC) model, which is widely used to model human decision-making that is forward-looking and boundedly rational. We propose a Dynamic-Choice-Pessimistic-Policy-Optimization (DCPPO) method and prove that the suboptimality of DCPPO almost matches that of the classical pessimistic offline RL algorithm in its dependence on distribution shift and dimension. To the best of our knowledge, this paper presents the first theoretical guarantees for off-policy offline RLHF with a dynamic discrete choice model.

Author Information

Zihao Li (Princeton University)
Zhuoran Yang (Yale University)
Mengdi Wang (Alibaba Group)

More from the Same Authors