Timezone: »
Poster
Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons
Banghua Zhu · Michael Jordan · Jiantao Jiao
We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry-Luce (BTL) model (pairwise comparison) and Plackett-Luce (PL) model ($K$-wise comparison), MLE converges under certain semi-norm for the family of linear reward. On the other hand, when training a policy based on the learned reward model, we show that MLE fails while a pessimistic MLE provides policies with good performance under certain coverage assumption. We also show that under the PL model, both the true MLE and a different MLE which splits the $K$-wise comparison into pairwise comparisons converge, while the true MLE is asymptotically more efficient. Our results validate the empirical success of the existing RLHF algorithms, and provide new insights for algorithm design. Our analysis can also be applied for the problem of online RLHF and inverse reinforcement learning.
Author Information
Banghua Zhu (University of California, Berkeley)
Michael Jordan (UC Berkeley)
Jiantao Jiao (University of California, Berkeley)
More from the Same Authors
-
2021 : On the Theory of Reinforcement Learning with Once-per-Episode Feedback »
Niladri Chatterji · Aldo Pacchiano · Peter Bartlett · Michael Jordan -
2022 : Representation Learning as Finding Necessary and Sufficient Causes »
Yixin Wang · Michael Jordan -
2022 : Robust Calibration with Multi-domain Temperature Scaling »
Yaodong Yu · Stephen Bates · Yi Ma · Michael Jordan -
2023 : Importance Weighted Actor-Critic for Optimal Conservative Offline Reinforcement Learning »
Hanlin Zhu · Paria Rashidinejad · Jiantao Jiao -
2023 : SCAFF-PD: Communication Efficient Fair and Robust Federated Learning »
Yaodong Yu · Sai Praneeth Karimireddy · Yi Ma · Michael Jordan -
2023 : Federated Conformal Predictors for Distributed Uncertainty Quantification »
Charles Lu · Yaodong Yu · Sai Praneeth Karimireddy · Michael Jordan · Ramesh Raskar -
2023 : Evaluating and Incentivizing Diverse Data Contributions in Collaborative Learning »
Baihe Huang · Sai Praneeth Karimireddy · Michael Jordan -
2023 : Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons »
Banghua Zhu · Michael Jordan · Jiantao Jiao -
2023 Poster: Jump-Start Reinforcement Learning »
Ikechukwu Uchendu · Ted Xiao · Yao Lu · Banghua Zhu · Mengyuan Yan · Joséphine Simon · Matthew Bennice · Chuyuan Fu · Cong Ma · Jiantao Jiao · Sergey Levine · Karol Hausman -
2023 Poster: Online Learning in Stackelberg Games with an Omniscient Follower »
Geng Zhao · Banghua Zhu · Jiantao Jiao · Michael Jordan -
2023 Poster: Nesterov Meets Optimism: Rate-Optimal Separable Minimax Optimization »
Chris Junchi Li · Huizhuo Yuan · Gauthier Gidel · Quanquan Gu · Michael Jordan -
2023 Poster: Federated Conformal Predictors for Distributed Uncertainty Quantification »
Charles Lu · Yaodong Yu · Sai Praneeth Karimireddy · Michael Jordan · Ramesh Raskar -
2022 : Michael I. Jordan: Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control »
Michael Jordan -
2022 Poster: No-Regret Learning in Partially-Informed Auctions »
Wenshuo Guo · Michael Jordan · Ellen Vitercik -
2022 Spotlight: No-Regret Learning in Partially-Informed Auctions »
Wenshuo Guo · Michael Jordan · Ellen Vitercik -
2022 Poster: Image-to-Image Regression with Distribution-Free Uncertainty Quantification and Applications in Imaging »
Anastasios Angelopoulos · Amit Pal Kohli · Stephen Bates · Michael Jordan · Jitendra Malik · Thayer Alshaabi · Srigokul Upadhyayula · Yaniv Romano -
2022 Poster: Online Nonsubmodular Minimization with Delayed Costs: From Full Information to Bandit Feedback »
Tianyi Lin · Aldo Pacchiano · Yaodong Yu · Michael Jordan -
2022 Poster: Welfare Maximization in Competitive Equilibrium: Reinforcement Learning for Markov Exchange Economy »
ZHIHAN LIU · Lu Miao · Zhaoran Wang · Michael Jordan · Zhuoran Yang -
2022 Spotlight: Welfare Maximization in Competitive Equilibrium: Reinforcement Learning for Markov Exchange Economy »
ZHIHAN LIU · Lu Miao · Zhaoran Wang · Michael Jordan · Zhuoran Yang -
2022 Spotlight: Image-to-Image Regression with Distribution-Free Uncertainty Quantification and Applications in Imaging »
Anastasios Angelopoulos · Amit Pal Kohli · Stephen Bates · Michael Jordan · Jitendra Malik · Thayer Alshaabi · Srigokul Upadhyayula · Yaniv Romano -
2022 Spotlight: Online Nonsubmodular Minimization with Delayed Costs: From Full Information to Bandit Feedback »
Tianyi Lin · Aldo Pacchiano · Yaodong Yu · Michael Jordan -
2021 : On the Theory of Reinforcement Learning with Once-per-Episode Feedback »
Niladri Chatterji · Aldo Pacchiano · Peter Bartlett · Michael Jordan -
2021 Poster: Provable Meta-Learning of Linear Representations »
Nilesh Tripuraneni · Chi Jin · Michael Jordan -
2021 Poster: Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data »
Esther Rolf · Theodora Worledge · Benjamin Recht · Michael Jordan -
2021 Poster: Resource Allocation in Multi-armed Bandit Exploration: Overcoming Sublinear Scaling with Adaptive Parallelism »
Brijen Thananjeyan · Kirthevasan Kandasamy · Ion Stoica · Michael Jordan · Ken Goldberg · Joseph E Gonzalez -
2021 Spotlight: Provable Meta-Learning of Linear Representations »
Nilesh Tripuraneni · Chi Jin · Michael Jordan -
2021 Oral: Resource Allocation in Multi-armed Bandit Exploration: Overcoming Sublinear Scaling with Adaptive Parallelism »
Brijen Thananjeyan · Kirthevasan Kandasamy · Ion Stoica · Michael Jordan · Ken Goldberg · Joseph E Gonzalez -
2021 Spotlight: Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data »
Esther Rolf · Theodora Worledge · Benjamin Recht · Michael Jordan -
2020 Poster: On Thompson Sampling with Langevin Algorithms »
Eric Mazumdar · Aldo Pacchiano · Yian Ma · Michael Jordan · Peter Bartlett -
2020 Poster: Accelerated Message Passing for Entropy-Regularized MAP Inference »
Jonathan Lee · Aldo Pacchiano · Peter Bartlett · Michael Jordan -
2020 Poster: On Gradient Descent Ascent for Nonconvex-Concave Minimax Problems »
Tianyi Lin · Chi Jin · Michael Jordan -
2020 Poster: Continuous-time Lower Bounds for Gradient-based Algorithms »
Michael Muehlebach · Michael Jordan -
2020 Poster: Stochastic Gradient and Langevin Processes »
Xiang Cheng · Dong Yin · Peter Bartlett · Michael Jordan -
2020 Poster: Learning to Score Behaviors for Guided Policy Optimization »
Aldo Pacchiano · Jack Parker-Holder · Yunhao Tang · Krzysztof Choromanski · Anna Choromanska · Michael Jordan -
2020 Poster: Finite-Time Last-Iterate Convergence for Multi-Agent Learning in Games »
Tianyi Lin · Zhengyuan Zhou · Panayotis Mertikopoulos · Michael Jordan -
2020 Poster: What is Local Optimality in Nonconvex-Nonconcave Minimax Optimization? »
Chi Jin · Praneeth Netrapalli · Michael Jordan -
2019 Poster: Bridging Theory and Algorithm for Domain Adaptation »
Yuchen Zhang · Tianle Liu · Mingsheng Long · Michael Jordan -
2019 Oral: Bridging Theory and Algorithm for Domain Adaptation »
Yuchen Zhang · Tianle Liu · Mingsheng Long · Michael Jordan -
2019 Poster: Transferable Adversarial Training: A General Approach to Adapting Deep Classifiers »
Hong Liu · Mingsheng Long · Jianmin Wang · Michael Jordan -
2019 Poster: Towards Accurate Model Selection in Deep Unsupervised Domain Adaptation »
Kaichao You · Ximei Wang · Mingsheng Long · Michael Jordan -
2019 Poster: A Dynamical Systems Perspective on Nesterov Acceleration »
Michael Muehlebach · Michael Jordan -
2019 Poster: Theoretically Principled Trade-off between Robustness and Accuracy »
Hongyang Zhang · Yaodong Yu · Jiantao Jiao · Eric Xing · Laurent El Ghaoui · Michael Jordan -
2019 Oral: A Dynamical Systems Perspective on Nesterov Acceleration »
Michael Muehlebach · Michael Jordan -
2019 Oral: Towards Accurate Model Selection in Deep Unsupervised Domain Adaptation »
Kaichao You · Ximei Wang · Mingsheng Long · Michael Jordan -
2019 Oral: Transferable Adversarial Training: A General Approach to Adapting Deep Classifiers »
Hong Liu · Mingsheng Long · Jianmin Wang · Michael Jordan -
2019 Oral: Theoretically Principled Trade-off between Robustness and Accuracy »
Hongyang Zhang · Yaodong Yu · Jiantao Jiao · Eric Xing · Laurent El Ghaoui · Michael Jordan -
2019 Poster: On Efficient Optimal Transport: An Analysis of Greedy and Accelerated Mirror Descent Algorithms »
Tianyi Lin · Nhat Ho · Michael Jordan -
2019 Poster: Rao-Blackwellized Stochastic Gradients for Discrete Distributions »
Runjing Liu · Jeffrey Regier · Nilesh Tripuraneni · Michael Jordan · Jon McAuliffe -
2019 Oral: Rao-Blackwellized Stochastic Gradients for Discrete Distributions »
Runjing Liu · Jeffrey Regier · Nilesh Tripuraneni · Michael Jordan · Jon McAuliffe -
2019 Oral: On Efficient Optimal Transport: An Analysis of Greedy and Accelerated Mirror Descent Algorithms »
Tianyi Lin · Nhat Ho · Michael Jordan -
2018 Poster: On the Theory of Variance Reduction for Stochastic Gradient Monte Carlo »
Niladri Chatterji · Nicolas Flammarion · Yian Ma · Peter Bartlett · Michael Jordan -
2018 Poster: RLlib: Abstractions for Distributed Reinforcement Learning »
Eric Liang · Richard Liaw · Robert Nishihara · Philipp Moritz · Roy Fox · Ken Goldberg · Joseph E Gonzalez · Michael Jordan · Ion Stoica -
2018 Oral: On the Theory of Variance Reduction for Stochastic Gradient Monte Carlo »
Niladri Chatterji · Nicolas Flammarion · Yian Ma · Peter Bartlett · Michael Jordan -
2018 Oral: RLlib: Abstractions for Distributed Reinforcement Learning »
Eric Liang · Richard Liaw · Robert Nishihara · Philipp Moritz · Roy Fox · Ken Goldberg · Joseph E Gonzalez · Michael Jordan · Ion Stoica -
2018 Poster: SAFFRON: an Adaptive Algorithm for Online Control of the False Discovery Rate »
Aaditya Ramdas · Tijana Zrnic · Martin Wainwright · Michael Jordan -
2018 Oral: SAFFRON: an Adaptive Algorithm for Online Control of the False Discovery Rate »
Aaditya Ramdas · Tijana Zrnic · Martin Wainwright · Michael Jordan -
2018 Poster: Learning to Explain: An Information-Theoretic Perspective on Model Interpretation »
Jianbo Chen · Le Song · Martin Wainwright · Michael Jordan -
2018 Oral: Learning to Explain: An Information-Theoretic Perspective on Model Interpretation »
Jianbo Chen · Le Song · Martin Wainwright · Michael Jordan -
2017 Poster: How to Escape Saddle Points Efficiently »
Chi Jin · Rong Ge · Praneeth Netrapalli · Sham Kakade · Michael Jordan -
2017 Talk: How to Escape Saddle Points Efficiently »
Chi Jin · Rong Ge · Praneeth Netrapalli · Sham Kakade · Michael Jordan -
2017 Poster: Deep Transfer Learning with Joint Adaptation Networks »
Mingsheng Long · Han Zhu · Jianmin Wang · Michael Jordan -
2017 Poster: Breaking Locality Accelerates Block Gauss-Seidel »
Stephen Tu · Shivaram Venkataraman · Ashia Wilson · Alex Gittens · Michael Jordan · Benjamin Recht -
2017 Talk: Deep Transfer Learning with Joint Adaptation Networks »
Mingsheng Long · Han Zhu · Jianmin Wang · Michael Jordan -
2017 Talk: Breaking Locality Accelerates Block Gauss-Seidel »
Stephen Tu · Shivaram Venkataraman · Ashia Wilson · Alex Gittens · Michael Jordan · Benjamin Recht