Contextual Bandits and Imitation Learning with Preference-Based Active Queries
Ayush Sekhari · Karthik Sridharan · Wen Sun · Runzhe Wu
Event URL: https://openreview.net/forum?id=zUM6t3bYZp
We consider the problem of contextual bandits and imitation learning, where the learner lacks direct knowledge of the executed action's reward. Instead, the learner can actively query the expert at each round to compare two actions and receive noisy preference feedback. The learner's objective is two-fold: to minimize the regret associated with the executed actions while simultaneously minimizing the number of comparison queries made to the expert. In this paper, we assume that the learner has access to a function class that can represent the expert's preference model under appropriate link functions, and we present an algorithm that leverages an online regression oracle with respect to this function class. For the contextual bandit setting, our algorithm achieves a regret bound that combines the best of both worlds, scaling as $O(\min\{\sqrt{T}, d/\Delta\})$, where $T$ is the number of interactions, $d$ is the eluder dimension of the function class, and $\Delta$ is the minimum preference of the optimal action over any suboptimal action across all contexts. Our algorithm does not require knowledge of $\Delta$, and the obtained regret bound is comparable to what can be achieved in the standard contextual bandit setting where the learner observes reward signals at each round. Additionally, our algorithm makes only $O(\min\{T, d^2/\Delta^2\})$ queries to the expert. We then extend our algorithm to the imitation learning setting, where the agent interacts with an unknown environment in episodes of length $H$, and provide similar guarantees on regret and query complexity. Interestingly, with preference-based feedback, our imitation learning algorithm can learn a policy that outperforms a suboptimal expert, matching the result of interactive imitation learning algorithms [Ross and Bagnell, 2014] that require access to both the expert's actions and reward signals.
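To make the interaction protocol concrete, below is a minimal Python sketch of a preference-query loop of the kind the abstract describes: the learner maintains an estimate of the expert's preference model via an online regression oracle and queries the expert only when its current estimate cannot confidently separate the two leading candidate actions. This is an illustrative sketch, not the paper's algorithm; the logit link, the feature map `phi`, the `OnlineLogisticOracle` class, and the fixed margin threshold are all assumptions made for this example.

```python
import numpy as np

# A minimal sketch (not the paper's algorithm): a contextual-bandit loop with
# preference-based active queries. The learner fits the expert's preference
# model P(a preferred over a' | x) with an online logistic regression oracle
# and asks the expert for a comparison only when the two leading candidate
# actions are hard to distinguish under the current estimate.

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class OnlineLogisticOracle:
    """Online logistic regression on difference features phi(x, a) - phi(x, a')."""

    def __init__(self, dim, lr=0.5):
        self.w = np.zeros(dim)
        self.lr = lr

    def predict(self, z):
        # Estimated probability that the first action is preferred.
        return sigmoid(self.w @ z)

    def update(self, z, y):
        # One gradient step on the logistic loss with label y in {0, 1}.
        self.w += self.lr * (y - self.predict(z)) * z

def phi(x, a, num_actions):
    # Illustrative feature map: the context copied into the block of action a.
    out = np.zeros(len(x) * num_actions)
    out[a * len(x):(a + 1) * len(x)] = x
    return out

T, d_ctx, num_actions = 500, 3, 3
oracle = OnlineLogisticOracle(d_ctx * num_actions)
w_star = rng.normal(size=d_ctx * num_actions)  # unknown expert preference model
num_queries = 0

for t in range(T):
    x = rng.normal(size=d_ctx)
    feats = [phi(x, a, num_actions) for a in range(num_actions)]
    scores = [oracle.w @ f for f in feats]
    order = np.argsort(scores)
    best, runner_up = order[-1], order[-2]
    z = feats[best] - feats[runner_up]
    if abs(oracle.predict(z) - 0.5) < 0.1:  # uncertain: ask the expert
        num_queries += 1
        y = float(rng.random() < sigmoid(w_star @ z))  # noisy preference feedback
        oracle.update(z, y)
    # The learner executes `best` regardless; its reward is never observed.

print(f"{num_queries} expert queries over {T} rounds")
```

In the paper's analysis the decision to query is driven by the regression oracle's confidence, which is what yields the $O(\min\{T, d^2/\Delta^2\})$ query bound quoted above; the fixed threshold used here is only a stand-in for that rule.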
Author Information
Ayush Sekhari (Cornell University)
Karthik Sridharan (Cornell University)
Wen Sun (Cornell University)
Runzhe Wu (Cornell University)
More from the Same Authors
- 2021 : Remember What You Want to Forget: Algorithms for Machine Unlearning
  Ayush Sekhari · Jayadev Acharya · Gautam Kamath · Ananda Theertha Suresh
- 2021 : Corruption Robust Offline Reinforcement Learning
  Xuezhou Zhang · Yiding Chen · Jerry Zhu · Wen Sun
- 2021 : Mitigating Covariate Shift in Imitation Learning via Offline Data Without Great Coverage
  Jonathan Chang · Masatoshi Uehara · Dhruv Sreenivas · Rahul Kidambi · Wen Sun
- 2021 : MobILE: Model-Based Imitation Learning From Observation Alone
  Rahul Kidambi · Jonathan Chang · Wen Sun
- 2023 : When is Agnostic Reinforcement Learning Statistically Tractable?
  Gene Li · Zeyu Jia · Alexander Rakhlin · Ayush Sekhari · Nati Srebro
- 2023 : Representation Learning in Low-rank Slate-based Recommender Systems
  Yijia Dai · Wen Sun
- 2023 : Provable Offline Reinforcement Learning with Human Feedback
  Wenhao Zhan · Masatoshi Uehara · Nathan Kallus · Jason Lee · Wen Sun
- 2023 : Selective Sampling and Imitation Learning via Online Regression
  Ayush Sekhari · Karthik Sridharan · Wen Sun · Runzhe Wu
- 2023 : Contextual Bandits and Imitation Learning with Preference-Based Active Queries
  Ayush Sekhari · Karthik Sridharan · Wen Sun · Runzhe Wu
- 2023 : How to Query Human Feedback Efficiently in RL?
  Wenhao Zhan · Masatoshi Uehara · Wen Sun · Jason Lee
- 2023 Poster: Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR
  Kaiwen Wang · Nathan Kallus · Wen Sun
- 2023 Poster: Multi-task Representation Learning for Pure Exploration in Linear Bandits
  Yihan Du · Longbo Huang · Wen Sun
- 2023 Poster: Distributional Offline Policy Evaluation with Predictive Error Guarantees
  Runzhe Wu · Masatoshi Uehara · Wen Sun
- 2023 Poster: Computationally Efficient PAC RL in POMDPs with Latent Determinism and Conditional Embeddings
  Masatoshi Uehara · Ayush Sekhari · Jason Lee · Nathan Kallus · Wen Sun
- 2022 Poster: Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning approach
  Xuezhou Zhang · Yuda Song · Masatoshi Uehara · Mengdi Wang · Alekh Agarwal · Wen Sun
- 2022 Poster: Learning Bellman Complete Representations for Offline Policy Evaluation
  Jonathan Chang · Kaiwen Wang · Nathan Kallus · Wen Sun
- 2022 Spotlight: Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning approach
  Xuezhou Zhang · Yuda Song · Masatoshi Uehara · Mengdi Wang · Alekh Agarwal · Wen Sun
- 2022 Oral: Learning Bellman Complete Representations for Offline Policy Evaluation
  Jonathan Chang · Kaiwen Wang · Nathan Kallus · Wen Sun
- 2022 Poster: Guarantees for Epsilon-Greedy Reinforcement Learning with Function Approximation
  Chris Dann · Yishay Mansour · Mehryar Mohri · Ayush Sekhari · Karthik Sridharan
- 2022 Spotlight: Guarantees for Epsilon-Greedy Reinforcement Learning with Function Approximation
  Chris Dann · Yishay Mansour · Mehryar Mohri · Ayush Sekhari · Karthik Sridharan
- 2021 Poster: Fairness of Exposure in Stochastic Bandits
  Luke Lequn Wang · Yiwei Bai · Wen Sun · Thorsten Joachims
- 2021 Spotlight: Fairness of Exposure in Stochastic Bandits
  Luke Lequn Wang · Yiwei Bai · Wen Sun · Thorsten Joachims
- 2021 Poster: Robust Policy Gradient against Strong Data Corruption
  Xuezhou Zhang · Yiding Chen · Jerry Zhu · Wen Sun
- 2021 Spotlight: Robust Policy Gradient against Strong Data Corruption
  Xuezhou Zhang · Yiding Chen · Jerry Zhu · Wen Sun
- 2021 Poster: Bilinear Classes: A Structural Framework for Provable Generalization in RL
  Simon Du · Sham Kakade · Jason Lee · Shachar Lovett · Gaurav Mahajan · Wen Sun · Ruosong Wang
- 2021 Oral: Bilinear Classes: A Structural Framework for Provable Generalization in RL
  Simon Du · Sham Kakade · Jason Lee · Shachar Lovett · Gaurav Mahajan · Wen Sun · Ruosong Wang
- 2021 Poster: PC-MLP: Model-based Reinforcement Learning with Policy Cover Guided Exploration
  Yuda Song · Wen Sun
- 2021 Spotlight: PC-MLP: Model-based Reinforcement Learning with Policy Cover Guided Exploration
  Yuda Song · Wen Sun
- 2019 Poster: Distributed Learning with Sublinear Communication
  Jayadev Acharya · Christopher De Sa · Dylan Foster · Karthik Sridharan
- 2019 Oral: Distributed Learning with Sublinear Communication
  Jayadev Acharya · Christopher De Sa · Dylan Foster · Karthik Sridharan
- 2019 Poster: Training Well-Generalizing Classifiers for Fairness Metrics and Other Data-Dependent Constraints
  Andrew Cotter · Maya Gupta · Heinrich Jiang · Nati Srebro · Karthik Sridharan · Serena Wang · Blake Woodworth · Seungil You
- 2019 Oral: Training Well-Generalizing Classifiers for Fairness Metrics and Other Data-Dependent Constraints
  Andrew Cotter · Maya Gupta · Heinrich Jiang · Nati Srebro · Karthik Sridharan · Serena Wang · Blake Woodworth · Seungil You