Preference-based feedback is important for many applications where direct evaluation of a reward function is not feasible. A notable recent example arises in reinforcement learning from human feedback on large language models. For many of these applications, the cost of acquiring human feedback can be substantial or even prohibitive. In this work, we take advantage of the fact that the agent can often choose the contexts at which to obtain human feedback in order to identify a good policy most efficiently, and we introduce the offline contextual dueling bandit setting. We give an upper-confidence-bound-style algorithm for this setting and prove a regret bound. We also give empirical confirmation that this method outperforms a similar strategy that uses uniformly sampled contexts.
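The abstract's key idea — actively choosing the contexts at which to query preference feedback, rather than sampling them uniformly — can be illustrated with a minimal sketch. This is not the paper's algorithm; it is a generic UCB-style dueling-bandit loop over discrete contexts and actions, with a Bradley–Terry preference model and a count-based confidence bonus, all of which are assumptions for illustration only. The learner queries the context where the identity of the best action is least resolved (largest gap between the leader's upper bound and the runner-up's lower bound).

```python
import numpy as np

rng = np.random.default_rng(0)
n_contexts, n_actions = 5, 4

# Hypothetical ground-truth rewards (unknown to the learner); duel outcomes
# follow a Bradley-Terry model: P(a beats b | x) = sigmoid(r[x,a] - r[x,b]).
true_r = rng.normal(size=(n_contexts, n_actions))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Sufficient statistics of duel outcomes per (context, action).
wins = np.ones((n_contexts, n_actions))      # optimistic prior: 1 win ...
plays = 2.0 * np.ones((n_contexts, n_actions))  # ... out of 2 plays

for t in range(500):
    # Empirical win rate with a UCB/LCB confidence bonus.
    mean = wins / plays
    bonus = np.sqrt(2.0 * np.log(t + 2) / plays)
    ucb, lcb = mean + bonus, mean - bonus

    # Active context selection: query where the best action is least
    # resolved, i.e. the leader's UCB most exceeds the runner-up's LCB.
    top = np.argmax(ucb, axis=1)
    gap = ucb[np.arange(n_contexts), top] - np.sort(lcb, axis=1)[:, -2]
    x = int(np.argmax(gap))

    # Duel the optimistic leader against its strongest challenger.
    a = int(top[x])
    b = int(np.argmax(np.where(np.arange(n_actions) == a, -np.inf, ucb[x])))
    a_wins = rng.random() < sigmoid(true_r[x, a] - true_r[x, b])
    for arm, won in ((a, a_wins), (b, not a_wins)):
        plays[x, arm] += 1.0
        wins[x, arm] += float(won)

# Greedy policy induced by the accumulated duel statistics.
policy = np.argmax(wins / plays, axis=1)
```

The uniform-context baseline mentioned in the abstract would replace the `gap`-based choice of `x` with `x = int(rng.integers(n_contexts))`; the active rule concentrates the preference budget on contexts where the policy is still ambiguous.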
Author Information
Viraj Mehta (Carnegie Mellon University)
Ojash Neopane (Carnegie Mellon University)
Vikramjeet Das (Carnegie Mellon University)
Sen Lin (Stanford University)
Jeff Schneider (Carnegie Mellon University)
Willie Neiswanger (Stanford University)
More from the Same Authors
- 2023 : Distributional Distance Classifiers for Goal-Conditioned Reinforcement Learning »
  Ravi Tej Akella · Benjamin Eysenbach · Jeff Schneider · Ruslan Salakhutdinov
- 2023 Poster: Learning Temporally Abstract World Models without Online Experimentation »
  Benjamin Freed · Siddarth Venkatraman · Guillaume Sartoretti · Jeff Schneider · Howie Choset
- 2021 Poster: Representational aspects of depth and conditioning in normalizing flows »
  Frederic Koehler · Viraj Mehta · Andrej Risteski
- 2021 Spotlight: Representational aspects of depth and conditioning in normalizing flows »
  Frederic Koehler · Viraj Mehta · Andrej Risteski
- 2019 Poster: Myopic Posterior Sampling for Adaptive Goal Oriented Design of Experiments »
  Kirthevasan Kandasamy · Willie Neiswanger · Reed Zhang · Akshay Krishnamurthy · Jeff Schneider · Barnabás Póczos
- 2019 Oral: Myopic Posterior Sampling for Adaptive Goal Oriented Design of Experiments »
  Kirthevasan Kandasamy · Willie Neiswanger · Reed Zhang · Akshay Krishnamurthy · Jeff Schneider · Barnabás Póczos
- 2018 Poster: Transformation Autoregressive Networks »
  Junier Oliva · Kumar Avinava Dubey · Manzil Zaheer · Barnabás Póczos · Ruslan Salakhutdinov · Eric Xing · Jeff Schneider
- 2018 Oral: Transformation Autoregressive Networks »
  Junier Oliva · Kumar Avinava Dubey · Manzil Zaheer · Barnabás Póczos · Ruslan Salakhutdinov · Eric Xing · Jeff Schneider
- 2017 Poster: Multi-fidelity Bayesian Optimisation with Continuous Approximations »
  Kirthevasan Kandasamy · Gautam Dasarathy · Barnabás Póczos · Jeff Schneider
- 2017 Talk: Multi-fidelity Bayesian Optimisation with Continuous Approximations »
  Kirthevasan Kandasamy · Gautam Dasarathy · Barnabás Póczos · Jeff Schneider
- 2017 Poster: The Statistical Recurrent Unit »
  Junier Oliva · Barnabás Póczos · Jeff Schneider
- 2017 Talk: The Statistical Recurrent Unit »
  Junier Oliva · Barnabás Póczos · Jeff Schneider