Contextual Set Selection Under Human Feedback With Model Misspecification
Shuo Yang · Rajat Sen · Sujay Sanghavi
Event URL: https://openreview.net/forum?id=6Z2uBx5bpZ
A common and efficient way to elicit human feedback is to present users with a set of options and record their relative preferences among them. The contextual combinatorial bandits problem captures this setting algorithmically; however, it implicitly assumes an underlying consistent reward model for the options. In the setting of human feedback (which may, for example, use different reviewers for different samples), there may be no such model; that is, the model is *misspecified*. We first derive a lower bound for our setting, and then show that model misspecification can lead to catastrophic failure of the C$^2$UCB algorithm (which is otherwise near-optimal when there is no misspecification). We then propose two algorithms: the first algorithm (MC$^2$UCB) requires knowledge of the level of misspecification $\epsilon$ (i.e., the absolute deviation from the closest well-specified model). The second algorithm is a general framework that extends to unknown $\epsilon$. Our theoretical analysis shows that both algorithms achieve near-optimal regret. Further empirical evaluations, conducted both in a synthetic environment and a real-world application of movie recommendations, demonstrate the adaptability of our algorithm to various degrees of misspecification. This highlights the algorithm's ability to effectively learn from human feedback, even with model misspecification.
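
To make the setting concrete, here is a minimal sketch of a generic linear-UCB top-$k$ set selection with a confidence width that can be inflated by a known misspecification level. It is an illustration under stated assumptions, not the paper's MC$^2$UCB: the abstract does not specify the form of either algorithm, so the inflation rule `alpha + eps * sqrt(t)`, the top-$k$ selection rule, and all function and parameter names (`make_state`, `ucb_scores`, `select_set`, `update`, `alpha`, `lam`) are hypothetical.

```python
import math


def make_state(d, lam=1.0):
    """Ridge-regression state: inverse Gram matrix, response vector, round counter."""
    v_inv = [[(1.0 / lam if i == j else 0.0) for j in range(d)] for i in range(d)]
    return {"v_inv": v_inv, "b": [0.0] * d, "t": 0}


def _matvec(m, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]


def _dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def ucb_scores(state, contexts, alpha=1.0, eps=0.0):
    """Optimistic score per option: estimated reward plus an exploration bonus."""
    theta = _matvec(state["v_inv"], state["b"])  # ridge estimate of the reward model
    # ASSUMPTION: widen the confidence width by eps * sqrt(t) to absorb
    # misspecification; the paper's actual rule is not given in the abstract.
    width = alpha + eps * math.sqrt(state["t"])
    scores = []
    for x in contexts:
        norm = math.sqrt(max(_dot(x, _matvec(state["v_inv"], x)), 0.0))
        scores.append(_dot(theta, x) + width * norm)
    return scores


def select_set(state, contexts, k, alpha=1.0, eps=0.0):
    """Pick the k options with the highest optimistic scores (ties keep input order)."""
    scores = ucb_scores(state, contexts, alpha, eps)
    order = sorted(range(len(contexts)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def update(state, chosen_contexts, rewards):
    """Fold the feedback for one round into the state via Sherman-Morrison updates."""
    for x, r in zip(chosen_contexts, rewards):
        vx = _matvec(state["v_inv"], x)
        denom = 1.0 + _dot(x, vx)
        for i in range(len(x)):
            for j in range(len(x)):
                state["v_inv"][i][j] -= vx[i] * vx[j] / denom
            state["b"][i] += r * x[i]
    state["t"] += 1
```

With `eps=0.0` the sketch reduces to a plain linear-UCB top-$k$ rule; a positive `eps` only makes the scores more optimistic as rounds accumulate, which is one common way to guard against a reward model that is only approximately linear.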

Author Information

Shuo Yang (University of Texas at Austin)
Rajat Sen (Google Research)

I am a 4th-year PhD student in WNCG, UT Austin. I am advised by [Dr. Sanjay Shakkottai](http://users.ece.utexas.edu/~shakkott/Sanjay_Shakkottai/Contact.html). I received my Bachelor's degree in ECE from IIT Kharagpur in 2013. I spent most of my childhood in Durgapur and Kolkata, West Bengal, India. My research interests include online learning (especially multi-armed bandit problems), causality, learning in queueing systems, recommendation systems, and social networks. I like to work on real-world problems that allow rigorous theoretical analysis.

Sujay Sanghavi (UT Austin)

More from the Same Authors