UCB Provably Learns From Inconsistent Human Feedback
Shuo Yang · Tongzheng Ren · Inderjit Dhillon · Sujay Sanghavi
Event URL: https://openreview.net/forum?id=ylbkpVlWmY
In this paper, we study how to learn from inconsistent human feedback in the setting of combinatorial bandits with semi-bandit feedback, where in every time step an online learner chooses a size-$k$ set of arms, observes a stochastic reward for each chosen arm, and endeavors to maximize the sum of the per-arm rewards in the set. We consider the challenging setting where these per-arm rewards are not only set-dependent but also inconsistent: the expected reward of arm $a$ can be larger than that of arm $b$ in one set, but smaller in another. Inconsistency is often observed in practice, falls outside the purview of many popular semi-bandit models, and in general can make finding the optimal set combinatorially hard. Motivated by the observed practice of using UCB-based algorithms even in settings where they are not strictly justified, our main contribution is a simple assumption: weak optimal set consistency. We show that this assumption allows for inconsistent set-dependent arm rewards and also subsumes many widely used models for semi-bandit feedback. Most importantly, we show that it ensures that a simple UCB-based algorithm finds the optimal set and achieves $O\left(\min\left(\frac{k^3 n \log T}{\epsilon}, k^2\sqrt{n T \log T}\right)\right)$ regret, which nearly matches the lower bound.
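To make the setting concrete, the following is a minimal sketch of a generic top-$k$ UCB rule with semi-bandit feedback. It is not the authors' algorithm or analysis: the function names, the confidence radius $\sqrt{2 \log t / N_a}$, and the toy environment (in which an arm's mean reward depends on which other arms are chosen with it) are all assumptions made for illustration.

```python
import math
import random

# Hypothetical base means for six arms (illustrative values, not from the paper).
BASE = [0.9, 0.8, 0.2, 0.2, 0.1, 0.1]


def toy_env(chosen, rng):
    """Hypothetical set-dependent environment with Bernoulli rewards.

    An arm's mean is discounted by 0.9 for every lower-numbered arm that is
    chosen alongside it, so the same arm has different means in different sets.
    """
    order = sorted(chosen)
    rewards = []
    for a in chosen:
        p = BASE[a] * (0.9 ** order.index(a))
        rewards.append(1.0 if rng.random() < p else 0.0)
    return rewards


def top_k_ucb(sample_rewards, n, k, horizon, seed=0):
    """Run a generic top-k UCB rule; returns (per-arm counts, total reward)."""
    rng = random.Random(seed)
    counts = [0] * n    # number of times each arm has been observed
    means = [0.0] * n   # running empirical mean reward of each arm
    total = 0.0
    for t in range(1, horizon + 1):
        def index(a):
            # Unobserved arms get an infinite index so each is tried once.
            if counts[a] == 0:
                return math.inf
            return means[a] + math.sqrt(2.0 * math.log(t) / counts[a])

        # Choose the k arms with the largest upper confidence bounds.
        chosen = sorted(range(n), key=index, reverse=True)[:k]
        # Semi-bandit feedback: one reward is observed per chosen arm.
        for a, r in zip(chosen, sample_rewards(chosen, rng)):
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]
            total += r
    return counts, total


counts, total = top_k_ucb(toy_env, n=len(BASE), k=2, horizon=3000)
print(counts, total)
```

In this toy environment the two arms with the largest base means also form the best set, so the rule concentrates its pulls on them. Whether a rule of this kind succeeds under genuinely inconsistent rewards is exactly what the weak optimal set consistency assumption in the abstract addresses.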

Author Information

Shuo Yang (University of Texas at Austin)
Tongzheng Ren (UT Austin / Google Brain)
Inderjit Dhillon (UT Austin & Amazon)

Inderjit Dhillon is the Gottesman Family Centennial Professor of Computer Science and Mathematics at UT Austin, where he is also the Director of the ICES Center for Big Data Analytics. His main research interests are in big data, machine learning, network analysis, linear algebra, and optimization. He received his B.Tech. degree from IIT Bombay and his Ph.D. from UC Berkeley. Inderjit has received several awards, including the ICES Distinguished Research Award, the SIAM Outstanding Paper Prize, the Moncrief Grand Challenge Award, the SIAM Linear Algebra Prize, the University Research Excellence Award, and the NSF CAREER Award. He has published over 160 journal and conference papers, and has served on the editorial boards of the Journal of Machine Learning Research, the IEEE Transactions on Pattern Analysis and Machine Intelligence, Foundations and Trends in Machine Learning, and the SIAM Journal on Matrix Analysis and Applications. Inderjit is an ACM Fellow, an IEEE Fellow, a SIAM Fellow, and an AAAS Fellow.

Sujay Sanghavi (UT Austin)

More from the Same Authors