Beyond $log^2(T)$ regret for decentralized bandits in matching markets

Soumya Basu · Karthik Abinav Sankararaman · Abishek Sankararaman

Keywords: [ Reinforcement Learning and Planning -> Bandits ]

Abstract: We design decentralized algorithms for regret minimization in the two sided matching market with one-sided bandit feedback that significantly improves upon the prior works (Liu et al.\,2020a, Sankararaman et al.\,2020, Liu et al.\,2020b). First, for general markets, for any $\varepsilon > 0$, we design an algorithm that achieves a $O(\log^{1+\varepsilon}(T))$ regret to the agent-optimal stable matching, with unknown time horizon $T$, improving upon the $O(\log^{2}(T))$ regret achieved in (Liu et al.\,2020b). Second, we provide the optimal $\Theta(\log(T))$ agent-optimal regret for markets satisfying {\em uniqueness consistency} -- markets where leaving participants don't alter the original stable matching. Previously, $\Theta(\log(T))$ regret was achievable (Sankararaman et al.\,2020, Liu et al.\,2020b) in the much restricted {\em serial dictatorship} setting, when all arms have the same preference over the agents. We propose a phase based algorithm, where in each phase, besides deleting the globally communicated dominated arms the agents locally delete arms with which they collide often. This \emph{local deletion} is pivotal in breaking deadlocks arising from rank heterogeneity of agents across arms. We further demonstrate superiority of our algorithm over existing works through simulations.

Chat is not available.