Adaptive Bandit Algorithms for Contextual Matching Markets
Abstract
We study bandit learning in matching markets, where players and arms constitute the two market sides and each player's utility is linear in the arm contexts. In each round, new arms arrive with observable contexts, and the algorithm matches them to players, aiming to minimize each player's regret against a stable-matching benchmark. This contextual structure creates significant complexity: a subtle context shift can slightly alter one player's utility yet completely reconfigure the underlying benchmark, causing large regret spikes for other players. We address this in two settings: stochastic contexts, drawn from a latent distribution, and adversarial contexts, which may be arbitrary. In the stochastic setting, we introduce a novel minimum preference gap to characterize the learning difficulty; in the adversarial setting, we propose a tractable regret notion that remains valid under arbitrary contexts. We develop fully adaptive algorithms for both settings and establish instance-dependent poly-logarithmic regret upper bounds. In the stochastic case, we also prove matching instance-independent regret upper and lower bounds under a mild assumption on the context distribution.