The Cost of Information: Phase Transitions in Contextual Bandits with Paid Observations
Abstract
Contextual bandits serve as a foundational framework for sequential decision-making in domains such as recommendation systems, IoT device management, and conversational AI, yet classical models overlook a critical practical constraint: acquiring additional observations incurs non-trivial costs, creating an unaddressed trade-off between information gain and expenditure. To fill this gap, we study contextual bandits with paid observations, where in each round the learner actively chooses which actions to observe (each at a specified cost), with the goal of minimizing a total regret that combines learning losses and cumulative observation costs. We first design a near-optimal algorithm for adversarial environments and prove that it achieves a regret rate strictly higher than that of cost-free contextual bandits, even when observation costs are small, thereby quantifying how paid observations reshape the complexity of learning. We then uncover a critical phase-transition phenomenon when free observation budgets are incorporated: below a threshold budget, free observations merely reduce total costs without changing the underlying regret rate, whereas above this threshold they drastically improve learning efficiency, lowering the regret rate to match that of the cost-free setting. To exploit this phenomenon, we develop a meta-controller that adaptively switches strategies based on the available budget, ensuring near-optimal performance in both the low- and high-budget regimes. Furthermore, to address practical challenges such as infinite policy spaces and computational inefficiency, we propose an oracle-efficient algorithm under a function-approximation framework, which leverages an online regression oracle to maintain strong performance for stochastic losses. Our results also shed light on related settings involving switching costs, budget constraints, model misspecification, and the trade-offs arising in bandits-with-knapsacks problems.
Finally, we conduct numerical experiments that validate our theoretical findings and demonstrate the practical efficiency of our algorithms.

Key words: Contextual bandit, paid observation, function approximation, phase transition.