ICML 2021 Dynamic Planning and Learning under Recovering Rewards Spotlight

Spotlight

Dynamic Planning and Learning under Recovering Rewards

David Simchi-Levi · Zeyu Zheng · Feng Zhu

[ Abstract ] [ Visit Bandits 2 ] [ Paper ]

[ Paper ]

Abstract: Motivated by emerging applications such as live-streaming e-commerce, promotions and recommendations, we introduce a general class of multi-armed bandit problems that have the following two features: (i) the decision maker can pull and collect rewards from at most

K

$K$ out of

N

$N$ different arms in each time period; (ii) the expected reward of an arm immediately drops after it is pulled, and then non-parametrically recovers as the idle time increases. With the objective of maximizing expected cumulative rewards over

T

$T$ time periods, we propose, construct and prove performance guarantees for a class of

Purely Periodic Policies''. For the offline problem when all model parameters are known, our proposed policy obtains an approximation ratio that is at the order of

1 - O (1 / \sqrt{K})

$1-\mathcal O(1/\sqrt{K})$ , which is asymptotically optimal when

K

$K$ grows to infinity. For the online problem when the model parameters are unknown and need to be learned, we design an Upper Confidence Bound (UCB) based policy that approximately has $\widetilde{\mathcal O}(N\sqrt{T}) regret against the offline benchmark. Our framework and policy design may have the potential to be adapted into other offline planning and online learning applications with non-stationary and recovering rewards.

Chat is not available.