Abstract:
Regret minimization in streaming multi-armed bandits (MABs) has been studied extensively, and recent work has shown that algorithms with $o(K)$ memory must incur $\Omega(T^{2/3})$ regret, where $K$ and $T$ are the numbers of arms and trials, respectively. However, the previous best regret upper bound is still $O(K^{1/3} T^{2/3}\log^{1/3}(T))$, achieved by the simple uniform exploration algorithm. In this paper, we close this gap and complete the picture of regret minimization in single-pass streaming MABs. We first improve the regret lower bound to $\Omega(K^{1/3}T^{2/3})$ for algorithms with $o(K)$ memory. We then show that the $\log^{1/3}(T)$ factor is not necessary by designing algorithms with at most $O(\log^*(K))$-arm memory that achieve $O(K^{1/3}T^{2/3})$ expected regret, building on streaming $\varepsilon$-best arm algorithms. We further evaluate the empirical performance of our algorithms on simulated MAB instances, where the proposed algorithms outperform the benchmark uniform exploration algorithm by a large margin, reducing the regret by up to 70% in some cases.
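For context, the following is a minimal sketch of the uniform-exploration baseline referenced above, assuming Bernoulli arms arriving in a single pass; the arm means, the per-arm budget constant, and the simulation harness are illustrative assumptions, not the paper's implementation. It keeps only one champion arm in memory, which is how the baseline fits the $o(K)$-memory streaming model.

```python
import numpy as np

def uniform_exploration_streaming(arm_means, T, rng):
    """Single-pass uniform-exploration baseline for streaming MABs.

    Stores only the best empirical arm seen so far (one-arm memory):
    each arriving arm is pulled a fixed number of times, then either
    replaces the stored champion or is discarded; all remaining pulls
    go to the champion.
    """
    K = len(arm_means)
    # Per-arm exploration budget chosen so total regret scales as
    # O(K^{1/3} T^{2/3} log^{1/3} T); the exact constant is illustrative.
    pulls_per_arm = max(1, int((T / K) ** (2 / 3) * np.log(T) ** (1 / 3)))

    best_mean_est, best_arm = -np.inf, None
    reward, pulls_used = 0.0, 0

    for arm in range(K):                                   # arms arrive as a stream
        n = min(pulls_per_arm, T - pulls_used)
        if n == 0:
            break
        samples = rng.binomial(1, arm_means[arm], size=n)  # Bernoulli rewards (assumed)
        reward += samples.sum()
        pulls_used += n
        if samples.mean() > best_mean_est:                 # keep only the champion arm
            best_mean_est, best_arm = samples.mean(), arm

    # Commit to the stored champion for the remaining trials.
    remaining = T - pulls_used
    if remaining > 0 and best_arm is not None:
        reward += rng.binomial(1, arm_means[best_arm], size=remaining).sum()

    return T * max(arm_means) - reward                     # realized regret

rng = np.random.default_rng(0)
means = rng.uniform(0.1, 0.9, size=100)                    # hypothetical 100-arm instance
print(uniform_exploration_streaming(means, T=100_000, rng=rng))
```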