Workshop: Complex feedback in online learning

Provably Correct SGD-based Exploration for Linear Bandit

Jialin Dong · Lin Yang

Abstract: Linear bandits are commonly applied to applications with sequential decision-making, where balancing the exploration and exploitation is of paramount importance. Yet existing approaches, e.g., upper confidence bound (UCB) algorithms, usually suffer from large computation complexity per round, can be overly pessimistic, and usually make sudden updates to the learned policies. In this paper, we take an online stochastic gradient descent (SGD) based approach to perform incremental updates on the estimated parameters. Our approach not only saves memory and computation but also gives smooth update to the learned policy. Theoretically, we prove that our proposed algorithm can achieve $\tilde O(d\sqrt{n})$ regret, where $n$ is the number of total time steps of the algorithm, matching existing near-optimal regret bounds in UCB-type algorithms. Furthermore, our approach does not require the ``diversity'' conditions, which bound the minimum eigenvalues of the covariance matrix, as presented in related literature. We further conduct experiments to demonstrate the consistently superb performance of our algorithms.

Chat is not available.