
Provably Correct SGD-based Exploration for Linear Bandit
Jialin Dong · Lin Yang
Linear bandits are widely applied in sequential decision-making applications, where balancing exploration and exploitation is of paramount importance. Yet existing approaches, e.g., upper confidence bound (UCB) algorithms, often suffer from high per-round computational complexity, can be overly pessimistic, and tend to make sudden updates to the learned policies. In this paper, we take an online stochastic gradient descent (SGD) based approach to perform incremental updates on the estimated parameters. Our approach not only saves memory and computation but also yields smooth updates to the learned policy. Theoretically, we prove that our proposed algorithm achieves $\tilde O(d\sqrt{n})$ regret, where $d$ is the feature dimension and $n$ is the total number of time steps, matching existing near-optimal regret bounds for UCB-type algorithms. Furthermore, our approach does not require the ``diversity'' conditions, which bound the minimum eigenvalue of the covariance matrix, as assumed in related literature. We further conduct experiments to demonstrate the consistently strong performance of our algorithms.
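To make the incremental-update idea concrete, here is a minimal sketch of an SGD-based linear bandit loop. This is not the paper's algorithm: the arm-set size, the epsilon-greedy exploration schedule, the decaying step size, and the noise level are all illustrative assumptions; the point is only that each round costs O(d) memory and computation, with no covariance matrix maintained or inverted as in UCB-type methods.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 2000                       # feature dimension, number of rounds
theta_star = rng.normal(size=d)      # unknown true parameter
theta_star /= np.linalg.norm(theta_star)

theta = np.zeros(d)                  # running estimate; O(d) memory
for t in range(1, n + 1):
    # fresh set of 10 unit-norm candidate arms each round (illustrative)
    arms = rng.normal(size=(10, d))
    arms /= np.linalg.norm(arms, axis=1, keepdims=True)

    # epsilon-greedy exploration (placeholder scheme, not the paper's)
    if rng.random() < 1.0 / np.sqrt(t):
        x = arms[rng.integers(10)]               # explore: random arm
    else:
        x = arms[np.argmax(arms @ theta)]        # exploit: greedy arm

    r = x @ theta_star + 0.1 * rng.normal()      # noisy linear reward
    eta = 1.0 / np.sqrt(t)                       # decaying step size
    theta -= eta * (x @ theta - r) * x           # one SGD step; O(d) work

err = np.linalg.norm(theta - theta_star)
```

Each round touches only the current estimate and the observed reward, which is what makes the policy evolve smoothly rather than jumping when a confidence set is recomputed.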

Author Information

Jialin Dong (University of California, Los Angeles)
Lin Yang (UCLA)
