ICML 2021 · Spotlight

Optimal regret algorithm for Pseudo-1d Bandit Convex Optimization

Aadirupa Saha · Nagarajan Natarajan · Praneeth Netrapalli · Prateek Jain

Abstract: We study online learning with bandit feedback (i.e., the learner has access only to a zeroth-order oracle) where the cost/reward functions $f_t$ admit a "pseudo-1d" structure, i.e., $f_t(\mathbf{w}) = \mathrm{loss}_t(\mathrm{pred}_t(\mathbf{w}))$, where the output of $\mathrm{pred}_t$ is one-dimensional. At each round, the learner observes a context $\mathbf{x}_t$, plays a prediction $\mathrm{pred}_t(\mathbf{w}_t; \mathbf{x}_t)$ (e.g., $\mathrm{pred}_t(\cdot) = \langle \mathbf{x}_t, \cdot \rangle$) for some $\mathbf{w}_t \in \mathbb{R}^d$, and observes the loss $\mathrm{loss}_t(\mathrm{pred}_t(\mathbf{w}_t))$, where $\mathrm{loss}_t$ is a convex Lipschitz-continuous function. The goal is to minimize the standard regret metric. This pseudo-1d bandit convex optimization problem (\SBCO) arises frequently in domains such as online decision-making or parameter-tuning in large systems. For this problem, we first show a regret lower bound of $\min(\sqrt{dT}, T^{3/4})$ for any algorithm, where $T$ is the number of rounds. We propose a new algorithm, \sbcalg, that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively, guaranteeing the optimal regret bound mentioned above, up to additional logarithmic factors. In contrast, applying state-of-the-art online convex optimization methods leads to $\tilde{O}\left(\min\left(d^{9.5}\sqrt{T}, \sqrt{d}\,T^{3/4}\right)\right)$ regret, which is significantly suboptimal in terms of $d$.
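To make the interaction protocol in the abstract concrete, here is a minimal Python sketch of one pseudo-1d bandit loop. It is not the algorithm proposed in the paper: it uses a generic one-point gradient estimate with projected online gradient descent as a stand-in learner. The linear prediction $\langle \mathbf{x}_t, \mathbf{w} \rangle$ follows the example in the abstract; the absolute-error loss, the hidden target `w_star`, the unit-ball domain, and the parameters `delta` and `eta` are all assumptions made for illustration.

```python
import numpy as np


def project_to_ball(w, radius=1.0):
    """Euclidean projection onto the ball of the given radius."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)


def run_pseudo_1d_bandit(T=2000, d=5, delta=0.2, eta=0.01, seed=0):
    """Simulate T rounds of the pseudo-1d bandit protocol.

    Assumed (hypothetical) setup, not taken from the paper:
      * pred_t(w) = <x_t, w>, with unit-norm contexts x_t
      * loss_t(y) = |y - <x_t, w_star>|, convex and 1-Lipschitz in y
      * learner: one-point gradient estimate + projected gradient descent
    Returns the array of per-round observed losses.
    """
    rng = np.random.default_rng(seed)
    w_star = project_to_ball(rng.normal(size=d), 0.5)  # hidden target (assumption)
    w = np.zeros(d)
    losses = np.empty(T)

    for t in range(T):
        # 1. The learner observes the context x_t.
        x = rng.normal(size=d)
        x /= np.linalg.norm(x)

        # 2. The learner plays a randomly perturbed point and its 1-d prediction.
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)           # uniform direction on the unit sphere
        w_play = w + delta * u
        y = x @ w_play                   # one-dimensional prediction

        # 3. Bandit feedback: only the scalar loss value is revealed.
        loss = abs(y - x @ w_star)
        losses[t] = loss

        # 4. One-point gradient estimate, then a projected descent step.
        #    Projecting onto radius 1 - delta keeps w_play inside the unit ball.
        grad_est = (d / delta) * loss * u
        w = project_to_ball(w - eta * grad_est, 1.0 - delta)

    return losses


if __name__ == "__main__":
    losses = run_pseudo_1d_bandit()
    print("mean loss, first 200 rounds:", losses[:200].mean())
    print("mean loss, last 200 rounds: ", losses[-200:].mean())
```

The learner never sees `w_star` or the gradient of the loss, only the scalar `loss` for the point it actually played, which is the zeroth-order access described in the abstract. This stand-in learner ignores the one-dimensional structure of the prediction, which is the structure the paper's method is designed to exploit.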
