T: Online Learning and Bandits

Ballroom 3 & 4

Moderator: Ziteng Sun


Chat is not available.

Thu 21 July 12:30 - 12:35 PDT

Nearly Optimal Catoni’s M-estimator for Infinite Variance

Sujay Bhatt · Guanhua Fang · Ping Li · Gennady Samorodnitsky

In this paper, we extend the remarkable M-estimator of Catoni~\citep{Cat12} to situations where the variance is infinite. In particular, given a sequence of i.i.d random variables~$\{X_i\}_{i=1}^n$ from distribution~$\mathcal{D}$ over~$\mathbb{R}$ with mean~$\mu$, we only assume the existence of a known upper bound~$\upsilon_{\varepsilon} > 0$ on the~$(1+\varepsilon)^{th}$ central moment of the random variables, namely, for~$\varepsilon \in (0,1]$ \[ \mathbb{E}_{X_1 \sim \mathcal{D}} \Big| X_1 - \mu \Big|^{1+\varepsilon} \leq \upsilon_{\varepsilon}. \] The extension is non-trivial owing to the difficulty in characterizing the roots of certain polynomials of degree smaller than~$2$. The proposed estimator has the same order of magnitude and the same asymptotic constant as in~\citet{Cat12}, but for the case of bounded moments. We further propose a version of the estimator that does not require even the knowledge of~$\upsilon_{\varepsilon}$, but adapts the moment bound in a data-driven manner. Finally, to illustrate the usefulness of the derived non-asymptotic confidence bounds, we consider an application in multi-armed bandits and propose best arm identification algorithms, in the fixed confidence setting, that outperform the state of the art.

Thu 21 July 12:35 - 12:40 PDT

Strategies for Safe Multi-Armed Bandits with Logarithmic Regret and Risk

Tianrui Chen · Aditya Gangrade · Venkatesh Saligrama

We investigate a natural but surprisingly unstudied approach to the multi-armed bandit problem under safety risk constraints. Each arm is associated with an unknown law on safety risks and rewards, and the learner's goal is to maximise reward whilst not playing unsafe arms, as determined by a given threshold on the mean risk.We formulate a pseudo-regret for this setting that enforces this safety constraint in a per-round way by softly penalising any violation, regardless of the gain in reward due to the same. This has practical relevance to scenarios such as clinical trials, where one must maintain safety for each round rather than in an aggregated sense.We describe doubly optimistic strategies for this scenario, which maintain optimistic indices for both safety risk and reward. We show that schema based on both frequentist and Bayesian indices satisfy tight gap-dependent logarithmic regret bounds, and further that these play unsafe arms only logarithmically many times in total. This theoretical analysis is complemented by simulation studies demonstrating the effectiveness of the proposed schema, and probing the domains in which their use is appropriate.

Thu 21 July 12:40 - 12:45 PDT

Local Linear Convergence of Douglas-Rachford for Linear Programming: a Probabilistic Analysis

Oisin Faust · Hamza Fawzi

Douglas-Rachford splitting/ADMM (henceforth DRS) is a very popular algorithm for solving convex optimisation problems to low or moderate accuracy, and in particular for solving large-scale linear programs. Despite recent progress, obtaining highly accurate solutions to linear programs with DRS remains elusive. In this paper we analyze the local linear convergence rate $r$ of the DRS method for random linear programs, and give explicit and tight bounds on $r$. We show that $1-r^2$ is typically of the order of $m^{-1}(n-m)^{-1}$, where $n$ is the number of variables and $m$ is the number of constraints.This provides a quantitative explanation for the very slow convergence of DRS/ADMM on random LPs. The proof of our result relies on an established characterisation of the linear rate of convergence as the cosine of the Friedrichs angle between two subspaces associated to the problem. We also show that the cosecant of this angle can be interpreted as a condition number for the LP.The proof of our result relies on a characterization of the linear rate of convergence as the cosine of the Friedrichs angle between two subspaces associated to the problem. We also show that the cosecant of this angle can be interpreted as a condition number for the LP.

Thu 21 July 12:45 - 12:50 PDT

Contextual Information-Directed Sampling

Botao Hao · Tor Lattimore · Chao Qin

Information-directed sampling (IDS) has recently demonstrated its potential as a data-efficient reinforcement learning algorithm. However, it is still unclear what is the right form of information ratio to optimize when contextual information is available. We investigate the IDS design through two contextual bandit problems: contextual bandits with graph feedback and sparse linear contextual bandits. We provably demonstrate the advantage of \emph{contextual IDS} over \emph{conditional IDS} and emphasize the importance of considering the context distribution. The main message is that an intelligent agent should invest more on the actions that are beneficial for the future unseen contexts while the conditional IDS can be myopic. We further propose a computationally-efficient version of contextual IDS based on Actor-Critic and evaluate it empirically on a neural network contextual bandit.

Thu 21 July 12:50 - 12:55 PDT

Breaking the $\sqrt{T}$ Barrier: Instance-Independent Logarithmic Regret in Stochastic Contextual Linear Bandits

Avishek Ghosh · Abishek Sankararaman

We prove an instance independent (poly) logarithmic regret for stochastic contextual bandits with linear payoff. Previously, in \cite{chu2011contextual}, a lower bound of $\mathcal{O}(\sqrt{T})$ is shown for the contextual linear bandit problem with arbitrary (adversarily chosen) contexts. In this paper, we show that stochastic contexts indeed help to reduce the regret from $\sqrt{T}$ to $\polylog(T)$. We propose Low Regret Stochastic Contextual Bandits (\texttt{LR-SCB}), which takes advantage of the stochastic contexts and performs parameter estimation (in $\ell_2$ norm) and regret minimization simultaneously. \texttt{LR-SCB} works in epochs, where the parameter estimation of the previous epoch is used to reduce the regret of the current epoch. The (poly) logarithmic regret of \texttt{LR-SCB} stems from two crucial facts: (a) the application of a norm adaptive algorithm to exploit the parameter estimation and (b) an analysis of the shifted linear contextual bandit algorithm, showing that shifting results in increasing regret. We have also shown experimentally that stochastic contexts indeed incurs a regret that scales with $\polylog(T)$.

Thu 21 July 12:55 - 13:00 PDT

Universal and data-adaptive algorithms for model selection in linear contextual bandits

Vidya Muthukumar · Akshay Krishnamurthy

Model selection in contextual bandits is an important complementary problem to regret minimization with respect to a fixed model class. We consider the simplest non-trivial instance of model-selection: distinguishing a simple multi-armed bandit problem from a linear contextual bandit problem. Even in this instance, current state-of-the-art methods explore in a suboptimal manner and require strong "feature-diversity" conditions. In this paper, we introduce new algorithms that a) explore in a data-adaptive manner, and b) provide model selection guarantees of the form O(d^{\alpha} T^{1 - \alpha}) with no feature diversity conditions whatsoever, where d denotes the dimension of the linear model and T denotes the total number of rounds. The first algorithm enjoys a "best-of-both-worlds" property, recovering two prior results that hold under distinct distributional assumptions, simultaneously. The second removes distributional assumptions altogether, expanding the scope for tractable model selection. Our approach extends to model selection among nested linear contextual bandits under some additional assumptions.

Thu 21 July 13:00 - 13:05 PDT

Regret Minimization with Performative Feedback

Meena Jagadeesan · Tijana Zrnic · Celestine Mendler-Dünner

In performative prediction, the deployment of a predictive model triggers a shift in the data distribution. As these shifts are typically unknown ahead of time, the learner needs to deploy a model to get feedback about the distribution it induces. We study the problem of finding near-optimal models under performativity while maintaining low regret. On the surface, this problem might seem equivalent to a bandit problem. However, it exhibits a fundamentally richer feedback structure that we refer to as performative feedback: after every deployment, the learner receives samples from the shifted distribution rather than bandit feedback about the reward. Our main contribution is regret bounds that scale only with the complexity of the distribution shifts and not that of the reward function. The key algorithmic idea is careful exploration of the distribution shifts that informs a novel construction of confidence bounds on the risk of unexplored models. The construction only relies on smoothness of the shifts and does not assume convexity. More broadly, our work establishes a conceptual approach for leveraging tools from the bandits literature for the purpose of regret minimization with performative feedback.

Thu 21 July 13:05 - 13:25 PDT

A Simple yet Universal Strategy for Online Convex Optimization

Lijun Zhang · Guanghui Wang · Jinfeng Yi · Tianbao Yang

Recently, several universal methods have been proposed for online convex optimization, and attain minimax rates for multiple types of convex functions simultaneously. However, they need to design and optimize one surrogate loss for each type of functions, making it difficult to exploit the structure of the problem and utilize existing algorithms. In this paper, we propose a simple strategy for universal online convex optimization, which avoids these limitations. The key idea is to construct a set of experts to process the original online functions, and deploy a meta-algorithm over the linearized losses to aggregate predictions from experts. Specifically, the meta-algorithm is required to yield a second-order bound with excess losses, so that it can leverage strong convexity and exponential concavity to control the meta-regret. In this way, our strategy inherits the theoretical guarantee of any expert designed for strongly convex functions and exponentially concave functions, up to a double logarithmic factor. As a result, we can plug in off-the-shelf online solvers as black-box experts to deliver problem-dependent regret bounds. For general convex functions, it maintains the minimax optimality and also achieves a small-loss bound.

Thu 21 July 13:25 - 13:30 PDT

Deep Hierarchy in Bandits

Joey Hong · Branislav Kveton · Sumeet Katariya · Manzil Zaheer · Mohammad Ghavamzadeh

Mean rewards of actions are often correlated. The form of these correlations may be complex and unknown a priori, such as the preferences of users for recommended products and their categories. To maximize statistical efficiency, it is important to leverage these correlations when learning. We formulate a bandit variant of this problem where the correlations of mean action rewards are represented by a hierarchical Bayesian model with latent variables. Since the hierarchy can have multiple layers, we call it deep. We propose a hierarchical Thompson sampling algorithm (HierTS) for this problem and show how to implement it efficiently for Gaussian hierarchies. The efficient implementation is possible due to a novel exact hierarchical representation of the posterior, which itself is of independent interest. We use this exact posterior to analyze the Bayes regret of HierTS. Our regret bounds reflect the structure of the problem, that the regret decreases with more informative priors, and can be recast to highlight reduced dependence on the number of actions. We confirm these theoretical findings empirically, in both synthetic and real-world experiments.

Thu 21 July 13:30 - 13:35 PDT

Distributionally-Aware Kernelized Bandit Problems for Risk Aversion

Sho Takemori

The kernelized bandit problem is a theoretically justified framework and has solid applications to various fields. Recently, there is a growing interest in generalizing the problem to the optimization of risk-averse metrics such as Conditional Value-at-Risk (CVaR) or Mean-Variance (MV).However, due to the model assumption, most existing methods need explicit design of environment random variables and can incur large regret because of possible high dimensionality of them.To address the issues, in this paper, we model environments using a family of the output distributions (or more precisely, probability kernel) and Kernel Mean Embeddings (KME), and provide novel UCB-type algorithms for CVaR and MV.Moreover, we provide algorithm-independent lower bounds for CVaR in the case of Mat\'ern kernels, and propose a nearly optimal algorithm.Furthermore, we empirically verify our theoretical result in synthetic environments, and demonstrate that our proposed method significantly outperforms a baseline in many cases.

Thu 21 July 13:35 - 13:40 PDT

Asymptotically-Optimal Gaussian Bandits with Side Observations

Alexia Atsidakou · Orestis Papadigenopoulos · Constantine Caramanis · Sujay Sanghavi · Sanjay Shakkottai

We study the problem of Gaussian bandits with general side information, as first introduced by Wu, Szepesv\'{a}ri, and Gy\"{o}rgy. In this setting, the play of an arm reveals information about other arms, according to an arbitrary {\em a priori} known {\em side information} matrix: each element of this matrix encodes the fidelity of the information that the row" arm reveals about thecolumn" arm. In the case of Gaussian noise, this model subsumes standard bandits, full-feedback, and graph-structured feedback as special cases. In this work, we first construct an LP-based asymptotic instance-dependent lower bound on the regret. The LP optimizes the cost (regret) required to reliably estimate the suboptimality gap of each arm. This LP lower bound motivates our main contribution: the first known asymptotically optimal algorithm for this general setting.

Thu 21 July 13:40 - 13:45 PDT

Learning from a Learning User for Optimal Recommendations

Fan Yao · Chuanhao Li · Denis Nekipelov · Hongning Wang · Haifeng Xu

In real-world recommendation problems, especially those with a formidably large item space, users have to gradually learn to estimate the utility of any fresh recommendations from their experience about previously consumed items. This in turn affects their interaction dynamics with the system and can invalidate previous algorithms built on the omniscient user assumption. In this paper, we formalize a model to capture such ''learning users'' and design an efficient system-side learning solution, coined Noise-Robust Active Ellipsoid Search (RAES), to confront the challenges brought by the non-stationary feedback from such a learning user. Interestingly, we prove that the regret of RAES deteriorates gracefully as the convergence rate of user learning becomes worse, until reaching linear regret when the user's learning fails to converge. Experiments on synthetic datasets demonstrate the strength of RAES for such a contemporaneous system-user learning problem. Our study provides a novel perspective on modeling the feedback loop in recommendation problems.

Thu 21 July 13:45 - 13:50 PDT

Thresholded Lasso Bandit

Kaito Ariu · Kenshi Abe · Alexandre Proutiere

In this paper, we revisit the regret minimization problem in sparse stochastic contextual linear bandits, where feature vectors may be of large dimension $d$, but where the reward function depends on a few, say $s_0\ll d$, of these features only. We present Thresholded Lasso bandit, an algorithm that (i) estimates the vector defining the reward function as well as its sparse support, i.e., significant feature elements, using the Lasso framework with thresholding, and (ii) selects an arm greedily according to this estimate projected on its support. The algorithm does not require prior knowledge of the sparsity index $s_0$ and can be parameter-free under some symmetric assumptions. For this simple algorithm, we establish non-asymptotic regret upper bounds scaling as $\mathcal{O}( \log d + \sqrt{T} )$ in general, and as $\mathcal{O}( \log d + \log T)$ under the so-called margin condition (a probabilistic condition on the separation of the arm rewards). The regret of previous algorithms scales as $\mathcal{O}( \log d + \sqrt{T \log (d T)})$ and $\mathcal{O}( \log T \log d)$ in the two settings, respectively. Through numerical experiments, we confirm that our algorithm outperforms existing methods.

Thu 21 July 13:50 - 13:55 PDT

Versatile Dueling Bandits: Best-of-both World Analyses for Learning from Relative Preferences

Aadirupa Saha · Pierre Gaillard

We study the problem of $K$-armed dueling bandit for both stochastic and adversarial environments, where the goal of the learner is to aggregate information through relative preferences of pair of decision points queried in an online sequential manner. We first propose a novel reduction from any (general) dueling bandits to multi-armed bandits which allows us to improve many existing results in dueling bandits. In particular, \emph{we give the first best-of-both world result for the dueling bandits regret minimization problem}---a unified framework that is guaranteed to perform optimally for both stochastic and adversarial preferences simultaneously. Moreover, our algorithm is also the first to achieve an optimal $O(\sum_{i = 1}^K \frac{\log T}{\Delta_i})$ regret bound against the Condorcet-winner benchmark, which scales optimally both in terms of the arm-size $K$ and the instance-specific suboptimality gaps $\{\Delta_i\}_{i = 1}^K$. This resolves the long standing problem of designing an instancewise gap-dependent order optimal regret algorithm for dueling bandits (with matching lower bounds up to small constant factors). We further justify the robustness of our proposed algorithm by proving its optimal regret rate under adversarially corrupted preferences---this outperforms the existing state-of-the-art corrupted dueling results by a large margin. In summary, we believe our reduction idea will find a broader scope in solving a diverse class of dueling bandits setting, which are otherwise studied separately from multi-armed bandits with often more complex solutions and worse guarantees. The efficacy of our proposed algorithms are empirically corroborated against state-of-the art dueling bandit methods.

Thu 21 July 13:55 - 14:00 PDT

Decentralized Online Convex Optimization in Networked Systems

Yiheng Lin · Judy Gan · Guannan Qu · Yash Kanoria · Adam Wierman

We study the problem of networked online convex optimization, where each agent individually decides on an action at every time step and agents cooperatively seek to minimize the total global cost over a finite horizon. The global cost is made up of three types of local costs: convex node costs, temporal interaction costs, and spatial interaction costs. In deciding their individual action at each time, an agent has access to predictions of local cost functions for the next $k$ time steps in an $r$-hop neighborhood. Our work proposes a novel online algorithm, Localized Predictive Control (LPC), which generalizes predictive control to multi-agent systems. We show that LPC achieves a competitive ratio of $1 + \tilde{O}(\rho_T^k) + \tilde{O}(\rho_S^r)$ in an adversarial setting, where $\rho_T$ and $\rho_S$ are constants in $(0, 1)$ that increase with the relative strength of temporal and spatial interaction costs, respectively. This is the first competitive ratio bound on decentralized predictive control for networked online convex optimization. Further, we show that the dependence on $k$ and $r$ in our results is near optimal by lower bounding the competitive ratio of any decentralized online algorithm.