Track: T: Bandits/Online Learning/Reinforcement Learning

Thu 21 July 7:30 - 7:50 PDT

Oral

First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach

Andrew Wagenmaker ⋅ Yifang Chen ⋅ Max Simchowitz ⋅ Simon Du ⋅ Kevin Jamieson

Obtaining first-order regret bounds---regret bounds scaling not as the worst-case but with some measure of the performance of the optimal policy on a given instance---is a core question in sequential decision-making. While such bounds exist in many settings, they have proven elusive in reinforcement learning with large state spaces. In this work we address this gap, and show that it is possible to obtain regret scaling as $\widetilde{\mathcal{O}}(\sqrt{d^3 H^3 \cdot V_1^\star \cdot K} + d^{3.5}H^3\log K )$ in reinforcement learning with large state spaces, namely the linear MDP setting. Here $V_1^\star$ is the value of the optimal policy and $K$ is the number of episodes. We demonstrate that existing techniques based on least squares estimation are insufficient to obtain this result, and instead develop a novel robust self-normalized concentration bound based on the robust Catoni mean estimator, which may be of independent interest.
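For readers unfamiliar with the Catoni estimator named above, the following is a minimal scalar sketch of it, not the paper's self-normalized regression bound. The function names, the bisection solver, and the scale parameter `alpha` are assumptions made for illustration.

```python
import math

def catoni_psi(x: float) -> float:
    """Catoni's influence function: sign(x) * log(1 + |x| + x^2 / 2).

    It grows only logarithmically, so a single extreme sample has limited pull.
    """
    return math.copysign(math.log1p(abs(x) + 0.5 * x * x), x)

def catoni_mean(samples, alpha, tol=1e-10):
    """Return theta solving sum_i psi(alpha * (x_i - theta)) = 0 by bisection.

    `alpha` > 0 is a hypothetical tuning parameter (smaller = closer to the
    empirical mean, larger = more robust).
    """
    lo, hi = min(samples), max(samples)

    def score(theta):
        return sum(catoni_psi(alpha * (x - theta)) for x in samples)

    # score is decreasing in theta, with score(lo) >= 0 >= score(hi).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Usage: one heavy outlier drags the empirical mean far more than the Catoni estimate.
data = [1.0] * 99 + [1000.0]
empirical = sum(data) / len(data)
robust = catoni_mean(data, alpha=0.1)
```

The estimate stays close to the bulk of the data because the outlier contributes only a logarithmic term to the score equation.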

Thu 21 July 7:50 - 7:55 PDT

Spotlight

Generic Coreset for Scalable Learning of Monotonic Kernels: Logistic Regression, Sigmoid and more

Elad Tolochinsky ⋅ Ibrahim Jubran ⋅ Dan Feldman

A coreset (or core-set) is a small weighted \emph{subset} $Q$ of an input set $P$ with respect to a given \emph{monotonic} function $f:\mathbb{R}\to\mathbb{R}$ that \emph{provably} approximates its fitting loss $\sum_{p\in P}f(p\cdot x)$ for \emph{any} given $x\in\mathbb{R}^d$. Using $Q$ we can obtain an approximation of the $x^*$ that minimizes this loss, by running \emph{existing} optimization algorithms on $Q$. In this work we provide: (i) A lower bound which proves that there are sets with no coresets smaller than $n=|P|$ for general monotonic loss functions. (ii) A proof that, with an additional common regularization term and under a natural assumption that holds e.g. for logistic regression and the sigmoid activation functions, a small coreset exists for \emph{any} input $P$. (iii) A generic coreset construction algorithm that computes such a small coreset $Q$ in $O(nd+n\log n)$ time, and (iv) Experimental results with open-source code which demonstrate that our coresets are effective and are much smaller in practice than predicted in theory.
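As a rough illustration of what "a weighted subset that approximates the fitting loss" means, the sketch below evaluates $\sum_{p} w_p f(p\cdot x)$ on a full set and on a weighted sample. The uniform sampler is a stand-in chosen for brevity and carries no approximation guarantee; it is not the construction from the paper, and all function names are hypothetical.

```python
import math
import random

def sigmoid(z: float) -> float:
    """A monotonic function f; inputs are assumed moderate to avoid overflow."""
    return 1.0 / (1.0 + math.exp(-z))

def fitting_loss(points, weights, x, f=sigmoid):
    """Weighted fitting loss: sum_i w_i * f(p_i . x)."""
    return sum(
        w * f(sum(pi * xi for pi, xi in zip(p, x)))
        for p, w in zip(points, weights)
    )

def uniform_sample_subset(points, m, rng):
    """Stand-in sampler (NOT the paper's algorithm): draw m points uniformly
    with replacement and weight each by n / m so total weight equals n."""
    n = len(points)
    chosen = [points[rng.randrange(n)] for _ in range(m)]
    return chosen, [n / m] * m

# Usage: compare the loss on P with the loss on a small weighted subset Q.
rng = random.Random(0)
P = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(1000)]
x = (0.3, -0.7)
full = fitting_loss(P, [1.0] * len(P), x)
Q, w = uniform_sample_subset(P, 100, rng)
approx = fitting_loss(Q, w, x)
```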

Thu 21 July 7:55 - 8:00 PDT

Spotlight

Shuffle Private Linear Contextual Bandits

Sayak Ray Chowdhury ⋅ Xingyu Zhou

Differential privacy (DP) has been recently introduced to linear contextual bandits to formally address the privacy concerns in their associated personalized services to participating users (e.g., recommendations). Prior work largely focuses on two trust models of DP -- the central model, where a central server is responsible for protecting users' sensitive data, and the (stronger) local model, where information needs to be protected directly on the users' side. However, there remains a fundamental gap in the utility achieved by learning algorithms under these two privacy models, e.g., if all users are \emph{unique} within a learning horizon $T$, $\widetilde{O}(\sqrt{T})$ regret in the central model as compared to $\widetilde{O}(T^{3/4})$ regret in the local model. In this work, we aim to achieve a stronger model of trust than the central model, while suffering a smaller regret than the local model, by considering the recently popular \emph{shuffle} model of privacy. We propose a general algorithmic framework for linear contextual bandits under the shuffle trust model, where there exists a trusted shuffler -- in between users and the central server -- that randomly permutes a batch of users' data before sending it to the server. We then instantiate this framework with two specific shuffle protocols -- one relying on privacy amplification of local mechanisms, and another incorporating a protocol for summing vectors and matrices of bounded norms. We prove that both these instantiations lead to regret guarantees that significantly improve on that of the local model, and can potentially be of the order $\widetilde{O}(T^{3/5})$ if all users are unique. We also verify this regret behavior with simulations on synthetic data. Finally, under the practical scenario of non-unique users, we show that the regret of our shuffle private algorithm scales as $\widetilde{O}(T^{2/3})$, which \emph{matches} what the central model could achieve in this case.
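The data flow of the shuffle trust model described above (users, then a trusted shuffler, then the server) can be sketched as follows. This is a structural illustration only: the Gaussian perturbation is an uncalibrated placeholder rather than a differentially private mechanism, scalars stand in for the vectors and matrices in the paper, and every function name is hypothetical.

```python
import random

def local_randomizer(value: float, noise_scale: float, rng: random.Random) -> float:
    """Each user perturbs their own value before it leaves the device.

    Placeholder noise; a real protocol would calibrate this to a privacy budget.
    """
    return value + rng.gauss(0.0, noise_scale)

def shuffler(messages, rng: random.Random):
    """Trusted shuffler: uniformly permutes a batch so the server cannot
    link a message to the user who sent it."""
    shuffled = list(messages)
    rng.shuffle(shuffled)
    return shuffled

def server_aggregate(messages) -> float:
    """The server only ever sees the shuffled batch and sums it."""
    return sum(messages)

def run_batch(user_values, noise_scale: float, seed: int = 0) -> float:
    rng = random.Random(seed)
    reports = [local_randomizer(v, noise_scale, rng) for v in user_values]
    return server_aggregate(shuffler(reports, rng))

# Usage: the aggregate is unaffected by the permutation itself.
total = run_batch([1.0, 2.0, 3.0], noise_scale=0.0)
```

Because the server's statistic is a sum, shuffling costs no accuracy; the benefit is that each user can add less noise than in the local model for the same overall privacy level.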

Thu 21 July 8:00 - 8:05 PDT

Spotlight

Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity

Laixi Shi ⋅ Gen Li ⋅ Yuting Wei ⋅ Yuxin Chen ⋅ Yuejie Chi

Offline or batch reinforcement learning seeks to learn a near-optimal policy using history data without active exploration of the environment. To counter the insufficient coverage and sample scarcity of many offline datasets, the principle of pessimism has been recently introduced to mitigate high bias of the estimated values. While pessimistic variants of model-based algorithms (e.g., value iteration with lower confidence bounds) have been theoretically investigated, their model-free counterparts --- which do not require explicit model estimation --- have not been adequately studied, especially in terms of sample efficiency. To address this inadequacy, we study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes, and characterize its sample complexity under the single-policy concentrability assumption which does not require the full coverage of the state-action space. In addition, a variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity. Altogether, this work highlights the efficiency of model-free algorithms in offline RL when used in conjunction with pessimism and variance reduction.
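A minimal sketch of how pessimism can enter a tabular Q-learning update on offline data is given below. The step size $(H+1)/(H+n)$ and the penalty $c_b/\sqrt{n}$ are assumptions chosen for illustration; the paper's exact constants and its variance-reduced variant are not reproduced here, and `lcb_q_learning` is a hypothetical name.

```python
import math
from collections import defaultdict

def lcb_q_learning(dataset, H, num_states, num_actions, c_b=0.5):
    """Pessimistic Q-learning sketch for a finite-horizon tabular MDP.

    dataset: iterable of trajectories, each a list of
             (h, s, a, r, s_next) tuples with h in 0..H-1.
    """
    Q = [[[0.0] * num_actions for _ in range(num_states)] for _ in range(H)]
    V = [[0.0] * num_states for _ in range(H + 1)]  # V[H] is the terminal value 0
    counts = defaultdict(int)

    for trajectory in dataset:
        for (h, s, a, r, s_next) in trajectory:
            counts[(h, s, a)] += 1
            n = counts[(h, s, a)]
            eta = (H + 1) / (H + n)        # assumed step-size schedule
            penalty = c_b / math.sqrt(n)   # assumed lower-confidence penalty
            target = r + V[h + 1][s_next] - penalty
            Q[h][s][a] = (1 - eta) * Q[h][s][a] + eta * target
            # Keep the value estimate monotone, as in lower-confidence-bound schemes.
            V[h][s] = max(V[h][s], max(Q[h][s]))
    return Q, V

# Usage: two actions with identical rewards, but action 1 is observed only once.
data = [[(0, 0, 0, 1.0, 0)] for _ in range(100)] + [[(0, 0, 1, 1.0, 0)]]
Q, V = lcb_q_learning(data, H=1, num_states=1, num_actions=2)
```

In the usage example the rarely observed action receives the larger penalty, so the learned values favor the well-covered action even though both have the same reward.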

Thu 21 July 8:05 - 8:10 PDT

Spotlight

Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes

Andrew Wagenmaker ⋅ Yifang Chen ⋅ Max Simchowitz ⋅ Simon Du ⋅ Kevin Jamieson

Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration, but must propose a near-optimal policy for an arbitrary reward function revealed only after exploring. In the tabular setting, it is well known that this is a more difficult problem than reward-aware (PAC) RL---where the agent has access to the reward function during exploration---with optimal sample complexities in the two settings differing by a factor of $|\mathcal{S}|$, the size of the state space. We show that this separation does not exist in the setting of linear MDPs. We first develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP with sample complexity scaling as $\widetilde{\mathcal{O}}(d^2 H^5/\epsilon^2)$. We then show a lower bound with matching dimension-dependence of $\Omega(d^2 H^2/\epsilon^2)$, which holds for the reward-aware RL setting. To our knowledge, our approach is the first computationally efficient algorithm to achieve optimal $d$ dependence in linear MDPs, even in the single-reward PAC setting. Our algorithm relies on a novel procedure which efficiently traverses a linear MDP, collecting samples in any given ``feature direction'', and enjoys a sample complexity scaling optimally in the (linear MDP equivalent of the) maximal state visitation probability. We show that this exploration procedure can also be applied to solve the problem of obtaining ``well-conditioned'' covariates in linear MDPs.

Thu 21 July 8:10 - 8:30 PDT

Oral

Label Ranking through Nonparametric Regression

Dimitris Fotakis ⋅ Alkis Kalavasis ⋅ Eleni Psaroudaki

Label Ranking (LR) corresponds to the problem of learning a hypothesis that maps features to rankings over a finite set of labels. We adopt a nonparametric regression approach to LR and obtain theoretical performance guarantees for this fundamental practical problem. We introduce a generative model for Label Ranking, in noiseless and noisy nonparametric regression settings, and provide sample complexity bounds for learning algorithms in both cases. In the noiseless setting, we study the LR problem with full rankings and provide computationally efficient algorithms using decision trees and random forests in the high-dimensional regime. In the noisy setting, we consider the more general cases of LR with incomplete and partial rankings from a statistical viewpoint and obtain sample complexity bounds using the One-Versus-One approach of multiclass classification. Finally, we complement our theoretical contributions with experiments, aiming to understand how the input regression noise affects the observed output.

Thu 21 July 8:30 - 8:35 PDT

Spotlight

Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost

Dan Qiao ⋅ Ming Yin ⋅ Ming Min ⋅ Yu-Xiang Wang

We study the problem of reinforcement learning (RL) with low (policy) switching cost, a problem well-motivated by real-life RL applications in which deployments of new policies are costly and the number of policy updates must be low. In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of $\widetilde{O}(\sqrt{H^4S^2AT})$ while requiring a switching cost of $O(HSA \log\log T)$. This is an exponential improvement over the best-known switching cost $O(H^2SA\log T)$ among existing methods with $\widetilde{O}(\mathrm{poly}(H,S,A)\sqrt{T})$ regret. In the above, $S$ and $A$ denote the numbers of states and actions in an $H$-horizon episodic Markov Decision Process model with unknown transitions, and $T$ is the number of steps. As a byproduct of our new techniques, we also derive a reward-free exploration algorithm with a switching cost of $O(HSA)$. Furthermore, we prove a pair of information-theoretical lower bounds which say that (1) any no-regret algorithm must have a switching cost of $\Omega(HSA)$; (2) any $\widetilde{O}(\sqrt{T})$ regret algorithm must incur a switching cost of $\Omega(HSA\log\log T)$. Both our algorithms are thus optimal in their switching costs.
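To give a feel for the gap between the two switching-cost rates, the short calculation below evaluates the leading terms $H^2SA\log T$ and $HSA\log\log T$ for one hypothetical problem size. Big-O constants are ignored, so the numbers indicate scaling only and are not actual switch counts; the function name is an assumption.

```python
import math

def switching_cost_terms(H: int, S: int, A: int, T: int):
    """Leading terms of the two bounds, with all hidden constants set to 1."""
    prior = H ** 2 * S * A * math.log(T)        # O(H^2 S A log T)
    new = H * S * A * math.log(math.log(T))     # O(H S A log log T)
    return prior, new

# Usage with illustrative sizes: H = 10, S = 20, A = 5, T = 10^6 steps.
prior, new = switching_cost_terms(10, 20, 5, 10 ** 6)
```

Growing $T$ from $10^6$ to $10^{12}$ doubles $\log T$ but increases $\log\log T$ only by the additive amount $\log 2$.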

Thu 21 July 8:35 - 8:40 PDT

Spotlight

A Simple Unified Framework for High Dimensional Bandit Problems

Wenjie Li ⋅ Adarsh Barik ⋅ Jean Honorio

Stochastic high dimensional bandit problems with low dimensional structures are useful in different applications such as online advertising and drug discovery. In this work, we propose a simple unified algorithm for such problems and present a general analysis framework for the regret upper bound of our algorithm. We show that under some mild unified assumptions, our algorithm can be applied to different high-dimensional bandit problems. Our framework utilizes the low dimensional structure to guide the parameter estimation in the problem. As a result, our algorithm achieves comparable regret bounds in the LASSO bandit as a sanity check, as well as novel bounds that depend logarithmically on dimensions in the low-rank matrix bandit, the group sparse matrix bandit, and in a new problem: the multi-agent LASSO bandit.

Thu 21 July 8:40 - 8:45 PDT

Spotlight

A Reduction from Linear Contextual Bandits Lower Bounds to Estimations Lower Bounds

Jiahao He ⋅ Jiheng Zhang ⋅ Rachel Q. Zhang

Linear contextual bandits and their variants are usually solved using algorithms guided by parameter estimation. The Cauchy-Schwarz inequality establishes that estimation errors dominate algorithm regrets, and thus, accurate estimators suffice to guarantee algorithms with low regrets. In this paper, we complete the reverse direction by establishing the necessity. In particular, we provide a generic transformation from algorithms for linear contextual bandits to estimators for linear models, and show that algorithm regrets dominate estimation errors of their induced estimators, i.e., low-regret algorithms must imply accurate estimators. Moreover, our analysis reduces the regret lower bound to an estimation error, bridging the lower bound analysis in linear contextual bandit problems and linear regression.
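The "sufficiency" direction mentioned above can be illustrated in the simplest case. As an assumed setup (the notation is not taken from the paper), let the expected reward of action $x$ be $x^\top\theta^\star$, let $\hat\theta_t$ be the current estimate, let $x_t = \arg\max_{x\in\mathcal{X}_t} x^\top\hat\theta_t$ be the greedy action, and let $x_t^\star = \arg\max_{x\in\mathcal{X}_t} x^\top\theta^\star$ be the optimal one. The instantaneous regret $r_t$ then satisfies

$$
\begin{aligned}
r_t &= (x_t^\star - x_t)^\top \theta^\star \\
    &= (x_t^\star - x_t)^\top (\theta^\star - \hat\theta_t) + (x_t^\star - x_t)^\top \hat\theta_t \\
    &\le (x_t^\star - x_t)^\top (\theta^\star - \hat\theta_t) \\
    &\le \|x_t^\star - x_t\|_2 \, \|\theta^\star - \hat\theta_t\|_2 .
\end{aligned}
$$

The first inequality uses $(x_t^\star - x_t)^\top\hat\theta_t \le 0$, which holds because $x_t$ maximizes $x^\top\hat\theta_t$ over $\mathcal{X}_t$; the second is the Cauchy-Schwarz inequality. With bounded actions, a small estimation error therefore forces a small per-round regret.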

Thu 21 July 8:45 - 8:50 PDT

Spotlight

Branching Reinforcement Learning

Yihan Du ⋅ Wei Chen

In this paper, we propose a novel Branching Reinforcement Learning (Branching RL) model, and investigate both Regret Minimization (RM) and Reward-Free Exploration (RFE) metrics for this model. Unlike standard RL where the trajectory of each episode is a single $H$-step path, branching RL allows an agent to take multiple base actions in a state such that transitions branch out to multiple successor states correspondingly, and thus it generates a tree-structured trajectory. This model finds important applications in hierarchical recommendation systems and online advertising. For branching RL, we establish new Bellman equations and key lemmas, i.e., branching value difference lemma and branching law of total variance, and also bound the total variance by only $O(H^2)$ under an exponentially-large trajectory. For RM and RFE metrics, we propose computationally efficient algorithms BranchVI and BranchRFE, respectively, and derive nearly matching upper and lower bounds. Our regret and sample complexity results are polynomial in all problem parameters despite exponentially-large trajectories.

Thu 21 July 8:50 - 8:55 PDT

Spotlight

Fast rates for noisy interpolation require rethinking the effect of inductive bias

Konstantin Donhauser ⋅ Nicolò Ruggeri ⋅ Stefan Stojanovic ⋅ Fanny Yang

Good generalization performance on high-dimensional data crucially hinges on a simple structure of the ground truth and a corresponding strong inductive bias of the estimator. Even though this intuition is valid for regularized models, in this paper we caution against a strong inductive bias for interpolation in the presence of noise: While a stronger inductive bias encourages a simpler structure that is more aligned with the ground truth, it also increases the detrimental effect of noise. Specifically, for both linear regression and classification with a sparse ground truth, we prove that minimum $\ell_p$-norm and maximum $\ell_p$-margin interpolators achieve fast polynomial rates close to order $1/n$ for $p > 1$ compared to a logarithmic rate for $p = 1$. Finally, we provide preliminary experimental evidence that this trade-off may also play a crucial role in understanding non-linear interpolating models used in practice.

Thu 21 July 8:55 - 9:00 PDT

Spotlight

Near-Optimal Algorithms for Autonomous Exploration and Multi-Goal Stochastic Shortest Path

Haoyuan Cai ⋅ Tengyu Ma ⋅ Simon Du

We revisit the incremental autonomous exploration problem proposed by Lim and Auer (2012). In this setting, the agent aims to learn a set of near-optimal goal-conditioned policies to reach the $L$-controllable states: states that are incrementally reachable from an initial state $s_0$ within $L$ steps in expectation. We introduce a new algorithm with stronger sample complexity bounds than existing ones. Furthermore, we also prove the first lower bound for the autonomous exploration problem. In particular, the lower bound implies that our proposed algorithm, Value-Aware Autonomous Exploration, is nearly minimax-optimal when the number of $L$-controllable states grows polynomially with respect to $L$. Key in our algorithm design is a connection between autonomous exploration and multi-goal stochastic shortest path, a new problem that naturally generalizes the classical stochastic shortest path problem. This new problem and its connection to autonomous exploration can be of independent interest.
