### Session

## Reinforcement Learning and Bandits

Moderator: Marcello Restelli

**Instabilities of Offline RL with Pre-Trained Neural Representation**

Ruosong Wang · Yifan Wu · Ruslan Salakhutdinov · Sham Kakade

In offline reinforcement learning (RL), we seek to utilize offline data to evaluate (or learn) policies in scenarios where the data are collected from a distribution that substantially differs from that of the target policy to be evaluated. Recent theoretical advances have shown that such sample-efficient offline RL is indeed possible provided certain strong representational conditions hold, else there are lower bounds exhibiting exponential error amplification (in the problem horizon) unless the data collection distribution has only a mild distribution shift relative to the target policy. This work studies these issues from an empirical perspective to gauge how stable offline RL methods are. In particular, our methodology explores these ideas when using features from pre-trained neural networks, in the hope that these representations are powerful enough to permit sample efficient offline RL. Through extensive experiments on a range of tasks, we see that substantial error amplification does occur even when using such pre-trained representations (trained on the same task itself); we find offline RL is stable only under extremely mild distribution shift. The implications of these results, both from a theoretical and an empirical perspective, are that successful offline RL (where we seek to go beyond the low distribution shift regime) requires substantially stronger conditions beyond those which suffice for successful supervised learning.

**Path Planning using Neural A* Search**

Ryo Yonetani · Tatsunori Taniai · Mohammadamin Barekatain · Mai Nishimura · Asako Kanezaki

We present Neural A*, a novel data-driven search method for path planning problems. Despite the recent increasing attention to data-driven path planning, machine learning approaches to search-based planning are still challenging due to the discrete nature of search algorithms. In this work, we reformulate a canonical A* search algorithm to be differentiable and couple it with a convolutional encoder to form an end-to-end trainable neural network planner. Neural A* solves a path planning problem by encoding a problem instance to a guidance map and then performing the differentiable A* search with the guidance map. By learning to match the search results with ground-truth paths provided by experts, Neural A* can produce a path consistent with the ground truth accurately and efficiently. Our extensive experiments confirmed that Neural A* outperformed state-of-the-art data-driven planners in terms of the search optimality and efficiency trade-off. Furthermore, Neural A* successfully predicted realistic human trajectories by directly performing search-based planning on natural image inputs.

**Group-Sparse Matrix Factorization for Transfer Learning of Word Embeddings**

Kan Xu · Xuanyi Zhao · Hamsa Bastani · Osbert Bastani

Sparse regression has recently been applied to enable transfer learning from very limited data. We study an extension of this approach to unsupervised learning---in particular, learning word embeddings from unstructured text corpora using low-rank matrix factorization. Intuitively, when transferring word embeddings to a new domain, we expect that the embeddings change for only a small number of words---e.g., the ones with novel meanings in that domain. We propose a novel group-sparse penalty that exploits this sparsity to perform transfer learning when there is very little text data available in the target domain---e.g., a single article of text. We prove generalization bounds for our algorithm. Furthermore, we empirically evaluate its effectiveness, both in terms of prediction accuracy in downstream tasks as well as in terms of interpretability of the results.

**Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning**

Gen Li · Changxiao Cai · Yuxin Chen · Yuantao Gu · Yuting Wei · Yuejie Chi

Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. Focusing on the synchronous setting (such that independent samples for all state-action pairs are queried via a generative model in each iteration), substantial progress has been made recently towards understanding the sample efficiency of Q-learning. To yield an entrywise $\varepsilon$-accurate estimate of the optimal Q-function, state-of-the-art theory requires at least an order of $\frac{|S||A|}{(1-\gamma)^5\varepsilon^{2}}$ samples in the infinite-horizon $\gamma$-discounted setting. In this work, we sharpen the sample complexity of synchronous Q-learning to the order of $\frac{|S||A|}{(1-\gamma)^4\varepsilon^2}$ (up to some logarithmic factor) for any $0<\varepsilon <1$, leading to an order-wise improvement in $\frac{1}{1-\gamma}$. Analogous results are derived for finite-horizon MDPs as well. Notably, our sample complexity analysis unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage. Our result is obtained by identifying novel error decompositions and recursion relations, which might shed light on how to study other variants of Q-learning.

**Solving Challenging Dexterous Manipulation Tasks With Trajectory Optimisation and Reinforcement Learning**

Henry Charlesworth · Giovanni Montana

Training agents to autonomously control anthropomorphic robotic hands has the potential to lead to systems capable of performing a multitude of complex manipulation tasks in unstructured and uncertain environments. In this work, we first introduce a suite of challenging simulated manipulation tasks where current reinforcement learning and trajectory optimisation techniques perform poorly. These include environments where two simulated hands have to pass or throw objects between each other, as well as an environment where the agent must learn to spin a long pen between its fingers. We then introduce a simple trajectory optimisation algorithm that performs significantly better than existing methods on these environments. Finally, on the most challenging ``PenSpin" task, we combine sub-optimal demonstrations generated through trajectory optimisation with off-policy reinforcement learning, obtaining performance that far exceeds either of these approaches individually. Videos of all of our results are available at: https://dexterous-manipulation.github.io

**Continuous-time Model-based Reinforcement Learning**

Cagatay Yildiz · Markus Heinonen · Harri Lähdesmäki

Model-based reinforcement learning (MBRL) approaches rely on discrete-time state transition models whereas physical systems and the vast majority of control tasks operate in continuous-time. To avoid time-discretization approximation of the underlying process, we propose a continuous-time MBRL framework based on a novel actor-critic method. Our approach also infers the unknown state evolution differentials with Bayesian neural ordinary differential equations (ODE) to account for epistemic uncertainty. We implement and test our method on a new ODE-RL suite that explicitly solves continuous-time control systems. Our experiments illustrate that the model is robust against irregular and noisy data, and can solve classic control problems in a sample-efficient manner.

**Bayesian Optimistic Optimisation with Exponentially Decaying Regret**

Hung Tran-The · Sunil Gupta · Santu Rana · Svetha Venkatesh

Bayesian optimisation (BO) is a well known algorithm for finding the global optimum of expensive, black-box functions. The current practical BO algorithms have regret bounds ranging from $\mathcal{O}(\frac{logN}{\sqrt{N}})$ to $\mathcal O(e^{-\sqrt{N}})$, where $N$ is the number of evaluations. This paper explores the possibility of improving the regret bound in the noise-free setting by intertwining concepts from BO and optimistic optimisation methods which are based on partitioning the search space. We propose the BOO algorithm, a first practical approach which can achieve an exponential regret bound with order $\mathcal O(N^{-\sqrt{N}})$ under the assumption that the objective function is sampled from a Gaussian process with a Mat\'ern kernel with smoothness parameter $\nu > 4 +\frac{D}{2}$, where $D$ is the number of dimensions. We perform experiments on optimisation of various synthetic functions and machine learning hyperparameter tuning tasks and show that our algorithm outperforms baselines.

**Best Model Identification: A Rested Bandit Formulation**

Leonardo Cella · Massimiliano Pontil · Claudio Gentile

We introduce and analyze a best arm identification problem in the rested bandit setting, wherein arms are themselves learning algorithms whose expected losses decrease with the number of times the arm has been played. The shape of the expected loss functions is similar across arms, and is assumed to be available up to unknown parameters that have to be learned on the fly. We define a novel notion of regret for this problem, where we compare to the policy that always plays the arm having the smallest expected loss at the end of the game. We analyze an arm elimination algorithm whose regret vanishes as the time horizon increases. The actual rate of convergence depends in a detailed way on the postulated functional form of the expected losses. We complement our analysis with lower bounds, indicating strengths and limitations of the proposed solution.

**Global Convergence of Policy Gradient for Linear-Quadratic Mean-Field Control/Game in Continuous Time**

Weichen Wang · Jiequn Han · Zhuoran Yang · Zhaoran Wang

Recent years have witnessed the success of multi-agent reinforcement learning, which has motivated new research directions for mean-field control (MFC) and mean-field game (MFG), as the multi-agent system can be well approximated by a mean-field problem when the number of agents grows to be very large. In this paper, we study the policy gradient (PG) method for the linear-quadratic mean-field control and game, where we assume each agent has identical linear state transitions and quadratic cost functions. While most recent works on policy gradient for MFC and MFG are based on discrete-time models, we focus on a continuous-time model where some of our analyzing techniques could be valuable to the interested readers. For both the MFC and the MFG, we provide PG update and show that it converges to the optimal solution at a linear rate, which is verified by a synthetic simulation. For the MFG, we also provide sufficient conditions for the existence and uniqueness of the Nash equilibrium.