Theoretical Foundations of Reinforcement Learning

Emma Brunskill, Thodoris Lykouris, Max Simchowitz, Wen Sun, Mengdi Wang

Keywords:  representation learning    reinforcement learning    sample-efficient exploration    policy gradient    bandits    safety in RL    human-in-the-loop RL    multi-agent RL    off-policy learning  


In many settings, such as education, healthcare, drug design, robotics, transportation, and achieving better-than-human performance in strategic games, it is important to make decisions sequentially. This poses two interconnected algorithmic and statistical challenges: effectively exploring to learn information about the underlying dynamics, and effectively planning using this information. Reinforcement Learning (RL) is the main paradigm that tackles both of these challenges simultaneously, which is essential in the aforementioned applications. Over the last few years, reinforcement learning has seen enormous progress, both in solidifying our understanding of its theoretical underpinnings and in applying these methods in practice.

This workshop aims to highlight recent theoretical contributions, with an emphasis on addressing significant challenges on the road ahead. Such theoretical understanding is important in order to design algorithms with robust and compelling performance in real-world applications. As part of the ICML 2020 conference, this workshop will be held virtually. It will feature keynote talks from six reinforcement learning experts, each tackling a different significant facet of RL. It will also offer the opportunity for contributed material (see below for the call for papers and our outstanding program committee). The authors of each accepted paper will prerecord a 10-minute presentation and will also appear in a poster session. Finally, the workshop will have a panel discussing important challenges on the road ahead.



Fri 6:30 a.m. - 7:15 a.m.

Practical reinforcement learning algorithms often face the "deadly triad" [Rich Sutton, 2015]: function approximation, data efficiency (e.g. by bootstrapping value function estimates), and exploration (e.g. by off-policy learning). Algorithms that address two of these without the third often work well, while trying to address all three leads to highly unstable algorithms in practice. This talk considers a policy gradient approach to alleviating these issues. In particular, we introduce the Policy Cover-Policy Gradient (PC-PG) algorithm, which provably balances the exploration vs. exploitation tradeoff, with polynomial sample complexity, using an ensemble of learned policies (the policy cover). We quantify how the relevant notion of function approximation is based on an approximation error term under distribution shift. Furthermore, we will provide simple examples where a number of standard (and provable) RL approaches are less robust when it comes to function approximation. Time permitting, we will discuss the implications this has for more effective data re-use.
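For readers new to policy gradients, the following is a minimal sketch of the vanilla softmax policy-gradient (REINFORCE) update on a one-step problem. It is not the PC-PG algorithm from the talk, and the arm rewards, step counts, and learning rate are made-up illustrative values.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce(arm_rewards, steps=5000, lr=0.1, seed=0):
    """Vanilla REINFORCE on a one-step problem with a softmax policy
    over arms and deterministic rewards. The score function at the
    sampled arm a is (1{i == a} - pi_i)."""
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs)[0]
        r = arm_rewards[a]
        # gradient ascent on expected reward
        for i in range(len(theta)):
            theta[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])
    return softmax(theta)
```

Running `reinforce([0.2, 1.0, 0.5])` concentrates the policy on the highest-reward arm; PC-PG's contribution is how exploration is driven by a cover of such learned policies, which this sketch does not attempt.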

Joint work with: Alekh Agarwal, Mikael Henaff, and Wen Sun.

Sham Kakade
Fri 7:20 a.m. - 8:05 a.m.

The principle of optimism in the face of uncertainty underpins many theoretically successful reinforcement learning algorithms. In this paper we provide a general framework for designing, analyzing and implementing such algorithms in the episodic reinforcement learning problem. This framework is built upon Lagrangian duality, and demonstrates that every model-optimistic algorithm that constructs an optimistic MDP has an equivalent representation as a value-optimistic dynamic programming algorithm. These two classes of algorithms were typically thought to be distinct, with model-optimistic algorithms benefiting from a cleaner probabilistic analysis and value-optimistic algorithms being easier to implement and thus more practical. With the framework developed in this paper, we show that it is possible to get the best of both worlds by providing a class of algorithms which have a computationally efficient dynamic-programming implementation and also a simple probabilistic analysis. Besides being able to capture many existing algorithms in the tabular setting, our framework can also address large-scale problems under realizable function approximation, where it enables a simple model-based analysis of some recently proposed methods.
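For readers unfamiliar with the value-optimistic side of this equivalence, here is a minimal sketch of UCB-style optimistic value iteration on an empirical model. The bonus form c/sqrt(visits) and the clipping at H are standard illustrative choices, not the construction from the paper.

```python
import math

def optimistic_vi(n, r_sum, p_counts, H, c=1.0):
    """Value-optimistic dynamic programming on an empirical model:
    add an exploration bonus c / sqrt(visits) to each estimated reward,
    then run H steps of value iteration.
    n[s][a]            : visit counts
    r_sum[s][a]        : summed observed rewards (in [0, 1] per step)
    p_counts[s][a][s2] : observed transition counts"""
    S, A = len(n), len(n[0])
    V = [0.0] * S
    for _ in range(H):
        Q = [[0.0] * A for _ in range(S)]
        for s in range(S):
            for a in range(A):
                m = max(n[s][a], 1)
                r_hat = r_sum[s][a] / m                # empirical reward
                bonus = c / math.sqrt(m)               # optimism bonus
                ev = sum(p_counts[s][a][s2] / m * V[s2] for s2 in range(S))
                Q[s][a] = min(r_hat + bonus + ev, H)   # clip at max return
        V = [max(Q[s]) for s in range(S)]
    return V
```

Setting c = 0 recovers plain empirical value iteration; with c > 0 the computed values dominate the plain ones, which is the optimism property the analysis relies on.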

Gergely Neu
Fri 8:10 a.m. - 9:25 a.m.
Poster Session 1 (Poster Session)
Fri 9:30 a.m. - 10:25 a.m.
Speaker Panel (Panel)
Csaba Szepesvari, Martha White, Sham Kakade, Gergely Neu, Shipra Agrawal, Akshay Krishnamurthy
Fri 10:30 a.m. - 11:15 a.m.

The goal of the talk is to discuss our recent work on an off-policy policy gradient theorem, and how it can help leverage theory for the on-policy setting for use in the off-policy setting. The key insight is to provide a more general objective for the off-policy setting, that encompasses the on-policy episodic objective. These simple generalizations make it straightforward to port and generalize ideas.

Martha White
Fri 11:20 a.m. - 11:35 a.m.

We study regret minimization in stochastic structured bandits. The fact that the popular optimistic algorithms do not achieve asymptotic instance-dependent regret optimality has recently attracted researchers' attention. On the other hand, it is known that one can achieve bounded regret (i.e., regret that does not grow indefinitely with the time horizon) in certain instances. Unfortunately, existing asymptotically optimal algorithms rely on forced sampling, which introduces an additional term growing with the time horizon into their regret, failing to adapt to the "easiness" of the instance. In this paper, we focus on the finite hypothesis class case and ask if one can achieve asymptotic optimality while enjoying bounded regret whenever possible. We provide a positive answer via a new algorithm called CRush Optimism with Pessimism (CROP). Our analysis shows that CROP achieves constant-factor asymptotic optimality and, thanks to its forced-exploration-free design, adapts to bounded regret; moreover, its regret bound scales not with the number of arms but with an effective number of arms that we introduce. We also show that CROP can be exponentially better than existing algorithms in nonasymptotic regimes. Finally, we observe that even a clairvoyant oracle who plays according to the asymptotically optimal arm-pull scheme may suffer a linear worst-case regret, indicating that it may not be the end of optimism. We believe our work may inspire a new family of algorithms for bandits and reinforcement learning.
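As a point of reference, this is a sketch of UCB1, the classical optimistic baseline that the abstract contrasts with; it is not the CROP algorithm, and the arm means and horizon below are made up for illustration.

```python
import math
import random

def ucb1(means, T=5000, seed=1):
    """UCB1 on Bernoulli arms: pull the arm maximizing
    empirical mean + sqrt(2 log t / pulls). Returns pull counts."""
    rng = random.Random(seed)
    K = len(means)
    counts, sums = [0] * K, [0.0] * K
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1  # pull each arm once to initialize
        else:
            a = max(range(K),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        r = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += r
    return counts
```

UCB1's optimism yields O(log T) regret in every instance, but, as the abstract notes, it does not match the instance-dependent asymptotic lower bound in structured problems, which is the gap CROP addresses.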

Kwang-Sung Jun, Chicheng Zhang

Kwang-Sung Jun
Fri 11:35 a.m. - 11:50 a.m.

We introduce the technique of adaptive discretization to design efficient model-based episodic reinforcement learning algorithms in large (potentially continuous) state-action spaces. Our algorithm is based on optimistic one-step value iteration extended to maintain an adaptive discretization of the space. From a theoretical perspective, we provide worst-case regret bounds for our algorithm, which are competitive compared to the state-of-the-art RL algorithms; moreover, our bounds are obtained via a modular proof technique, which can potentially extend to incorporate additional structure on the problem. Our algorithm has much lower storage and computational requirements, due to maintaining a more efficient partition of the state and action spaces. We illustrate this via experiments on several canonical control problems, which shows that our algorithm empirically performs significantly better than fixed discretization in terms of both faster convergence and lower memory usage.
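The core idea of adaptive discretization, refining cells only where data concentrates, can be sketched on a one-dimensional space as follows. The split-on-count rule and threshold are illustrative assumptions, not the paper's algorithm, which couples refinement with optimistic value iteration.

```python
def adaptive_partition(samples, split_at=8):
    """Adaptively refine a partition of [0, 1): once a cell has received
    `split_at` samples, replace it with two half-width children, so
    frequently-visited regions end up with finer cells."""
    cells = [(0.0, 1.0)]          # half-open intervals [lo, hi)
    counts = {(0.0, 1.0): 0}
    for x in samples:             # samples assumed to lie in [0, 1)
        lo, hi = next(c for c in cells if c[0] <= x < c[1])
        counts[(lo, hi)] += 1
        if counts[(lo, hi)] >= split_at:
            mid = (lo + hi) / 2   # refine: split the cell in half
            cells.remove((lo, hi))
            del counts[(lo, hi)]
            cells += [(lo, mid), (mid, hi)]
            counts[(lo, mid)] = 0
            counts[(mid, hi)] = 0
    return sorted(cells)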

Sean R. Sinclair, Tianyu Wang, Gauri Jain, Sid Banerjee, Christina Yu

Sean Sinclair
Fri 11:50 a.m. - 12:05 p.m.

In this work, we propose KeRNS: an algorithm for episodic reinforcement learning in non-stationary Markov Decision Processes (MDPs) whose state-action set is endowed with a metric. Using a non-parametric model of the MDP built with time-dependent kernels, we prove a regret bound that scales with the covering dimension of the state-action space and the total variation of the MDP with time, which quantifies its level of non-stationarity. Our method generalizes previous approaches based on sliding windows and exponential discounting used to handle changing environments.
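The flavor of a time-dependent kernel estimator can be sketched as a kernel regression that downweights observations by both spatial distance and elapsed time. The Gaussian space kernel and exponential time discount below are illustrative choices, not the KeRNS estimator itself.

```python
import math

def time_kernel_estimate(data, x_query, t_query, bandwidth=0.1, decay=0.1):
    """Kernel estimate combining a Gaussian kernel in space with
    exponential discounting in time, so stale observations fade out
    in a non-stationary environment.
    data: list of (t, x, y) observations with t <= t_query."""
    num = den = 0.0
    for t, x, y in data:
        w = math.exp(-((x - x_query) ** 2) / (2 * bandwidth ** 2))  # space
        w *= math.exp(-decay * (t_query - t))                       # time
        num += w * y
        den += w
    return num / den if den > 0 else 0.0
```

If the target quantity drifts over time, the estimate tracks the recent value rather than averaging over the whole history, which is the adaptation to non-stationarity that sliding windows and exponential discounting both provide.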

Omar Darwiche Domingues, Pierre MENARD, Matteo Pirotta, Emilie Kaufmann, Michal Valko

Omar Darwiche Domingues
Fri 12:05 p.m. - 12:20 p.m.

We consider regret minimization for online control with time-varying linear dynamical systems. The metric of performance we study is adaptive policy regret, or regret compared to the best policy on {\it any interval in time}. We give an efficient algorithm that attains first-order adaptive regret guarantees for the setting of online convex optimization with memory, subsequently used to derive a controller with such guarantees. We show that these bounds are nearly tight and validate these theoretical findings experimentally on simulations of time-varying dynamics and disturbances.

Paula Gradu, Elad Hazan, Edgar Minasyan

Edgar Minasyan
Fri 12:20 p.m. - 12:35 p.m.

This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms, which learn the optimal policy by playing against themselves without any direct supervision. In a tabular episodic Markov game with S states, A max-player actions and B min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires Õ(S²AB) steps of game playing, when only highlighting the dependency on (S, A, B). In contrast, the best existing lower bound scales as Ω(S(A+B)), leaving a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the Nash Q-learning algorithm with sample complexity Õ(SAB), and a new Nash V-learning algorithm with sample complexity Õ(S(A+B)). The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. We complement our upper bounds with a computational hardness result for achieving sublinear regret when playing against adversarial opponents in Markov games.
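The self-play idea in its simplest form can be sketched on a single-stage zero-sum matrix game via fictitious play, where each player repeatedly best-responds to the opponent's empirical play. This is only an illustration of learning an equilibrium through self-play; the paper's Nash Q-learning and Nash V-learning algorithms are different.

```python
def matrix_game_value(A, iters=20000):
    """Approximate the value of a two-player zero-sum matrix game
    (row maximizes, column minimizes) by fictitious play. Returns the
    estimated value and the empirical mixed strategies."""
    m, n = len(A), len(A[0])
    row_counts, col_counts = [0] * m, [0] * n
    i, j = 0, 0
    for _ in range(iters):
        row_counts[i] += 1
        col_counts[j] += 1
        # max-player best response to the column player's empirical mix
        i = max(range(m), key=lambda r: sum(col_counts[c] * A[r][c] for c in range(n)))
        # min-player best response to the row player's empirical mix
        j = min(range(n), key=lambda c: sum(row_counts[r] * A[r][c] for r in range(m)))
    x = [c / iters for c in row_counts]
    y = [c / iters for c in col_counts]
    value = sum(x[r] * y[c] * A[r][c] for r in range(m) for c in range(n))
    return value, x, y
```

On matching pennies the empirical strategies approach the uniform Nash equilibrium and the estimated value approaches 0; the episodic Markov game setting layers such stage-game reasoning over learned state values.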

Yu Bai, Chi Jin, Tiancheng Yu

Tiancheng Yu
Fri 12:35 p.m. - 12:50 p.m.

The literature on ranking from ordinal data is vast, and there are several ways to aggregate overall preferences from pairwise comparisons between objects. In particular, it is well-known that any Nash equilibrium of the zero-sum game induced by the preference matrix defines a natural solution concept (winning distribution over objects) known as a von Neumann winner. Many real-world problems, however, are inevitably multi-criteria, with different pairwise preferences governing the different criteria. In this work, we generalize the notion of a von Neumann winner to the multi-criteria setting by taking inspiration from Blackwell’s approachability. Our framework allows for non-linear aggregation of preferences across criteria, and generalizes the linearization-based approach from multi-objective optimization.

From a theoretical standpoint, we show that the Blackwell winner of a multi-criteria problem instance can be computed as the solution to a convex optimization problem. Furthermore, given random samples of pairwise comparisons, we show that a simple, "plug-in" estimator achieves (near-)optimal minimax sample complexity. Finally, we showcase the practical utility of our framework in a user study on autonomous driving, where we find that the Blackwell winner outperforms the von Neumann winner for the overall preferences.
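To make the single-criterion starting point concrete, here is a sketch that approximates a von Neumann winner by multiplicative-weights self-play on the centered preference matrix. The preference matrices and step size are illustrative, and the Blackwell winner of the paper generalizes this to multiple criteria.

```python
import math

def von_neumann_winner(P, iters=3000, eta=0.1):
    """Approximate a von Neumann winner: a distribution over objects whose
    expected win rate against any single object is at least 1/2. Runs
    multiplicative weights on the skew-symmetric payoffs P[i][j] - 1/2
    and returns the time-averaged mixture."""
    n = len(P)
    w = [1.0] * n
    avg = [0.0] * n
    for _ in range(iters):
        s = sum(w)
        x = [wi / s for wi in w]
        avg = [a + xi / iters for a, xi in zip(avg, x)]
        # advantage of each pure object against the current mixture
        adv = [sum((P[i][j] - 0.5) * x[j] for j in range(n)) for i in range(n)]
        w = [wi * math.exp(eta * g) for wi, g in zip(w, adv)]
    return avg
```

For a cyclic rock-paper-scissors preference matrix the winner is the uniform distribution, while a dominant object collects almost all of the mass, matching the equilibrium intuition in the abstract.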

Kush Bhatia, Ashwin Pananjady, Peter Bartlett, Anca Dragan, Martin Wainwright

Kush Bhatia
Fri 1:00 p.m. - 2:15 p.m.
Poster Session 2 (Poster Session)
Fri 2:20 p.m. - 3:05 p.m.

I will discuss new provably efficient algorithms for reinforcement learning in rich observation environments with arbitrarily large state spaces. Both algorithms operate by learning succinct representations of the environment, which they use in an exploration module to acquire new information. The first algorithm, called Homer, operates in a block MDP model and uses a contrastive learning objective to learn the representation. The second algorithm, called FLAMBE, operates in a much richer class of low-rank MDPs and is model-based. Both algorithms accommodate nonlinear function approximation and enjoy provable sample and computational efficiency guarantees.

Akshay Krishnamurthy
Fri 3:10 p.m. - 3:55 p.m.

We consider a novel formulation of the dynamic pricing and demand learning problem, where the evolution of demand in response to posted prices is governed by a stochastic variant of the popular Bass model with parameters (α, β) that are linked to the so-called "innovation" and "imitation" effects. Unlike the more commonly used i.i.d. demand models, in this model the posted price not only affects the demand and the revenue in the current round but also the evolution of demand, and hence the fraction of market potential that can be captured, in future rounds. Finding a revenue-maximizing dynamic pricing policy in this model is non-trivial even when model parameters are known, and requires solving for the optimal non-stationary policy of a continuous-time, continuous-state MDP. In this paper, we consider a more challenging problem where dynamic pricing is used in conjunction with learning the model parameters, with the objective of optimizing the cumulative revenues over a given selling horizon. Our main contribution is an algorithm with a regret guarantee of O(m^{2/3}), where m is mnemonic for the (known) market size, along with a matching lower bound.
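The innovation/imitation dynamics can be sketched with a discrete-time stochastic simulation, where each remaining customer adopts with a probability that grows with the adopted fraction and is damped by price. The exponential price-damping form and all parameter values are illustrative assumptions, not the paper's continuous-time model.

```python
import math
import random

def simulate_bass(m, alpha, beta, gamma, prices, seed=0):
    """Discrete-time sketch of stochastic Bass-style demand: each of the
    m - adopted remaining customers adopts in a round with probability
    (alpha + beta * adopted / m) * exp(-gamma * price), where alpha is the
    "innovation" effect, beta the "imitation" effect, and gamma a
    hypothetical price-sensitivity parameter. Returns per-round sales
    and total adopters."""
    rng = random.Random(seed)
    adopted = 0
    sales = []
    for p in prices:
        rate = min((alpha + beta * adopted / m) * math.exp(-gamma * p), 1.0)
        new = sum(1 for _ in range(m - adopted) if rng.random() < rate)
        adopted += new
        sales.append(new)
    return sales, adopted
```

The simulation makes the key tension visible: a high price slows current sales and also delays the imitation effect, so pricing decisions propagate into future demand, which is what makes the learning problem harder than i.i.d. demand.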

Shipra Agrawal
Fri 4:00 p.m. - 4:45 p.m.

Large-scale Markov decision processes (MDPs) require planning algorithms with runtime independent of the number of states of the MDP. We consider the planning problem in MDPs using linear value function approximation with only weak requirements: low approximation error for the optimal value function, and a small set of "core" states whose features span those of other states. In particular, we make no assumptions about the representability of policies or value functions of non-optimal policies. Our algorithm produces almost-optimal actions for any state using a generative oracle (simulator) for the MDP, while its computation time scales polynomially with the number of features, core states, and actions, and with the effective horizon. I will discuss how this is achieved, some selected parts of the vast related literature, and what remains open.
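The basic mechanism of generalizing from a small set of states can be sketched with a least-squares fit of a linear value function from values known at "core" states, here in the two-feature case via the normal equations. This is only an illustration of linear value function approximation, not the planning algorithm from the talk.

```python
def fit_linear_values(features, values):
    """Least-squares fit of a linear value function v(s) ≈ w · phi(s)
    from (feature, value) pairs at a few core states, for 2-d features,
    by solving the normal equations (Phi^T Phi) w = Phi^T v directly."""
    a11 = sum(f[0] * f[0] for f in features)
    a12 = sum(f[0] * f[1] for f in features)
    a22 = sum(f[1] * f[1] for f in features)
    b1 = sum(f[0] * v for f, v in zip(features, values))
    b2 = sum(f[1] * v for f, v in zip(features, values))
    det = a11 * a22 - a12 * a12  # assumes features span the 2-d space
    return [(b1 * a22 - b2 * a12) / det, (b2 * a11 - b1 * a12) / det]
```

Once the weight vector is fit on core states, any other state whose features lie in their span gets a value estimate for free, which is why the runtime can depend on the number of core states rather than the number of states.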

Joint work with Roshan Shariff

Csaba Szepesvari