Timezone: »

 
Workshop
Workshop on Reinforcement Learning Theory
Shipra Agrawal · Simon Du · Niao He · Csaba Szepesvari · Lin Yang

Sat Jul 24 09:00 AM -- 04:25 PM (PDT) @ None
Event URL: https://lyang36.github.io/icml2021_rltheory/ »

While over many years we have witnessed numerous impressive demonstrations of the power of various reinforcement learning (RL) algorithms, and while much progress was made on the theoretical side as well, the theoretical understanding of the challenges that underlie RL is still rather limited. The best-studied problem settings, such as learning and acting in finite state-action Markov decision processes, or simple linear control systems fail to capture the essential characteristics of seemingly more practically relevant problem classes, where the size of the state-action space is often astronomical, the planning horizon is huge, the dynamics is complex, interaction with the controlled system is not permitted, or learning has to happen based on heterogeneous offline data, etc. To tackle these diverse issues, more and more theoreticians with a wide range of backgrounds came to study RL and have proposed numerous new models along with exciting novel developments on both algorithm design and analysis. The workshop's goal is to highlight advances in theoretical RL and bring together researchers from different backgrounds to discuss RL theory from different perspectives: modeling, algorithm, analysis, etc.

Sat 9:00 a.m. - 9:25 a.m.
  

This talk will be about some recent works on the sample complexity of pure-exploration in an episodic MDP, when we sequentially collect trajectories, with or without rewards, and aim at identifying good policies. In particular, I will describe an optimal algorithm for reward-free exploration called RF-Express, which interestingly features some exploration bonuses that scale in 1/n instead of the usual 1/\sqrt{n}.

Emilie Kaufmann
Sat 9:30 a.m. - 9:55 a.m.
 link »   

Iterative methods for approximating zero-sum Nash equilibria in extensive-form games have been a core component of recent advances in superhuman poker AIs. In this talk, I will first give an optimization-oriented description of how these methods work. Then, I will discuss and contrast two recent results: First, the development of a new entropy-based regularization method for the decision spaces associated with extensive-form games, which is simultaneously simpler to analyze and has better theoretical properties than the current state of the art. Second, I will discuss new algorithms based on optimistic variants of regret matching and CFR, which lead to very strong practical performance, in spite of inferior theoretical guarantees.

This talk is based on joint work with Gabriele Farina and Tuomas Sandholm.

Christian Kroer
Sat 10:00 a.m. - 10:12 a.m.
[ Visit Poster at Spot C4 in Virtual World ]   

A fundamental concept in control theory is that of controllability, where any system state can be reached through an appropriate choice of control inputs. Indeed, a large body of classical and modern approaches are designed for controllable linear dynamical systems. However, in practice, we often encounter systems in which a large set of state variables evolve exogenously and independently of the control inputs; such systems are only \emph{partially controllable}. The focus of this work is on a large class of partially controllable linear dynamical system, specified by an underlying sparsity pattern. Our main results establish structural conditions and finite-sample guarantees for learning to control such systems. In particular, our structural results characterize those state variables which are irrelevant for optimal control, an analysis which departs from classical control techniques. Our algorithmic results adapt techniques from high-dimensional statistics---specifically soft-thresholding and semiparametric least-squares---to exploit the underlying sparsity pattern in order to obtain finite-sample guarantees that significantly improve over those based on certainty-equivalence. We also corroborate these theoretical improvements over certainty-equivalent control through a simulation study.

Yonathan Efroni, Sham Kakade, Akshay Krishnamurthy, Cyril Zhang
Sat 10:15 a.m. - 10:27 a.m.
[ Visit Poster at Spot B3 in Virtual World ]   

We study a theory of reinforcement learning (RL) in which the learner receives binary feedback only once at the end of an episode. While this is an extreme test case for theory, it is also arguably more representative of real-world applications than the traditional requirement in RL practice that the learner receive feedback at every time step. Indeed, in many real-world applications of reinforcement learning, such as self-driving cars and robotics, it is easier to evaluate whether a learner's complete trajectory was either "good" or "bad," but harder to provide a reward signal at each step. To show that learning is possible in this more challenging setting, we study the case where trajectory labels are generated by an unknown parametric model, and provide a statistically and computationally efficient algorithm that achieves sub-linear regret.

Niladri Chatterji, Aldo Pacchiano, Peter Bartlett, Michael Jordan
Sat 10:30 a.m. - 10:42 a.m.
[ Visit Poster at Spot A2 in Virtual World ]   

We introduce a generic template for developing regret minimization algorithms in the Stochastic Shortest Path (SSP) model, which achieves minimax optimal regret as long as certain properties are ensured. The key of our analysis is a new technique called implicit finite-horizon approximation, which approximates the SSP model by a finite-horizon counterpart only in the analysis without explicit implementation. Using this template, we develop two new algorithms: the first one is model-free (the first in the literature to our knowledge) and minimax optimal under strictly positive costs; the second one is model-based and minimax optimal even with zero-cost state-action pairs, matching the best existing result from (Tarbouriech et al., 2021b). Importantly, both algorithms admit highly sparse updates, making them computationally more efficient than all existing algorithms. Moreover, both can be made completely parameter-free.

Liyu Chen, Mehdi Jafarnia, Rahul Jain, Haipeng Luo
Sat 10:45 a.m. - 10:57 a.m.
[ Visit Poster at Spot A3 in Virtual World ]   

Actor-critic methods are widely used in offline reinforcement learning practice but are understudied theoretically. In this work we show that the pessimism principle can be naturally incorporated into actor-critic formulations. We create an offline actor-critic algorithm for a linear MDP model more general than the low-rank model. The procedure is both minimax optimal and computationally tractable.

Andrea Zanette, Martin Wainwright, Emma Brunskill
Sat 11:00 a.m. - 11:25 a.m.
  

TBD

Animashree Anandkumar
Sat 11:30 a.m. - 11:55 a.m.
  

We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took. While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the agent's action, this criterion might lead to undesirable results. For example, in large problems, or when the interaction with the environment is brief, finding an optimal arm is infeasible, and regret-minimizing algorithms tend to over-explore. To overcome this issue, algorithms for such settings should instead focus on playing near-optimal arms. To this end, we suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some ϵ. We then present a variant of the Thompson Sampling (TS) algorithm, called ϵ-TS, and prove its asymptotic optimality in terms of the lenient regret. Importantly, we show that when the mean of the optimal arm is high enough, the lenient regret of ϵ-TS is bounded by a constant. Finally, we show that ϵ-TS can be applied to improve the performance when the agent knows a lower bound of the suboptimality gaps.

Shie Mannor
Sat 12:00 p.m. - 12:30 p.m.
Social Session (Discussion & Chat)
Sat 12:30 p.m. - 2:00 p.m.
Poster Session - I (Poster Session)
Sat 2:00 p.m. - 2:25 p.m.
  

Policy gradient is one of the state-of-the-art algorithm families in reinforcement learning, which has been proved to be globally convergent. Motivated by properties of the accumulated reward in MDP, we propose a non-uniform refinement of the smoothness (NS) and \L{}ojasiewicz condition (N\L{}). The new definitions inspire new geometry-aware first-order policy gradient that are able to converge to global optimality in linear rate while incurring less overhead than existing algorithms, e.g., natural/mirror policy gradient. Similarly, For GLM, we show that geometry-aware normalized gradient descent can also achieve a linear convergence rate in fitting generalized linear models. Experimental results are used to illustrate and complement the theoretical findings.

Bo Dai
Sat 2:30 p.m. - 2:55 p.m.
  

We consider on-policy reinforcement learning for two-player zero-sum Markov games with simultaneous moves, which generalizes both matrix games and Markov decision processes. To ensure exploration, our algorithm needs to construct upper and lower confidence bounds for both players in each round. We show that this can be done by finding the Coarse Correlated Equilibrium of an appropriate general-sum matrix game, which can be solved in polynomial time. The algorithm applies to the offline setting where one aims to find the Nash Equilibria of the Markov game, and the online setting where one plays against an arbitrary opponent and aims to minimize regret. For both settings, we provide sample complexity and regret guarantees.

Our results generalize to Markov games with continuous and unbounded state spaces. Using linear function approximation, we develop an optimistic version of least-squares minimax value iteration, for which we provide performance guarantees that only depend on the linear dimension. This setting highlights the analytical challenges that arise due to the non-Lipschitzness of CCEs of general-sum games.

Based on https://arxiv.org/abs/2002.07066 .

Qiaomin Xie
Sat 3:00 p.m. - 3:12 p.m.
[ Visit Poster at Spot A0 in Virtual World ]   

Reinforcement learning is hard in general. Yet, in many specific environments, learning is easy. What makes learning easy in one environment, but difficult in another? We address this question by proposing a simple measure of reinforcement learning hardness called the bad-policy density. This quantity measures the fraction of the deterministic stationary policy space that is below a desired threshold in value. We prove that this simple quantity has many properties one would expect of a measure of learning hardness. Further, we prove it is NP-hard to compute the measure in general, but there are paths to polynomial-time approximation. We conclude by summarizing potential directions and uses for this measure.

Dave Abel, Cameron Allen, Dilip Arumugam, D Ellis Hershkowitz, Michael L. Littman, Lawson Wong
Sat 3:15 p.m. - 3:27 p.m.
[ Visit Poster at Spot B6 in Virtual World ]   

Real world applications such as economics and policy making often involve solving multi-agent games with two unique features: (1) The agents are inherently \emph{asymmetric} and partitioned into leaders and followers; (2) The agents have different reward functions, thus the game is \emph{general-sum}. The majority of existing results in this field focuses on either symmetric solution concepts (e.g. Nash equilibrium) or zero-sum games. It remains vastly open how to learn the \emph{Stackelberg equilibrium}---an asymmetric analog of the Nash equilibrium---in general-sum games efficiently from samples.

This paper initiates the theoretical study of sample-efficient learning of the Stackelberg equilibrium, in the bandit feedback setting where we only observe noisy samples of the reward. We consider three representative two-player general-sum games: bandit games, bandit-reinforcement learning (bandit-RL) games, and linear bandit games. In all these games, we identify a fundamental gap between the exact value of the Stackelberg equilibrium and its estimated version using finitely many noisy samples, which can not be closed information-theoretically regardless of the algorithm. We then establish sharp positive results on sample-efficient learning of Stackelberg equilibrium with value optimal up to the gap identified above, with matching lower bounds in the dependency on the gap, error tolerance, and the size of the action spaces. Overall, our results unveil unique challenges in learning Stackelberg equilibria under noisy bandit feedback, which we hope could shed light on future research on this topic.

Yu Bai, Chi Jin, Huan Wang, Caiming Xiong
Sat 3:30 p.m. - 3:42 p.m.
[ Visit Poster at Spot A5 in Virtual World ]   
The multi-armed bandit (MAB) problem is an active learning framework that aims to select the best among a set of actions by sequentially observing rewards. Recently, it has become popular for a number of applications over wireless networks, where communication constraints can form a bottleneck. Yet existing works usually fail to address this issue and can become infeasible in certain applications. In this paper, we propose QuBan, a generic reward quantization algorithm that applies to any (no-regret) multi-armed bandit algorithm. The modified algorithm requires on average a few (as low as $3$) bits to be sent per iteration, yet preserving the same regret as the original algorithm. Our upper bounds apply under mild assumptions on the reward distributions over all current (and future) MAB algorithms, including those used in contextual bandits. We also numerically evaluate the application of QuBan to widely used algorithms such as UCB and $\epsilon$-greedy.
Osama Hanna, Lin Yang, Christina Fragouli
Sat 3:45 p.m. - 3:57 p.m.
[ Visit Poster at Spot A5 in Virtual World ]   
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and meanwhile avoids violation of certain constraints on a number of expected total costs. In general, such SRL problems have nonconvex objective functions subject to multiple nonconvex constraints, and hence are very challenging to solve, particularly to provide a globally optimal policy. Many popular SRL algorithms adopt a primal-dual structure which utilizes the updating of dual variables for satisfying the constraints. In contrast, we propose a primal approach, called constraint-rectified policy optimization (CRPO), which updates the policy alternatingly between objective improvement and constraint satisfaction. CRPO provides a primal-type algorithmic framework to solve SRL problems, where each policy update can take any variant of policy optimization step. To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an $\mathcal{O}(1/\sqrt{T})$ convergence rate to the global optimal policy in the constrained policy set and an $\mathcal{O}(1/\sqrt{T})$ error bound on constraint satisfaction. This is the first finite-time analysis of primal SRL algorithms with global optimality guarantee. Our empirical results demonstrate that CRPO can outperform the existing primal-dual baseline algorithms significantly.
Tengyu Xu, Yingbin LIANG, Guanghui Lan
Sat 4:00 p.m. - 4:25 p.m.
 link »   

Empirical likelihood (EL) is a statistical method that uses a nonparametric likelihood function. It allows one to construct confidence intervals without having to specify that the data come from some parametric distribution such as a Gaussian. Operationally, EL involves a strategic reweighting of observed data to attain its goals. This makes it similar to importance sampling and self normalized importance sampling, both widely used for off policy evaluation in reinforcement learning (RL). Recently EL has been used in off policy evaluation and in distributionally robust inference. This talk gives the basic motivation and some results about EL thought to be useful for RL: (a) EL inferences can be as or more powerful than their corresponding parametric counterparts, depending on how one keeps score. (b) There is a natural way to incorporate sampling bias via reweighting. (c) One can exploit side knowledge expresed as some known expected values. (d) An empirical likelihood can be paired with a prior distribution to get Bayesian inferences on quantities of interest without having to choose a parameteric distribution.

Sat 4:30 p.m. - 5:00 p.m.
Panel Session: Animashree Anandkumar, Christian Kroer, Art Owen, Qiaomin Xie (Discussion Panel)   
Sat 5:00 p.m. - 5:30 p.m.
Social Session (Discussion & Chat)
Sat 5:30 p.m. - 9:00 p.m.
Poster Session - II (Poster Session)
-
[ Visit Poster at Spot A0 in Virtual World ]
Regularized Markov Decision processes (MDPs) serve as a smooth version of ordinary MDPs to encourage exploration. Given a regularized MDP, however, the optimal policy is often biased when evaluating the value function. Rather than making the coefficient $\lambda$ of regularized term sufficiently small, we propose a scheme by reducing $\lambda$ to approximate the optimal policy of the original MDP. We prove that the iteration complexity to obtain an $\varepsilon$-optimal policy could be maintained or even reduced in comparison with setting a sufficiently small $\lambda$ in both dynamic programming and policy gradient methods. In addition, there exists a strong duality connection between the reduction method and solving the original MDP directly, from which we can derive more adaptive reduction methods for certain reinforcement learning algorithms.
Wenhao Yang, Xiang Li, Guangzeng Xie, Zhihua Zhang
-
[ Visit Poster at Spot A1 in Virtual World ]
The focus of this paper is on sample complexity guarantees of average-reward reinforcement learning algorithms, which are known to be more challenging to study than their discounted-reward counterparts. To the best of our knowledge, we provide the first known finite sample guarantees using both constant and diminishing step sizes of (i) average-reward TD($\lambda$) with linear function approximation for policy evaluation and (ii) average-reward $Q$-learning in the tabular setting to find an optimal policy. A major challenge is that since the value functions are agnostic to an additive constant, the corresponding Bellman operators are no longer contraction mappings under any norm. We obtain the results for TD($\lambda$) by working in an appropriately defined subspace that ensures uniqueness of the solution. For $Q$-learning, we exploit the span seminorm contractive property of the Bellman operator and construct a novel Lyapunov function obtained by infimal convolution of the generalized Moreau envelope and the indicator function of a set.
Sheng Zhang, Zhe Zhang, Siva Maguluri
-
[ Visit Poster at Spot A2 in Virtual World ]

We study the statistical theory of offline reinforcement learning (RL) with deep ReLU network function approximation. We analyze a variant of fitted-Q iteration (FQI) algorithm under a new dynamic condition that we call Besov dynamic closure, which encompasses the conditions from prior analyses for deep neural network function approximation. Under Besov dynamic closure, we prove that the FQI-type algorithm enjoys an improved sample complexity than the prior results. Importantly, our sample complexity is obtained under the new general dynamic condition and a data-dependent structure where the latter is either ignored in prior algorithms or improperly handled by prior analyses. This is the first comprehensive analysis for offline RL with deep ReLU network function approximation under a general setting.

Thanh Nguyen-Tang, Sunil Gupta, Hung Tran-The, Svetha Venkatesh
-
[ Visit Poster at Spot A3 in Virtual World ]
This paper presents the first {\em model-free}, {\em simulator-free} reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation. The algorithm is named {\em Triple-Q} because it has three key components: a Q-function (also called action-value function) for the cumulative reward, a Q-function for the cumulative utility for the constraint, and a virtual-Queue that (over)-estimates the cumulative constraint violation. Under Triple-Q, at each step, an action is chosen based on the pseudo-Q-value that is a combination of the three ``Q'' values. The algorithm updates the reward and utility Q-values with learning rates that depend on the visit counts to the corresponding (state, action) pairs and are periodically reset. In the episodic CMDP setting, Triple-Q achieves $\tilde{\cal O}\left(\frac{1 }{\delta}H^4 S^{\frac{1}{2}}A^{\frac{1}{2}}K^{\frac{4}{5}} \right)$ regret, where $K$ is the total number of episodes, $H$ is the number of steps in each episode, $S$ is the number of states, $A$ is the number of actions, and $\delta$ is Slater's constant. Furthermore, {Triple-Q} guarantees zero constraint violation when $K$ is sufficiently large. Finally, the computational complexity of {Triple-Q} is similar to SARSA for unconstrained MDPs, and is computationally efficient.
Honghao Wei, Xin Liu, Lei Ying
-
[ Visit Poster at Spot A0 in Virtual World ]

Importance Sampling (IS) is a widely used building block for a large variety of off-policy estimation and learning algorithms. However, empirical and theoretical studies have progressively shown that vanilla IS leads to poor estimations whenever the behavioral and target policies are too dissimilar. In this paper, we analyze the theoretical properties of the IS estimator by deriving a probabilistic deviation lower bound that formalizes the intuition behind its undesired behavior. Then, we propose a class of IS transformations, based on the notion of power mean, that are able, under certain circumstances, to achieve a subgaussian concentration rate. Differently from existing methods, like weight truncation, our estimator preserves the differentiability in the target distribution.

Alberto Maria Metelli, Alessio Russo, Marcello Restelli
-
[ Visit Poster at Spot A0 in Virtual World ]
We study the Stochastic Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost. In the learning formulation of the problem, the agent has no prior knowledge about the costs and dynamics of the model. She repeatedly interacts with the model for $K$ episodes, and has to minimize her regret. In this work we show that the minimax regret for this setting is $\widetilde O(\sqrt{ (B_\star^2 + B_\star) |S| |A| K})$ where $B_\star$ is a bound on the expected cost of the optimal policy from any state, $S$ is the state space, and $A$ is the action space. This matches the $\Omega (\sqrt{ B_\star^2 |S| |A| K})$ lower bound of Rosenberg et al. (2020) for $B_\star \ge 1$, and improves their regret bound by a factor of $\sqrt{|S|}$. For $B_\star < 1$ we prove a matching lower bound of $\Omega (\sqrt{ B_\star |S| |A| K})$. Our algorithm is based on a novel reduction from SSP to finite-horizon MDPs. To that end, we provide an algorithm for the finite-horizon setting whose leading term in the regret depends polynomially on the expected cost of the optimal policy and only logarithmically on the horizon.
Alon Cohen, Yonathan Efroni, Yishay Mansour, Aviv Rosenberg
-
[ Visit Poster at Spot A4 in Virtual World ]

The absence of collision information in Multi-player Multi-armed bandits (MMABs) renders arm availabilities partially observable, impeding the design of algorithms with regret guarantees that do not allow inter-player communication. In this work, we propose a collision resolution (CR) mechanism for MMABs inspired from sequential interference mechanisms employed in communication protocols. In the general case, our collision resolution mechanism assumes that players can pull multiple arms during the exploration phase. We, thus, propose a novel MMAB model that captures this while still considering strictly bandit feedback and single-pulls during the exploitation phase. We theoretically analyze the CR mechanism using tools from information theory in order to prove the existence of an upper bound on the probability of its failure that decreases at a rate exponential in the number of players.

Eleni Nisioti, Nikolaos Thomos, Boris Bellalta, Anders Jonsson
-
[ Visit Poster at Spot A1 in Virtual World ]

In this work, we propose marginalized operators, a new class of off-policy evaluation operators for reinforcement learning. Marginalized operators strictly generalize generic multi-step operators, such as Retrace, as special cases. Marginalized operators also suggest a form of sample-based estimates with potential variance reduction, compared to sample-based estimates of the original multi-step operators. We show that the estimates for marginalized operators can be computed in a scalable way, which also generalizes prior results on marginalized importance sampling as special cases.

Yunhao Tang, Mark Rowland, Remi Munos, Michal Valko
-
[ Visit Poster at Spot A2 in Virtual World ]

Prior successes in offline learning are highlighted by its adaptability to novel scenarios. One of the key reasons behind this aspect is conservatism, the act of underestimating an agent’s expected value estimates. Recent work, on the other hand, has noted that overconservatism often cripples learning of meaningful behaviors. To that end, the paper asks the question when does overconservatism hurt offline learning? The proposed answer understands conservatism in light of conjugate space and empirical instabilities. In the case of former, agents implicitly aim at learning complex high entropic distributions. As for the latter, overconservatism arises as a consequence of provably inaccurate approximations. Based on theoretical evidence, we address overconservatism through the lens of dynamic control. A feedback controller tunes the learned value estimates by virtue of direct dynamics in the compact latent space. We validate our theoretical insights in an empirical study of aerial control tasks.

Karush Suri, Florian Shkurti
-
[ Visit Poster at Spot A4 in Virtual World ]
We consider reinforcement learning (RL) in episodic Markov decision processes (MDPs) with linear function approximation under drifting environment. Specifically, both the reward and state transition functions can evolve over time, constrained so their total variations do not exceed a given \textit{variation budget}. We first develop $\texttt{LSVI-UCB-Restart}$ algorithm, an optimistic modification of least-squares value iteration combined with periodic restart, and bound its dynamic regret when variation budgets are known. We then propose a parameter-free algorithm that works without knowing the variation budgets, \texttt{Ada-LSVI-UCB-Restart}, but with a slightly worse dynamic regret bound. We also derive the first minimax dynamic regret lower bound for nonstationary linear MDPs to show our proposed algorithms are near-optimal. As a byproduct, we establish a minimax regret lower bound for linear MDPs, which was unsolved by~\citet{jin2020provably}.
Huozhi Zhou, Jinglin Chen, Lav Varshney, Ashish Jagmohan
-
[ Visit Poster at Spot A6 in Virtual World ]
Policy evaluation (including multi-step off-policy importance sampling) has the interpretation of solving a generalized Bellman equation. In this paper, we derive finite-sample bounds for any general off-policy TD-like stochastic approximation algorithm that solves for the fixed point of this generalized Bellman operator. Our key step is to show that the generalized Bellman operator is simultaneously a contraction mapping with respect to a weighted $\ell_p$-norm for each $p$ in $[1,\infty)$, with a common contraction factor. Our results immediately imply finite-sample bounds of variants of off-policy TD-learning algorithms in the literature (e.g. $Q^\pi(\lambda)$, Tree-Backup, Retrace, and $Q$-trace).
Zaiwei Chen, Siva Maguluri, Sanjay Shakkottai, Karthikeyan Shanmugam
-
[ Visit Poster at Spot C6 in Virtual World ]

Policy-based model-free reinforcement learning (RL) methods have shown great promise for continuous control applications. However, their performances on risk-sensitive/robust control tasks have not been fully understood, which has been generally considered to be one important open problem in the seminal work (Fazel et al., 2018). We make a step toward addressing this open problem, by providing the first sample complexity results for policy gradient (PG) methods in two fundamental risk-sensitive/robust control settings: the linear exponential quadratic Gaussian, and the linear-quadratic (LQ) disturbance attenuation problems. The optimization landscapes for these problems are by nature more challenging than that of the LQ regulator problem, due to lack of coercivity of their objective functions. To overcome this challenge, we obtain the first \emph{implicit regularization} results for model-free PG methods, certifying that the controller \emph{remains robust} during the learning process, which further lead to the sample complexity guarantees. As a by-product, our results also provide the first sample complexity of PG methods in two-player zero-sum LQ dynamic games, a baseline in multi-agent RL.

Kaiqing Zhang, Xiangyuan Zhang, Bin Hu, Tamer Basar
-
[ Visit Poster at Spot A1 in Virtual World ]

Agents trained by reinforcement learning (RL) often fail to generalize beyond the environment they were trained in, even when presented with new scenarios that seem similar to the training environment. We study the query complexity required to train RL agents that generalize to multiple environments. Intuitively, tractable generalization is only possible when the environments are similar or close in some sense. To capture this, we introduce Weak Proximity, a natural structural condition that requires the environments to have highly similar transition and reward functions and share a policy providing optimal value. Despite such shared structure, we prove that tractable generalization is impossible in the worst case. This holds even when each individual environment can be efficiently solved to obtain an optimal linear policy, and when the agent possesses a generative model. Our lower bound applies to the more complex task of representation learning for efficient generalization to multiple environments. On the positive side, we introduce Strong Proximity, a strengthened condition which we prove is sufficient for efficient generalization.

Dhruv Malik, Yuanzhi Li, Pradeep Ravikumar
-
[ Visit Poster at Spot B0 in Virtual World ]
In this paper, we develop a novel variant of off-policy natural actor-critic algorithm with linear function approximation and we establish a sample complexity of $\mathcal{O}(\epsilon^{-3})$, outperforming all the previously known convergence bounds of such algorithms. In order to overcome the divergence due to deadly triad in off-policy policy evaluation under function approximation, we develop a critic that employs $n$-step TD-learning algorithm with a properly chosen $n$. We derive our sample complexity bounds solely based on the assumption that the behavior policy sufficiently explores all the states and actions, which is a much lighter assumption compared to the related literature.
Zaiwei Chen, sajad khodadadian, Siva Maguluri
-
[ Visit Poster at Spot C3 in Virtual World ]

In the maximum state entropy exploration framework, an agent interacts with a reward-free environment to learn a policy that maximizes the entropy of the expected state visitations it is inducing. Hazan et al. (2019) noted that the class of Markovian stochastic policies is sufficient for the maximum state entropy objective, and exploiting non-Markovianity is generally considered pointless in this setting. In this paper, we argue that non-Markovianity is instead paramount for maximum state entropy exploration in a finite-sample regime. Especially, we recast the objective to target the expected entropy of the induced state visitations in a single trial. Then, we show that the class of non-Markovian deterministic policies is sufficient for the introduced objective, while Markovian policies suffer non-zero regret in general. However, we prove that the problem of finding an optimal non-Markovian policy is NP-hard. Despite this negative result, we discuss avenues to address the problem in a tractable way in future works.

Mirco Mutti, Riccardo De Santi, Marcello Restelli
-
[ Visit Poster at Spot C4 in Virtual World ]

Potential games are arguably one of the most important and widely studied classes of normal form games. They define the archetypal setting of multi-agent coordination as all agent utilities are perfectly aligned with each other via a common potential function. Can this intuitive framework be transplanted in the setting of Markov Games? What are the similarities and differences between multi-agent coordination with and without state dependence? We present a novel definition of Markov Potential Games (MPG) that generalizes prior attempts at capturing complex stateful multi-agent coordination. Counter-intuitively, insights from normal-form potential games do not carry over as MPGs can consist of settings where state-games can be zero-sum games. In the opposite direction, Markov games where every state-game is a potential game are not necessarily MPGs. Nevertheless, MPGs showcase standard desirable properties such as the existence of deterministic Nash policies. In our main technical result, we prove fast convergence of independent policy gradient to Nash policies by adapting recent gradient dominance property arguments developed for single agent MDPs to multi-agent learning settings.

Stefanos Leonardos, Will Overman, Ioannis Panageas, Georgios Piliouras
-
[ Visit Poster at Spot A2 in Virtual World ]

The reward function is widely accepted as a succinct, robust, and transferable representation of a task. Typical approaches, at the basis of Inverse Reinforcement Learning (IRL), leverage expert demonstrations to recover a reward function. In this paper, we study the theoretical properties of the class of reward functions that are compatible with the expert's behavior. We analyze how the limited knowledge of the expert's policy and the environment affects the reward reconstruction phase. Then, we examine how the error propagates to the learned policy's performance when transferring the reward function to a different environment. We employ these findings to devise a provably efficient active sampling approach, aware of the need for transferring the reward function that can be paired with a large variety of IRL algorithms.

Giorgia Ramponi, Alberto Maria Metelli, Marcello Restelli
-
[ Visit Poster at Spot A3 in Virtual World ]

We consider a reinforcement learning (RL) framework where the RL agent learns to adjust the accuracy of the observations alongside learning to perform the original task. We are interested in revealing the information structure of the observation space illustrating which type of observations are the most important (such as position versus velocity) and the dependence of this on the state of agent (such as at the bottom versus top of a hill). We approach this problem by associating a cost with collecting observations which increases with the accuracy. In contrast to the existing work that mostly focuses on sample efficiency during training, our focus is on the behaviour of the agent during the actual task. Our results quantify how the RL agent can learn to use the observation space efficiently and obtain satisfactory performance in the original task while collecting effectively smaller amount of data.

Mehmet Koseoglu, Ece Kunduracioglu, Ayca Ozcelikkale
-
[ Visit Poster at Spot B4 in Virtual World ]

Reinforcement learning (RL) is empirically successful in complex nonlinear Markov decision processes (MDPs) with continuous state spaces. By contrast, the majority of theoretical RL literature requires the MDP to satisfy some form of linear structure, in order to guarantee sample efficient RL. Such efforts typically assume the transition dynamics or value function of the MDP are described by linear functions of the state features. To resolve this discrepancy between theory and practice, we introduce the Effective Planning Window (EPW) condition, a structural condition on MDPs that makes \emph{no} linearity assumptions. We demonstrate that the EPW condition permits sample efficient RL, by providing an algorithm which provably solves MDPs satisfying this condition. Our algorithm requires minimal assumptions on the policy class, which can include multi-layer neural networks with nonlinear activation functions. Notably, the EPW condition is directly motivated by popular gaming benchmarks, and we show that many classic Atari games satisfy this condition. We additionally show the necessity of conditions like EPW, by demonstrating that simple MDPs with slight nonlinearities cannot be solved sample efficiently.

Dhruv Malik, Aldo Pacchiano, Vishwak Srinivasan, Yuanzhi Li
-
[ Visit Poster at Spot D0 in Virtual World ]

Actor-critic methods have been successfully applied to several high dimensional continuous control tasks. Despite their success, they are prone to overestimation bias that leads to sub-optimal policies and divergent behaviour. Algorithms like TD3 and Soft Actor Critic (SAC) address overestimation bias by employing twin Q functions and optimizing the policy with respect to the lower bound of the two Q functions. Although this resolves the issue of overestimation bias, it inadvertently introduces underestimation bias. Underestimation bias, though not as problematic as overestimation bias, affects the asymptotic performance of the algorithms. To address both overestimation and underestimation bias in actor critic methods, we propose Bagged Critic for Continuous Control (BC3). BC3 uses an ensemble of independently trained Q functions as critic to address estimation biases. We present theoretical bounds on the biases and convergence analysis of our method demonstrating its benefits. The empirical results on several challenging reinforcement learning benchmarks substantiate our theoretical analysis and demonstrate reduction in biases with overall more robust policies.

Payal Bawa
-
[ Visit Poster at Spot B5 in Virtual World ]

We study the role of the representation in finite-horizon Markov Decision Processes (MDPs) with linear structure. We provide a necessary condition for achieving constant regret in any MDP with linear reward representation (even with known dynamics). This result encompasses the well-known scenario of low-rank MDPs and, more generally, zero inherent Bellman error. We demonstrate that this condition is not only necessary but also sufficient for these classes, by deriving a constant regret bound for two optimistic algorithms. As far as we know, this is the first constant regret result for MDPs. Finally, we study the problem of representation selection showing that our proposed algorithm achieves constant regret when one of the given representations is "good". Furthermore, our algorithm can combine representations and achieve constant regret also when none of the representations would.

Matteo Papini, Andrea Tirinzoni, Aldo Pacchiano, Marcello Restelli, Alessandro Lazaric, Matteo Pirotta
-
[ Visit Poster at Spot B3 in Virtual World ]
We derive a novel asymptotic problem-dependent lower-bound for regret minimization in finite-horizon tabular Markov Decision Processes (MDPs). While, similar to prior work (e.g., for ergodic MDPs), the lower-bound is the solution to an optimization problem, our derivation reveals the need for an additional constraint on the visitation distribution over state-action pairs that explicitly accounts for the dynamics of the MDP. We provide a characterization of our lower-bound through a series of examples illustrating how different MDPs may have significantly different complexity. 1) We first consider a ``difficult'' MDP instance, where the novel constraint based on the dynamics leads to a larger lower-bound (i.e., a larger regret) compared to the classical analysis. 2) We then show that our lower-bound recovers results previously derived for specific MDP instances. 3) Finally, we show that, in certain ``simple'' MDPs, the lower bound is considerably smaller than in the general case and it does not scale with the minimum action gap at all. We show that this last result is attainable (up to $poly(H)$ terms) by providing a regret upper-bound based on policy gap for an optimistic algorithm.
Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric
-
[ Visit Poster at Spot B4 in Virtual World ]

Linear fixed point equations in Hilbert spaces naturally arise from the policy evaluation problem in reinforcement learning. We study methods that use a collection of random observations to compute approximate solutions by searching over a known low-dimensional subspace of the Hilbert space. First, we prove an instance-dependent upper bound on the mean-squared error for a linear stochastic approximation scheme that exploits Polyak--Ruppert averaging. This bound consists of two terms: an approximation error term with an instance-dependent approximation factor, and a statistical error term that captures the instance-specific complexity of the noise when projected onto the low-dimensional subspace. Using information-theoretic methods, we also establish lower bounds showing that the approximation factor cannot be improved, again in an instance-dependent sense. A concrete consequence of our characterization is that the optimal approximation factor in this problem can be much larger than a universal constant. We show how our results precisely characterize the error of a class of temporal difference learning methods for the policy evaluation problem with linear function approximation, establishing their optimality.

Wenlong Mou, Ashwin Pananjady, Martin Wainwright
-
[ Visit Poster at Spot A4 in Virtual World ]

Optimism in the face of uncertainty is a principled approach for provably efficient exploration for reinforcement learning in tabular and linear settings. However, such an approach is challenging in developing practical exploration algorithms for Deep Reinforcement Learning (DRL). To address this problem, we propose an Optimistic Exploration algorithm with Backward Bootstrapped Bonus (OEB3) for DRL. We construct an UCB-bonus indicating the uncertainty of Q-functions. The UCB-bonus is further utilized to estimate an optimistic Q-value, which encourages the agent to explore the scarcely visited states and actions to reduce uncertainty. In the estimation of Q-function, we adopt an episodic backward update strategy to propagate the future uncertainty to the estimated Q-function consistently. Experiments show that OEB3 outperforms several state-of-the-art exploration approaches 49 Atari games.

Chenjia Bai, Lingxiao Wang, Lei Han, Jianye Hao, Animesh Garg, Peng Liu, Zhaoran Wang
-
[ Visit Poster at Spot B1 in Virtual World ]

Reward-Weighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the Expectation-Maximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a return-weighted log-likelihood of actions. Although RWR is known to yield monotonic improvement of the policy under certain circumstances, whether and under which conditions RWR converges to the optimal policy have remained open questions. In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used.

Francesco Faccio, Rupesh Kumar Srivastava, Jürgen Schmidhuber
-
[ Visit Poster at Spot B5 in Virtual World ]

In batch reinforcement learning, there can be poorly explored state-action pairs resulting in poorly learned, inaccurate models and poorly performing associated policies. Various regularization methods can mitigate the problem of learning overly-complex models in Markov decision processes (MDPs), however they operate in technically and intuitively distinct ways and lack a common form in which to compare them. This paper unifies three regularization methods in a common framework-- a weighted average transition matrix. Considering regularization methods in this common form illuminates how the MDP structure and the state-action pair distribution of the batch data set influence the relative performance of regularization methods. We confirm intuitions generated from the common framework by empirical evaluation across a range of MDPs and data collection policies.

Sarah Rathnam
-
[ Visit Poster at Spot B3 in Virtual World ]

We study regret minimization in non-episodic factored Markov decision processes (FMDPs), where all existing algorithms make the strong assumption that the factored structure of the FMDP is known to the learner in advance. In this paper, we provide the first algorithm that learns the structure of the FMDP while minimizing the regret. Our algorithm is based on the optimism in face of uncertainty principle, combined with a simple statistical method for structure learning, and can be implemented efficiently given oracle-access to an FMDP planner. Moreover, we give a variant of our algorithm that remains efficient even when the oracle is limited to non-factored actions, which is the case with almost all existing approximate planners. Finally, we leverage our techniques to prove a novel lower bound for the known structure case, closing the gap to the regret bound of Chen et al. [2021].

Aviv Rosenberg, Yishay Mansour
-
[ Visit Poster at Spot A5 in Virtual World ]
Reinforcement learning typically assumes that the agent observes feedback for its actions immediately, but in many real-world applications (like recommendation systems) the feedback is observed in delay. In this paper, we study online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs and unrestricted delayed feedback. That is, the costs and trajectory of episode $k$ are revealed to the learner only in the end of episode $k + d^k$, where the delays $d^k$ are neither identical nor bounded, and are chosen by an oblivious adversary. We present novel algorithms based on policy optimization that achieve near-optimal high-probability regret of $\sqrt{K + D}$ under full-information feedback, where $K$ is the number of episodes and $D = \sum_{k} d^k$ is the total delay. Under bandit feedback, we prove similar $\sqrt{K + D}$ regret assuming the costs are stochastic, and $(K + D)^{2/3}$ regret in the general case. We are the first to consider regret minimization in the important setting of MDPs with delayed feedback.
Tal Lancewicki, Aviv Rosenberg, Yishay Mansour
-
[ Visit Poster at Spot A6 in Virtual World ]

Generalization is a central challenge for the deployment of reinforcement learning (RL) systems in the real world. In this paper, we show that the sequential structure of the RL problem necessitates new approaches to generalization beyond the well-studied techniques used in supervised learning. While supervised learning methods can generalize effectively without explicitly accounting for epistemic uncertainty, we describe why appropriate uncertainty handling can actually be essential in RL. We show that generalization to unseen test conditions from a limited number of training conditions induces a kind of implicit partial observability, effectively turning even fully-observed MDPs into POMDPs. Informed by this observation, we recast the problem of generalization in RL as solving the induced partially observed Markov decision process, which we call the epistemic POMDP. We demonstrate the failure modes of algorithms that do not appropriately handle this partial observability, and suggest a simple ensemble-based technique for approximately solving the partially observed problem. Empirically, we demonstrate that our simple algorithm derived from the epistemic POMDP achieves significant gains in generalization over current methods on the Procgen benchmark suite.

Dibya Ghosh, Jad Rahme, Aviral Kumar, Amy Zhang, Ryan P. Adams, Sergey Levine
-
[ Visit Poster at Spot B0 in Virtual World ]

Bandit algorithms are increasingly used in real-world sequential decision-making problems. Associated with this is an increased desire to be able to use the resulting datasets to answer scientific questions like: Did one type of ad lead to more purchases? In which contexts is a mobile health intervention effective? However, classical statistical approaches fail to provide valid confidence intervals when used with data collected with bandit algorithms. Alternative methods have recently been developed for simple models (e.g., comparison of means). Yet there is a lack of general methods for conducting statistical inference using more complex models on data collected with (contextual) bandit algorithms; for example, current methods cannot be used for valid inference on parameters in a logistic regression model for a binary reward. In this work, we develop theory justifying the use of M-estimators---which includes estimators based on empirical risk minimization as well as maximum likelihood---on data collected with adaptive algorithms, including (contextual) bandit algorithms. Specifically, we show that M-estimators, modified with particular adaptive weights, can be used to construct asymptotically valid confidence regions for a variety of inferential targets.

Kelly Zhang, Lucas Janson, Susan Murphy
-
[ Visit Poster at Spot C2 in Virtual World ]
Policy Optimization (PO) methods with function approximation are one of the most popular classes of Reinforcement Learning (RL) algorithms. However, designing provably efficient policy optimization algorithms remains a challenge. Recent work in this area has focused on incorporating upper confidence bound (UCB)-style bonuses to drive exploration in policy optimization. In this paper, we present Randomized Least Squares Policy Optimization (RLSPO) which is inspired by Thompson Sampling. We prove that, in an episodic linear kernel MDP setting, RLSPO achieves $\Tilde{\mathcal{O}}(d^{3/2} H^{3/2} \sqrt{T})$ worst-case (frequentist) regret, where $H$ is the number of episodes, $T$ is the total number of steps and $d$ is the feature dimension. Finally, we evaluate RLSPO empirically and show that it is competitive with existing provably efficient PO algorithms.
Haque Ishfaq, Zhuoran Yang, Andrei Lupu, Viet Nguyen, Lewis Liu, Riashat Islam, Zhaoran Wang, Doina Precup
-
[ Visit Poster at Spot C1 in Virtual World ]
For the problem of task-agnostic reinforcement learning (RL), an agent first collects samples from an unknown environment without the supervision of reward signals, then is revealed with a reward and is asked to compute a corresponding near-optimal policy. Existing approaches mainly concern the worst-case scenarios, in which no structural information of the reward/transition-dynamics is utilized. Therefore the best sample upper bound is $\propto\widetilde{\mathcal{O}}(1/\epsilon^2)$, where $\epsilon>0$ is the target accuracy of the obtained policy, and can be overly pessimistic. To tackle this issue, we provide an efficient algorithm that utilizes a gap parameter, $\rho>0$, to reduce the amount of exploration. In particular, for an unknown finite-horizon Markov decision process, the algorithm takes only $\widetilde{\mathcal{O}} (1/\epsilon \cdot (H^3SA / \rho + H^4 S^2 A) )$ episodes of exploration, and is able to obtain an $\epsilon$-optimal policy for a post-revealed reward with sub-optimality gap at least $\rho$, where $S$ is the number of states, $A$ is the number of actions, and $H$ is the length of the horizon, obtaining a nearly \emph{quadratic saving} in terms of $\epsilon$. We show that, information-theoretically, this bound is nearly tight for $\rho < \Theta(1/(HS))$ and $H>1$. We further show that $\propto\widetilde{\mathcal{O}}(1)$ sample bound is possible for $H=1$ (i.e., multi-armed bandit) or with a sampling simulator, establishing a stark separation between those settings and the RL setting.
Jingfeng Wu, Vladimir Braverman, Lin Yang
-
[ Visit Poster at Spot A1 in Virtual World ]
We consider the problem of online reinforcement learning for the Stochastic Shortest Path (SSP) problem modeled as an unknown MDP with an absorbing state. We propose PSRL-SSP, a simple posterior sampling-based reinforcement learning algorithm for the SSP problem. The algorithm operates in epochs. At the beginning of each epoch, a sample is drawn from the posterior distribution on the unknown model dynamics, and the optimal policy with respect to the drawn sample is followed during that epoch. An epoch completes if either the number of visits to the goal state in the current epoch exceeds that of the previous epoch, or the number of visits to any of the state-action pairs is doubled. We establish a Bayesian regret bound of $O(B_* S\sqrt{AK})$, where $B_*$ is an upper bound on the expected cost of the optimal policy, $S$ is the size of the state space, $A$ is the size of the action space, and $K$ is the number of episodes. The algorithm only requires the knowledge of the prior distribution, and has no hyper-parameters to tune. It is the first such posterior sampling algorithm and outperforms numerically previously proposed optimism-based algorithms.
Mehdi Jafarnia, Liyu Chen, Rahul Jain, Haipeng Luo
-
[ Visit Poster at Spot A6 in Virtual World ]

This paper studies model-based bandit and reinforcement learning (RL) with nonlinear function approximations. We propose to study convergence to approximate local maxima because we show that global convergence is statistically intractable even for one-layer neural net bandit with a deterministic reward. For both nonlinear bandit and RL, the paper presents a model-based algorithm, Virtual Ascent with Online Model Learner (ViOlin), which provably converges to a local maximum with sample complexity that only depends on the sequential Rademacher complexity of the model class. Our bounds imply novel results on several concrete settings such as linear bandit with finite model class or sparse models, and two-layer neural net bandit. A key algorithmic insight is that optimism may lead to overexploration even for one-layer neural net model class. On the other hand, for convergence to local maxima, it suffices to maximize the virtual return if the model can also predict the size of the gradient and Hessian of the return.

Kefan Dong, Jiaqi Yang, Tengyu Ma
-
[ Visit Poster at Spot B2 in Virtual World ]

Natural policy gradient (NPG) methods with function approximation achieve impressive empirical success in reinforcement learning problems with large state-action spaces. However, theoretical understanding of their convergence behaviors remains limited in the function approximation setting. In this paper, we perform a finite-time analysis of NPG with linear function approximation and softmax parameterization, and prove for the first time that widely used entropy regularization method, which encourages exploration, leads to linear convergence rate. We adopt a Lyapunov drift analysis to prove the convergence results and explain the effectiveness of entropy regularization in improving the convergence rates.

Semih Cayci, Niao He, R Srikant
-
[ Visit Poster at Spot C3 in Virtual World ]

We study multi-agent reinforcement learning (MARL) in infinite-horizon discounted zero-sum Markov games. We focus on the practical but challenging setting of decentralized MARL, where agents make decisions without coordination by a centralized controller, but only based on their own payoffs and local actions executed. The agents need not observe the opponent’s actions or payoffs, possibly being even oblivious to the presence of the opponent, nor be aware of the zero-sum structure of the underlying game, a setting also referred to as radically uncoupled in the literature of learning in games. In this paper, we develop for the first time a radically uncoupled Q-learning dynamics that is both rational and convergent: the learning dynamics converges to the best response to the opponent’s strategy when the opponent follows an asymptotically stationary strategy; the value function estimates converge to the payoffs at a Nash equilibrium when both agents adopt the dynamics. The key challenge in this decentralized setting is the non-stationarity of the learning environment from an agent’s perspective, since both her own payoffs and the system evolution depend on the actions of other agents, and each agent adapts their policies simultaneously and independently. To address this issue, we develop a two-timescale learning dynamics where each agent updates her local Q-function and value function estimates concurrently, with the latter happening at a slower timescale.

Kaiqing Zhang, David Leslie, Tamer Basar, Asuman Ozdaglar
-
[ Visit Poster at Spot A4 in Virtual World ]

In this paper we propose a model-based offline reinforcement learning algorithm that explicitly handles model misspecification and distribution mismatch. Theoretically, we prove a safe policy improvement theorem by establishing pessimism approximations to the value function. Our algorithm can output the best policy in the given policy class with interpretable error terms measuring misspecification level, distribution mismatch, and statistical deviation. In addition, as long as the model family can approximate the transitions of state-action pairs visits by a policy, we can approximate the value of that particular policy. We visualize the effect of error terms in the LQR setting, and show that the experiment results match our theory.

Kefan Dong, Ramtin Keramati, Emma Brunskill
-
[ Visit Poster at Spot B4 in Virtual World ]
We study reinforcement learning in an infinite-horizon average-reward setting with linear function approximation, where the transition probability function of the underlying Markov Decision Process (MDP) admits a linear form over a feature mapping of the current state, action and next state. We propose a new algorithm UCRL2-VTR, which can been seen as an extension of UCRL2 algorithm with linear function approximation. We show that UCRL2-VTR with Bernstein-type bonus can achieve a regret of $\tilde{O}(d\sqrt{DT})$, where $d$ is the dimension of the feature mapping, $T$ is the horizon and $D$ is the diameter of the MDP. We also prove a matching lower bound $\tilde{\Omega}(d\sqrt{DT})$, which suggests that the proposed UCRL2-VTR is minimax optimal up to logarithmic factors. To the best of our knowledge, our algorithm is the first nearly minimax optimal RL algorithm with function approximation in the infinite-horizon average-reward setting.
Yue Wu, Dongruo Zhou, Quanquan Gu
-
[ Visit Poster at Spot B0 in Virtual World ]

We introduce the ``inverse bandit'' problem of estimating the rewards of a multi-armed bandit instance from observing the learning process of a low-regret demonstrator. Existing approaches to the related problem of inverse reinforcement learning assume the execution of an optimal policy, and thereby suffer from an identifiability issue. In contrast, our paradigm leverages the demonstrator's behavior en route to optimality, and in particular, the exploration phase, to obtain consistent reward estimates. We develop simple and efficient reward estimation procedures for demonstrations within a class of upper-confidence-based algorithms, showing that reward estimation gets progressively easier as the regret of the algorithm increases. We match these upper bounds with information-theoretic lower bounds that apply to any demonstrator algorithm, thereby characterizing the optimal tradeoff between exploration and reward estimation. Extensive simulations on both synthetic and semi-synthetic data corroborate our theoretical results.

Wenshuo Guo, Kumar Agrawal, Aditya Grover, Vidya Muthukumar, Ashwin Pananjady
-
[ Visit Poster at Spot B1 in Virtual World ]
Policy evaluation is an important instrument for the comparison of different algorithms in Reinforcement Learning (RL). Yet even a precise knowledge of the value function $V^{\pi}$ corresponding to a policy $\pi$ does not provide reliable information on how far is the policy $\pi$ from the optimal one. We present a novel model-free upper value iteration procedure ({\sf UVIP}) that allows us to estimate the suboptimality gap $V^{\star}(x) - V^{\pi}(x)$ from above and to construct confidence intervals for $V^\star$. Our approach relies on upper bounds to the solution of the Bellman optimality equation via martingale approach. We provide theoretical guarantees for {\sf UVIP} under general assumptions and illustrate its performance on a number of benchmark RL problems.
Denis Belomestny, Ilya Levin, Eric Moulines, Alexey Naumov, Sergey Samsonov, Veronika Zorina
-
[ Visit Poster at Spot B2 in Virtual World ]

Many real-world applications of reinforcement learning (RL) require the agent to deal with high-dimensional observations such as those generated from a megapixel camera. Prior work has addressed such problems with representation learning, through which the agent can provably extract endogenous, latent state information from raw observations and subsequently plan efficiently. However, such approaches can fail in the presence of structured noise that is temporally correlated, a phenomenon that is common in practice. We initiate the formal study of latent state discovery in the presence of such exogenous noise sources by proposing a new model, the Exogenous Block MDP, for rich observation RL. We establish structural results and provide new sample efficiency guarantees for learning in these models. For the latter, we show that the new DPCID variant of PCID, previously studied by~\citet{du2019provably}, learns a generalization of inverse dynamics and is provably sample and computationally efficient in Exogenous Block MDPs when the endogenous state dynamics are near deterministic. The sample complexity of DPCID depends polynomially on the size of a latent endogenous state space while not directly depending on the size of the observation space. We provide experiments on challenging exploration problems and show the promise of our approach.

Yonathan Efroni, Dipendra Misra, Akshay Krishnamurthy, Alekh Agarwal, John Langford
-
[ Visit Poster at Spot C2 in Virtual World ]
We study cooperative multi-agent reinforcement learning in episodic Markov games with $n$ agents. While the global state and action spaces typically grow exponentially in $n$ in this setting, in this paper, we introduce a novel framework that combines function approximation and a graphical dependence structure that restricts the decision space to $o(dn)$ for each agent, where $d$ is the ambient dimensionality of the problem. We present a multi-agent value iteration algorithm that, under mild assumptions, provably recovers the set of Pareto-optimal policies, with finite-sample guarantees on the incurred regret. Furthermore, we demonstrate that our algorithm is {\em no-regret} even when there are only $\cO(1)$ episodes with communication, providing a scalable and provably no-regret algorithm for multi-agent reinforcement learning with function approximation. Our work provides a tractable approach to multi-agent decision-making that is provably efficient and amenable to large-scale collaborative systems.
Abhimanyu Dubey, Alex `Sandy' Pentland
-
[ Visit Poster at Spot A5 in Virtual World ]
This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with model-based methods (for episodic MDP) and provides a unified framework towards optimal learning for several well-motivated offline tasks. We establish an $\Omega(H^2 S/d_m\epsilon^2)$ lower bound (over model-based family) for the global uniform OPE and our main result establishes an upper bound of $\tilde{O}(H^2/d_m\epsilon^2)$ for the \emph{local} uniform convergence. The highlight in achieving the optimal rate $\tilde{O}(H^2/d_m\epsilon^2)$ is our design of \emph{singleton absorbing MDP}, which is a new sharp analysis tool that works with the model-based approach. We generalize such a model-based framework to the new settings: offline task-agnostic and the offline reward-free with optimal complexity $\tilde{O}(H^2\log(K)/d_m\epsilon^2)$ ($K$ is the number of tasks) and $\tilde{O}(H^2S/d_m\epsilon^2)$ respectively. These results provide a unified solution for simultaneously solving different offline RL problems.
Ming Yin, Yu-Xiang Wang
-
[ Visit Poster at Spot B3 in Virtual World ]

In this paper, we study privacy in the context of finite-horizon Markov Decision Processes. Two notions of privacy have been investigated in this setting: joint differential privacy (JDP) and local differential privacy (LDP). We show that it is possible to achieve a smooth transition in terms of privacy and regret (i.e., utility) between JDP and LDP. By leveraging shuffling techniques, we present an algorithm that, depending on the provided parameter, is able to attain any privacy/utility value in between the pure JDP and LDP guarantee.

Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, Matteo Pirotta
-
[ Visit Poster at Spot B5 in Virtual World ]
We consider the problem of offline reinforcement learning (RL) --- a well-motivated setting of RL that aims at policy optimization using only historical data. In this paper, we propose \emph{Off-Policy Double Variance Reduction} (OPDVR), a new variance reduction based algorithm for offline RL. Our main result shows that OPDVR provably identifies an $\epsilon$-optimal policy with $\widetilde{O}(H^2/d_m\epsilon^2)$ episodes of offline data in the finite-horizon \emph{stationary transition} setting and this improves over the best known upper bound by a factor of $H$. Moreover, we establish an information-theoretic lower bound of $\Omega(H^2/d_m\epsilon^2)$ which certifies that OPDVR is optimal up to logarithmic factors. Lastly, we show that OPDVR also achieves rate-optimal sample complexity under alternative settings such as the finite-horizon MDPs with non-stationary transitions and the infinite horizon MDPs with discounted rewards.
Ming Yin, Yu Bai, Yu-Xiang Wang
-
[ Visit Poster at Spot B4 in Virtual World ]

The concept of utilizing multi-step returns for updating value functions has long been adopted in the reinforcement learning domain. Conventional methods such as TD-lambda further extend this concept and use a single target value equivalent to an exponential average of different step returns. Nevertheless, different backup lengths provide diverse advantages in terms of bias and variance of value estimates, convergence speeds, and learning behaviors of the agent. Integrating step returns into a single target sacrifices the advantages offered by different step return targets. In order to address this issue, we propose Mixture Bootstrapped DQN (MB-DQN) and employ different backup lengths for different bootstrapped heads. MB-DQN enables diversity of the target values that is unavailable in approaches relying only on a single target value. In this paper, we first comprehensively highlight the motivational insights through a simple maze environment. Then, in order to validate the effectiveness of MB-DQN, we perform experiments on various Atari 2600 benchmark environments, and demonstrate the performance improvement of MB-DQN over the baseline methods. Finally, we provide ablation analyses to verify MB-DQN in a set of analytical cases.

PoHan Chiang, Hsuan-Kung Yang, Zhang-Wei Hong, Chun-Yi Lee
-
[ Visit Poster at Spot B2 in Virtual World ]
Learning Markov decision processes (MDPs) in the presence of the adversary is a challenging problem in reinforcement learning (RL). In this paper, we study RL in episodic MDPs with adversarial reward and full information feedback, where the unknown transition probability function is a linear function of a given feature mapping, and the reward function can change arbitrarily episode by episode. We propose an optimistic policy optimization algorithm POWER and show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode, $T$ is the number of interaction with the MDP, and $d$ is the dimension of the feature mapping. Furthermore, we also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors. Our key technical contributions are two-fold: (1) a new value function estimator based on importance weighting; and (2) a tighter confidence set for the transition kernel. They together lead to the nearly minimax optimal regret.
Jiafan He, Dongruo Zhou, Quanquan Gu
-
[ Visit Poster at Spot B5 in Virtual World ]

Sample complexity and robustness are critical for applying reinforcement learning (RL) algorithms in real-world applications. We study the sample saving opportunities via transferring experience when the source domain is implemented by a simulator. Many real-world domains are well approximated by rich-observation models'' where the agent receives a high-dimensionalrich'' observation, which is however emitted from a compact latent state space. For such problems, designing simulators that can accurately model the emission process of the observations is challenging. In this paper, we address these issues by considering learning from abstract simulators that only model the latent state space and a deterministic approximation of the latent transition dynamics. We present a transfer RL algorithm POTAS that learns a policy robust to perturbation in the target domain, with a sample complexity that is independent of the size of state space (exploration-free), by leveraging an abstract simulator. We also present lower bounds showing that without the near-deterministic assumption, one cannot learn a robust policy from abstract simulators and also avoid dependence on the state space.

Yao Liu, Dipendra Misra, Miro Dudik, Robert Schapire
-
[ Visit Poster at Spot A3 in Virtual World ]
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to induce an optimistic SSP problem whose associated value iteration scheme is guaranteed to converge. We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_{\star} \sqrt{S A K})$, where $K$ is the number of episodes, $S$ is the number of states, $A$ is the number of actions, and $B_{\star}$ bounds the expected cumulative cost of the optimal policy from any state, thus closing the gap with the lower bound. Interestingly, EB-SSP obtains this result while being parameter-free, i.e., it does not require any prior knowledge of $B_{\star}$, nor of $T_{\star}$, which bounds the expected time-to-goal of the optimal policy from any state. We also show various cases (e.g., positive costs, or general costs when an order-accurate estimate of $T_{\star}$ is available) where the regret only contains a logarithmic dependence on $T_{\star}$, thus yielding the first (nearly) horizon-free regret bound beyond the finite-horizon MDP setting.
Jean Tarbouriech, Jean Tarbouriech, Simon Du, Matteo Pirotta, Michal Valko, Alessandro Lazaric
-
[ Visit Poster at Spot A6 in Virtual World ]

We consider the off-policy evaluation problem in POMDPs. Prior work on this problem uses a causal identification strategy based on one-step observable proxies of the hidden state (Tennenholtz et al., 2020a). In this work, we relax the assumptions made in the prior work by using spectral methods. We further relax these assumptions by extending one-step proxies into the past. Finally, we derive an importance sampling algorithm which assumes rank, distinctness, and positivity conditions on certain probability matrices, and not on sufficiency conditions of observable trajectories with respect to the reward and hidden state structure required in the prior work.

Yash Nair, Nan Jiang
-
[ Visit Poster at Spot B6 in Virtual World ]

A key challenge to deploying reinforcement learning in practice is exploring safely. We propose a natural safety property---uniformly outperforming a conservative policy (adaptively estimated from all data observed thus far), up to a per-episode exploration budget. This property formalizes the idea that we should spread out exploration to avoid taking actions significantly worse than the ones that are currently known to be good. We then design an algorithm that uses a UCB reinforcement learning policy for exploration, but overrides it as needed to ensure safety with high probability. To ensure exploration across the entire state space, it adaptively determines when to explore (at different points of time across different episodes) in a way that allows "stitching" sub-episodes together to obtain a meta-episode that is equivalent to using UCB for the entire episode. Then, we establish reasonable assumptions about the underlying MDP under which our algorithm is guaranteed to achieve sublinear regret while ensuring safety; under these assumptions, the cost of imposing safety is only a constant factor.

Wanqiao Xu, Kan Xu, Hamsa Bastani, Osbert Bastani
-
[ Visit Poster at Spot C0 in Virtual World ]

We explore the use of policy approximations to reduce the computational cost of learning Nash equilibria in zero-sum stochastic games. We propose a new Q-learning type algorithm that uses a sequence of entropy-regularized soft policies to approximate the Nash policy during the Q-function updates. We prove that under certain conditions, by updating the entropy regularization, the algorithm converges to a Nash equilibrium. Numerical simulations verify that the proposed algorithm converges to Nash with a major speed-up over existing algorithms.

Yue Guan, Qifan Zhang, Panagiotis Tsiotras
-
[ Visit Poster at Spot B6 in Virtual World ]

In the past decade, contextual bandit and reinforcement learning algorithms have been successfully used in various interactive learning systems such as online advertising, recommender systems, and dynamic pricing. However, they have yet to be widely adopted in high-stakes application domains, such as healthcare. One reason may be that existing approaches assume that the underlying mechanisms are static in the sense that they do not change over different environments. In many real world systems, however, the mechanisms are subject to shifts across environments which may invalidate the static environment assumption. In this paper, we tackle the problem of environmental shifts under the framework of offline contextual bandits. We view the environmental shift problem through the lens of causality and propose multi-environment contextual bandits that allow for changes in the underlying mechanisms. We adopt the concept of invariance from the causality literature and introduce the notion of policy invariance. We argue that policy invariance is only relevant if unobserved confounders are present and show that, in that case, an optimal invariant policy is guaranteed to generalize across environments under suitable assumptions. Our results do not only provide a solution to the environmental shift problem but also establish concrete connections among causality, invariance and contextual bandits.

Sorawit Saengkyongam, Nikolaj Thams, Jonas Peters, Niklas Pfister
-
[ Visit Poster at Spot C0 in Virtual World ]

We use functional mirror ascent to propose a general framework (referred to as FMA-PG) for designing policy gradient methods. The functional perspective distinguishes between a policy's functional representation (what are its sufficient statistics) and its parameterization (how are these statistics represented), and naturally results in computationally efficient off-policy updates. For simple policy parameterizations, the FMA-PG framework ensures that the optimal policy is a fixed point of the updates. It also allows us to handle complex policy parameterizations (e.g., neural networks) while guaranteeing policy improvement. Our framework unifies several PG methods and opens the way for designing sample-efficient variants of existing methods. Moreover, it recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) in a principled way. With a softmax functional representation, FMA-PG results in a variant of TRPO with additional desirable properties. It also suggests an improved variant of PPO, whose robustness we empirically demonstrate on MuJoCo.

Sharan Vaswani, Olivier Bachem, Simone Totaro, Matthieu Geist, Marlos C. Machado, Pablo Samuel Castro, Nicolas Le Roux
-
[ Visit Poster at Spot C1 in Virtual World ]
This paper initiates the theoretical study of \emph{policy finetuning} towards bridging the gap between (sample-efficient) online RL and offline RL. In this problem, the online RL learner has additional access to a ``reference policy'' $\mu$ close to the optimal policy $\pi_\star$ in a certain sense. We first design a sharp \emph{offline reduction} algorithm---which simply executes $\mu$ and runs offline policy optimization on the collected dataset---that finds an $\varepsilon$ near-optimal policy within $\widetilde{O}(H^3SC^\star/\varepsilon^2)$ episodes, where $C^\star$ is the single-policy concentrability coefficient between $\mu$ and $\pi_\star$. This offline result is the first that matches the sample complexity lower bound in this setting, and resolves a recent open question in offline RL. We then establish an $\Omega(H^3S\min\{C^\star, A\}/\varepsilon^2)$ sample complexity lower bound for \emph{any} policy finetuning algorithm, including those that can adaptively explore the environment. This implies that---perhaps surprisingly---the optimal policy finetuning algorithm is either offline reduction or a purely online RL algorithm that does not use $\mu$. Finally, we design a new hybrid offline/online algorithm for policy finetuning that achieves better sample complexity than both vanilla offline reduction and purely online RL algorithms, in a relaxed setting where $\mu$ only satisfies concentrability partially up to a certain time step. Overall, our results offer a quantitative understanding on the benefit of a good reference policy, and make a step towards bridging offline and online RL.
Tengyang Xie, Nan Jiang, Huan Wang, Caiming Xiong, Yu Bai
-
[ Visit Poster at Spot C5 in Virtual World ]
We study online control of an unknown nonlinear dynamical system that is approximated by a time-invariant linear system with model misspecification. Our study focuses on {\bf robustness}, which measures how much deviation from the assumed linear approximation can be tolerated while maintaining a bounded L2-gain. Some models cannot be stabilized even with perfect knowledge of their coefficients: the robustness is limited by the minimal distance between the assumed dynamics and the set of unstabilizable dynamics. Therefore it is necessary to assume a lower bound on this distance. Under this assumption, and with full observation of the $d$ dimensional state, we describe an efficient controller that attains $\Omega(\frac{1}{\sqrt{d}})$ robustness together with an L2-gain whose dimension dependence is near optimal. We also give an inefficient algorithm that attains constant robustness independent of the dimension, with a finite but sub-optimal L2-gain.
Xinyi Chen, Udaya Ghai, Elad Hazan, Alexandre Megretsky
-
[ Visit Poster at Spot C2 in Virtual World ]
Designing provably efficient algorithms with general function approximation is an important open problem in reinforcement learning. Recently, Wang et al.~[2020c] establish a value-based algorithm with general function approximation that enjoys $\widetilde{O}(\mathrm{poly}(dH)\sqrt{K})$\footnote{Throughout the paper, we use $\widetilde{O}(\cdot)$ to suppress logarithm factors. } regret bound, where $d$ depends on the complexity of the function class, $H$ is the planning horizon, and $K$ is the total number of episodes. However, their algorithm requires $\Omega(K)$ computation time per round, rendering the algorithm inefficient for practical use. In this paper, by applying online sub-sampling techniques, we develop an algorithm that takes $\widetilde{O}(\mathrm{poly}(dH))$ computation time per round on average, and enjoys nearly the same regret bound. Furthermore, the algorithm achieves low switching cost, i.e., it changes the policy only $\widetilde{O}(\mathrm{poly}(dH))$ times during its execution, making it appealing to be implemented in real-life scenarios. Moreover, by using an upper-confidence based exploration-driven reward function, the algorithm provably explores the environment in the reward-free setting. In particular, after $\widetilde{O}(\mathrm{poly}(dH))/\epsilon^2$ rounds of exploration, the algorithm outputs an $\epsilon$-optimal policy for any given reward function.
Dingwen Kong, Russ Salakhutdinov, Ruosong Wang, Lin Yang
-
[ Visit Poster at Spot B0 in Virtual World ]

We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori. Due to the lack of further interactions with the environment, offline RL suffers from the insufficient coverage of the dataset, which eludes most existing theoretical analysis. In this paper, we propose a \underline{pe}ssimistic variant of the \underline{v}alue \underline{i}teration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function. Such a penalty function simply flips the sign of the bonus function for promoting exploration in online RL, which makes it easily implementable and compatible with general function approximators.

Without assuming the sufficient coverage of the dataset, we establish a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs). When specialized to linear MDPs, it matches the information-theoretic lower bound up to multiplicative factors of the dimension and horizon. In other words, pessimism is not only provably efficient but also minimax optimal. In particular, given the dataset, the learned policy serves as the best effort'' among all policies, as no other policies can do better. Our theoretical analysis identifies the critical role of pessimism in eliminating a notion of spurious correlation, which emerges from theirrelevant'' trajectories that are less covered by the dataset and not informative for the optimal policy.

Ying Jin, Zhuoran Yang, Zhaoran Wang
-
[ Visit Poster at Spot C0 in Virtual World ]

State-of-the-art deep \Q-learning methods update \Q-values using state transition tuples sampled from the experience replay buffer. Often this strategy is to randomly sample or prioritize data sampling based on measures such as the temporal difference (TD) error. Such sampling strategies are agnostic to the structure of the Markov decision process (MDP) and can therefore be data inefficient at propagating reward signals from goal states to the initial state. To accelerate reward propagation, we make use of the MDP structure by organizing the agent's experience into a graph. Each edge in the graph represents a transition between two connected states. We perform value backups via a breadth-first search that expands vertices in the graph starting from the set of terminal states successively moving backward. We empirically show that our method is substantially more data-efficient than several baselines on sparse reward tasks.

Zhang-Wei Hong, Tao Chen, Yen-Chen Lin, Joni Pajarinen, Pulkit Agrawal
-
[ Visit Poster at Spot C3 in Virtual World ]
We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-$\gamma$, which is based on the \emph{optimism in the face of uncertainty principle} and the Bernstein-type bonus. We show that UCBVI-$\gamma$ achieves an $\tilde{O}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tilde{\Omega}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$\gamma$ is nearly minimax optimal for discounted MDPs.
Jiafan He, Dongruo Zhou, Quanquan Gu
-
[ Visit Poster at Spot C4 in Virtual World ]
The policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems. However, a solid theoretical understanding of even the ``vanilla'' PG has remained elusive for long time. In this paper, we apply recent tools developed for the analysis of SGD in non-convex optimization to obtain convergence guarantees for both REINFORCE and GPOMDP under smoothness assumption on the objective function and weak conditions on the second moment of the norm of the estimated gradient. When instantiated under common assumptions on the policy space, our general result immediately recovers existing $O(\epsilon^{-4})$ sample complexity guarantees, but for wider ranges of parameters (e.g., step size and batch size m) w.r.t. previous literature. Notably, our result includes the single trajectory case (i.e., m=1) and it provides a more accurate analysis of the dependency on problem-specific parameters by fixing previous results available in the literature. We believe that the integration of state-of-the-art tools from non-convex optimization may lead to identify a much broader range of problems where PG methods enjoy strong theoretical guarantees.
Rui Yuan, Robert Gower, Alessandro Lazaric
-
[ Visit Poster at Spot C6 in Virtual World ]

Modern reinforcement learning (RL) commonly engages practical problems with large state spaces, where function approximation must be deployed to approximate either the value function or the policy. While recent progresses in RL theory address a rich set of RL problems with general function approximation, such successes are mostly restricted to the single-agent setting. It remains elusive how to extend these results to multi-agent RL, especially due to the new challenges arising from its game-theoretical nature. This paper considers two-player zero-sum Markov Games (MGs). We propose a new algorithm that can provably find the Nash equilibrium policy using a polynomial number of samples, for any MG with low multi-agent Bellman-Eluder dimension---a new complexity measure adapted from its single-agent version (Jin et al., 2021). A key component of our new algorithm is the exploiter, which facilitates the learning of the main player by deliberately exploiting her weakness. Our theoretical framework is generic, which applies to a wide range of models including but not limited to tabular MGs, MGs with linear or kernel function approximation, and MGs with rich observations.

Chi Jin, Qinghua Liu, Tiancheng Yu
-
[ Visit Poster at Spot B6 in Virtual World ]

Finding the minimal structural assumptions that empower sample-efficient learning is one of the most important research directions in Reinforcement Learning (RL). This paper advances our understanding of this fundamental question by introducing a new complexity measure—Bellman Eluder (BE) dimension. We show that the family of RL problems of low BE dimension is remarkably rich, which subsumes a vast majority of existing tractable RL problems including but not limited to tabular MDPs, linear MDPs, reactive POMDPs, low Bellman rank problems as well as low Eluder dimension problems. This paper further designs a new optimization-based algorithm— GOLF, and reanalyzes a hypothesis elimination-based algorithm—OLIVE (proposed in Jiang et al., 2017). We prove that both algorithms learn the near-optimal policies of low BE dimension problems in a number of samples that is polynomial in all relevant parameters, but independent of the size of state-action space. Our regret and sample complexity results match or improve the best existing results for several well-known subclasses of low BE dimension problems.

Chi Jin, Qinghua Liu, Sobhan Miryoosefi
-
[ Visit Poster at Spot B1 in Virtual World ]
While much of bandit learning focuses on minimizing regret while learning an optimal policy, it is often of interest to estimate the maximum value achievable before learning the optimal policy, which can be of use as an input to downstream tasks like model selection. Prior work in contextual bandits has considered this in the Gaussian setting. Here, we study the problem of approximating the optimal policy value in the more general linear contextual bandit problem, and we focus on whether it is possible to do so with less data than what is needed to learn the optimal policy. We consider two objectives: (1) estimating upper bounds on the value and (2) estimating the value directly. For the first, we present an adaptive upper bound that is at most logarithmic factor larger than the value and tight when the data is Gaussian and show that it is possible to estimate this upper bound in $\widetilde{\mathcal{O}}( \sqrt{d} )$ samples where $d$ is the number of parameters. As a consequence of this bound, we show improved regret bounds for model selection. For the second objective, we present a moment-based algorithm for estimating the optimal policy value with sample complexity $\widetilde{ \mathcal{O}}( \sqrt{d} )$ for sub-Gaussian context distributions whose low order moments are known.
Jonathan Lee, Weihao Kong, Aldo Pacchiano, Vidya Muthukumar, Emma Brunskill
-
[ Visit Poster at Spot C0 in Virtual World ]

Eluder dimension and information gain are two widely used methods of complexity measures in bandit and reinforcement learning. Eluder dimension was originally proposed as a general complexity measure of function classes, but the common examples of where it is known to be small are function spaces (vector spaces). In these cases, the primary tool to upper bound the eluder dimension is the elliptic potential lemma. Interestingly, the elliptic potential lemma also features prominently in the analysis of linear bandits/reinforcement learning and their nonparametric generalization, the information gain. We show that this is not a coincidence -- eluder dimension and information gain are equivalent in a precise sense for reproducing kernel Hilbert spaces.

Kaixuan Huang, Sham Kakade, Jason Lee, Qi Lei
-
[ Visit Poster at Spot C5 in Virtual World ]
Policy gradient methods have been frequently applied to problems in control and reinforcement learning with great success, yet existing convergence analysis still relies on non-intuitive, impractical and often opaque conditions. In particular, existing rates are achieved in limited settings, under strict smoothness and ergodicity conditions. In this work, we establish explicit convergence rates of policy gradient methods without relying on these conditions, instead extending the convergence regime to weakly smooth policy classes with $L_2$ integrable gradient. We provide intuitive examples to illustrate the insight behind these new conditions. We also characterize the sufficiency conditions for the ergodicity of near-linear MDPs, which represent an important class of problems. Notably, our analysis also shows that fast convergence rates are achievable for both the standard policy gradient and the natural policy gradient algorithms under these assumptions. Lastly we provide conditions and analysis for optimality of the converged policies.
Matthew Zhang, Murat Erdogdu, Animesh Garg
-
[ Visit Poster at Spot C1 in Virtual World ]
We study reinforcement learning for two-player zero-sum Markov games with simultaneous moves in the finite-horizon setting, where the transition kernel of the underlying Markov games can be parameterized by a linear function over the current state, both players' actions and the next state. In particular, we assume that we can control both players and aim to find the Nash Equilibrium by minimizing the duality gap. We propose an algorithm Nash-UCRL based on the principle ``Optimism-in-Face-of-Uncertainty''. Our algorithm only needs to find a Coarse Correlated Equilibrium (CCE), which is computationally very efficient. Specifically, we show that Nash-UCRL can provably achieve an $\tilde{O}(\sqrt{d^2 H^2 T})$ regret, where $d$ is the linear function dimension, $H$ is the length of the game and $T$ is the total number of steps in the game. To assess the optimality of our algorithm, we also prove an $\tilde{\Omega}( \sqrt{d^2 H^2 T})$ lower bound on the regret. Our upper bound matches the lower bound up to logarithmic factors, which suggests the optimality of our algorithm.
Zixiang Chen, Dongruo Zhou, Quanquan Gu
-
[ Visit Poster at Spot B1 in Virtual World ]

Off-policy policy evaluation is a fundamental problem in reinforcement learning. As a result, many estimators with different tradeoffs have been developed, however, selecting the best estimator is challenging with limited data and without additional interactive data collection. Recently, Su et al. (2020) developed a data-dependent selection procedure that competes with the oracle selection up to a constant and demonstrate its practicality. We refine the analysis to remove an extraneous assumption and improve the procedure. The improved procedure results in a tighter oracle bound and stronger empirical results on a contextual bandit task.

George Tucker
-
[ Visit Poster at Spot C1 in Virtual World ]

We study efficient algorithms for reinforcement learning in Markov decision processes, whose complexity is independent of the number of states. This formulation succinctly captures large scale problems, but is also known to be computationally hard in its general form. We consider the methodology of boosting, borrowed from supervised learning, for converting weak learners into an accurate policy. The notion of weak learning we study is that of sampled-based approximate optimization of linear functions over policies. We give an efficient algorithm that is capable of improving the accuracy of such weak learning methods, till global optimality is reached. We prove sample complexity and running time bounds on our method, that are polynomial in the natural parameters of the problem: approximation guarantee, discount factor, distribution mismatch and number of actions. In particular, our bound does not explicitly depend on the number of states.

A technical difficulty in applying previous boosting results, is that the value function over policy space is not convex. We show how to use a non-convex variant of the Frank-Wolfe method, coupled with recent advances in gradient boosting that allow incorporating a weak learner with multiplicative approximation guarantee, to overcome the non-convexity and attain global convergence.

Nataly Brukhim, Elad Hazan, Karan Singh
-
[ Visit Poster at Spot C5 in Virtual World ]

We study the problem of the design of simple economic mechanisms for assigning items to self-interested agents that combine a messaging round with a sequential-pricing stage. The rules of the sequential-pricing stage and in particular the way these rules use messages determines the way the messaging stage is used. This is a Stackelberg game where the designer is the leader and fixes the mechanism rules, inducing an equilibrium amongst agents (the followers). We model the followers through equilibrium play coming from no-regret learning, and introduce a novel single-agent Stackelberg MDP formulation, where the leader learns to effect a follower equilibrium that optimizes its objective. We solve this MDP using actor-critic methods, where the critic is given access to the joint information of all the agents.

Gianluca Brero
-
[ Visit Poster at Spot C6 in Virtual World ]

The policy improvement bound on the difference of the discounted returns plays a crucial role in the theoretical justification of the trust-region policy optimization (TRPO) algorithm. The existing bound leads to a degenerate bound when the discount factor approaches one, making the applicability of TRPO and related algorithms questionable when the discount factor is close to one. We refine the results in (Schulman et al., 2015; Achiam et al., 2017) and propose a novel bound that is ``continuous'' in the discount factor. In particular, our bound is applicable for MDPs with the long-run average rewards as well.

Mark Gluzman
-
[ Visit Poster at Spot C2 in Virtual World ]

We study meta-learning in Markov Decision Processes (MDP) with linear transition models in the undiscounted episodic setting. Under a task sharedness metric based on model proximity, we propose an algorithm that can meaningfully leverage learning in a set of sampled training tasks to quickly adapt to test tasks sampled from the same task distribution. We propose a biased version of the UC-MatrixRL algorithm~\cite{yang2019reinforcement}. The analysis leverages and extends results in the learning to learn linear regression and linear bandit setting to the more general case of MDP's with linear transition models. We study the effect of the bias on single task regret and expected regret over the task distribution. We prove that our algorithm provides significant improvements in the transfer regret for task distributions of low variance and high bias compared to learning the tasks in isolation. We outline and analyse two approaches to learn the bias.

Robert Müller, Aldo Pacchiano, Jack Parker-Holder
-
[ Visit Poster at Spot C3 in Virtual World ]
We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through $T$ episodes, with the goal of achieving $O(\sqrt{T})$ regret when the losses are adversarial and simultaneously $O(\log T)$ regret when the losses are (almost) stochastic. Recent work by [Jin and Luo, 2020] achieves this goal when the fixed transition is known, and leaves the case of unknown transition as a major open question. In this work, we resolve this open problem by using the same Follow-the-Regularized-Leader (FTRL) framework together with a set of new techniques. Specifically, we first propose a loss-shifting trick in the FTRL analysis, which greatly simplifies the approach of [Jin and Luo, 2020] and already improves their results for the known transition case. Then, we extend this idea to the unknown transition case and develop a novel analysis which upper bounds the transition estimation error by the regret itself in the stochastic setting, a key property to ensure $O(\log T)$ regret.
Tiancheng Jin, Longbo Huang, Haipeng Luo
-
[ Visit Poster at Spot D1 in Virtual World ]
Learning how to effectively control unknown dynamical systems is crucial for autonomous systems. This task becomes more challenging when the underlying dynamics are changing with time. Motivated by this challenge, this paper considers the problem of controlling an unknown Markov jump linear system (MJS) to optimize a quadratic objective. By taking a model-based perspective, we consider identification-based adaptive control for MJSs. We first provide a system identification algorithm for MJS to learn the dynamics in each mode as well as the Markov transition matrix, underlying the evolution of the mode switches, from a single trajectory of the system states, inputs, and modes. Through mixing-time arguments, sample complexity of this algorithm is shown to be $\tilde{\mathcal{O}}(1/\sqrt{T})$. We then propose an adaptive control scheme that performs system identification together with certainty equivalent control to adapt the controllers in an episodic fashion. Combining our sample complexity results with recent perturbation results for certainty equivalent control, we prove that the proposed adaptive control scheme achieves $\tilde{\mathcal{O}}(\sqrt{T})$ regret, which can be improved to $\hat{\mathcal{O}}(\log(T))$ with partial knowledge of the system. Our analysis introduces innovations to handle MJS specific challenges (e.g. Markovian jumps) and provides insights into system theoretic quantities that affect learning accuracy and control performance.
Yahya Sattar, Zhe Du, Davoud Ataee Tarzanagh, Necmiye Ozay, Laura Balzano, Samet Oymak
-
[ Visit Poster at Spot B2 in Virtual World ]

Most of the existing theoretical studies on representation learning are focused on batch tasks. However, in practical decision-making scenarios, the learner often observes tasks in a sequential fashion. In such sequential problems, learning good representations becomes more challenging as the underlying task representation may change over time. In this paper, we address non-stationary representation learning in sequential multi-armed linear bandits. We introduce an online algorithm that is able to detect task switches and learn and transfer a non-stationary representation in an adaptive fashion. We derive a regret upper bound for our algorithm, which significantly outperforms the existing ones that do not learn the representation. Our bound provides theoretical insights into problem-dependent quantities and reveals the excess regret incurred by representation learning, non-stationarity, and task switch detection.

Qin Yuzhen, Tommaso Menara, Samet Oymak, ShiNung Ching, Fabio Pasqualetti
-
[ Visit Poster at Spot C4 in Virtual World ]

While deep reinforcement learning (RL) methods present an appealing approach to sequential decision-making, such methods are often unstable in practice. What accounts for this instability? Recent theoretical analysis of overparameterized supervised learning with stochastic gradient descent shows that learning is driven by an implicit regularizer once it reaches the zero training error regime, which results in “simpler” functions that generalize. However, in this paper, we show that in the case of deep RL, implicit regularization can instead lead to degenerate features by characterizing the induced implicit regularizer in the semi-gradient TD learning setting in RL with a fixed-dataset (i.e., offline RL). The regularize recapitulates recent empirical findings regarding the rank collapse of learned features and provides an understanding for its cause. To address the adverse impacts of this implicit regularization, we propose a simple and effective explicit regularizer for TD learning, DR3. DR3 minimizes the similarity of learned features of the Q-network at consecutive state-action tuples in the TD update. Empirically, when combined with existing offline RL methods, DR3 substantially improves both performance and stability on Atari 2600 games, D4RL domains, and robotic manipulation from images.

Aviral Kumar, Rishabh Agarwal, Aaron Courville, Tengyu Ma, George Tucker, Sergey Levine
-
[ Visit Poster at Spot C5 in Virtual World ]
Policy optimization is a widely-used method in reinforcement learning. Due to its local-search nature, however, theoretical guarantees on global optimality often rely on extra assumptions on the Markov Decision Processes (MDPs) that bypass the challenge of global exploration. To eliminate the need of such assumptions, in this work, we develop a general solution that adds \textit{dilated bonuses} to the policy update to facilitate global exploration. To showcase the power and generality of this technique, we apply it to several episodic MDP settings with adversarial losses and bandit feedback, improving and generalizing the state-of-the-art. Specifically, in the tabular case, we obtain $O(\sqrt{T})$ regret where $T$ is the number of episodes, improving the $O({T}^{\frac{2}{3}})$ regret bound by Shani et al. (2020). When the number of states is infinite, under the assumption that the state-action values are linear in some low-dimensional features, we obtain $O({T}^{\frac{2}{3}})$ regret with the help of a simulator, matching the result of Neu and Olkhovskaya (2020) while importantly removing the need of an exploratory policy that their algorithm requires. When a simulator is unavailable, we further consider a linear MDP setting and obtain $O({T}^{\frac{14}{15}})$ regret, which is the first result for linear MDPs with adversarial losses and bandit feedback.
Haipeng Luo, Chen-Yu Wei, Chung-Wei Lee
-
[ Visit Poster at Spot D0 in Virtual World ]
In this work we study the sample complexity for solving average-reward Markov decision processes (AMDPs), under a generative model access and mixing time bound on all stationary policies. Given an AMDP with mixing time bound $t_{mix}$ and $A_{tot}$ total state-action pairs, we present two methods for finding approximately-optimal stationary policy, altogether obtaining an upper bound of $\widetilde{O}(A_{tot}t_{mix}/\epsilon^2\cdot\min(t_{mix}, 1/\epsilon))$ in sample complexity. We also provide a sample complexity lower bound of $\Omega(A_{tot}t_{mix}/\epsilon^2)$ oblivious samples. This work makes progress toward designing new algorithms with better sample complexity for solving AMDPs and points to the final open problem of closing the gap with the lower bound.
Yujia Jin
-
[ Visit Poster at Spot D1 in Virtual World ]

In this paper, we study the finite-time behaviour of temporal difference (TD) learning algorithms when combined with tail-averaging, and present instance dependent bounds on the parameter error of the tail-averaged TD iterate. Our error bounds hold in expectation as well as with high probability, exhibit a sharper rate of decay for the initial error (bias), and are comparable with existing bounds in the literature.

Gandharv Patil, Prashanth L.A., Doina Precup
-
[ Visit Poster at Spot C6 in Virtual World ]

Offline reinforcement learning (RL) algorithms have shown promising results in domains where abundant pre-collected data is available. However, prior methods focus on solving individual problems from scratch with an offline dataset without considering how an offline RL agent can acquire multiple skills. We argue that a natural use case of offline RL is in settings where we can pool large amounts of data collected in a number of different scenarios for solving various tasks, and utilize all this data to learn strategies for all the tasks more effectively rather than training each one in isolation. To this end, we study the offline multi-task RL problem, with the goal of devising data-sharing strategies for effectively learning behaviors across all of the tasks. While it is possible to share all data across all tasks, we find that this simple strategy can actually exacerbate the distributional shift between the learned policy and the dataset, which in turn can lead to very poor performance. To address this challenge, we develop a simple technique for data-sharing in multi-task offline RL that routes data based on the improvement over the task-specific data. We call this approach conservative data sharing (CDS), and it can be applied with any single-task offline RL method. On a range of challenging multi-task locomotion, navigation, and image-based robotic manipulation problems, CDS achieves the best or comparable performance compared to prior offline multi-task RL methods and previously proposed online multi-task data sharing approaches.

Tianhe (Kevin) Yu, Aviral Kumar, Yevgen Chebotar, Karol Hausman, Sergey Levine, Chelsea Finn
-
[ Visit Poster at Spot D0 in Virtual World ]

We study multi-task reinforcement learning (RL) in tabular episodic Markov decision processes(MDPs). We formulate a heterogeneous multi-player RL problem, in which a group of players concurrently face similar but not necessarily identical MDPs, with a goal of improving their collective performance through inter-player information sharing. We design and analyze a model-based algorithm, and provide gap-dependent and gap-independent upper and lower bounds that characterize the intrinsic complexity of the problem.

Chicheng Zhang, Zhi Wang
-

Reinforcement learning is hard in general. Yet, in many specific environments, learning is easy. What makes learning easy in one environment, but difficult in another? We address this question by proposing a simple measure of reinforcement learning hardness called the bad-policy density. This quantity measures the fraction of the deterministic stationary policy space that is below a desired threshold in value. We prove that this simple quantity has many properties one would expect of a measure of learning hardness. Further, we prove it is NP-hard to compute the measure in general, but there are paths to polynomial-time approximation. We conclude by summarizing potential directions and uses for this measure.

Dave Abel, Cameron Allen, Dilip Arumugam, D Ellis Hershkowitz, Michael L. Littman, Lawson Wong
-
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and meanwhile avoids violation of certain constraints on a number of expected total costs. In general, such SRL problems have nonconvex objective functions subject to multiple nonconvex constraints, and hence are very challenging to solve, particularly to provide a globally optimal policy. Many popular SRL algorithms adopt a primal-dual structure which utilizes the updating of dual variables for satisfying the constraints. In contrast, we propose a primal approach, called constraint-rectified policy optimization (CRPO), which updates the policy alternatingly between objective improvement and constraint satisfaction. CRPO provides a primal-type algorithmic framework to solve SRL problems, where each policy update can take any variant of policy optimization step. To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an $\mathcal{O}(1/\sqrt{T})$ convergence rate to the global optimal policy in the constrained policy set and an $\mathcal{O}(1/\sqrt{T})$ error bound on constraint satisfaction. This is the first finite-time analysis of primal SRL algorithms with global optimality guarantee. Our empirical results demonstrate that CRPO can outperform the existing primal-dual baseline algorithms significantly.
Tengyu Xu, Yingbin LIANG, Guanghui Lan
-

A fundamental concept in control theory is that of controllability, where any system state can be reached through an appropriate choice of control inputs. Indeed, a large body of classical and modern approaches are designed for controllable linear dynamical systems. However, in practice, we often encounter systems in which a large set of state variables evolve exogenously and independently of the control inputs; such systems are only \emph{partially controllable}. The focus of this work is on a large class of partially controllable linear dynamical system, specified by an underlying sparsity pattern. Our main results establish structural conditions and finite-sample guarantees for learning to control such systems. In particular, our structural results characterize those state variables which are irrelevant for optimal control, an analysis which departs from classical control techniques. Our algorithmic results adapt techniques from high-dimensional statistics---specifically soft-thresholding and semiparametric least-squares---to exploit the underlying sparsity pattern in order to obtain finite-sample guarantees that significantly improve over those based on certainty-equivalence. We also corroborate these theoretical improvements over certainty-equivalent control through a simulation study.

Yonathan Efroni, Sham Kakade, Akshay Krishnamurthy, Cyril Zhang
-

Actor-critic methods are widely used in offline reinforcement learning practice but are understudied theoretically. In this work we show that the pessimism principle can be naturally incorporated into actor-critic formulations. We create an offline actor-critic algorithm for a linear MDP model more general than the low-rank model. The procedure is both minimax optimal and computationally tractable.

Andrea Zanette, Martin Wainwright, Emma Brunskill
-
The multi-armed bandit (MAB) problem is an active learning framework that aims to select the best among a set of actions by sequentially observing rewards. Recently, it has become popular for a number of applications over wireless networks, where communication constraints can form a bottleneck. Yet existing works usually fail to address this issue and can become infeasible in certain applications. In this paper, we propose QuBan, a generic reward quantization algorithm that applies to any (no-regret) multi-armed bandit algorithm. The modified algorithm requires on average a few (as low as $3$) bits to be sent per iteration, yet preserving the same regret as the original algorithm. Our upper bounds apply under mild assumptions on the reward distributions over all current (and future) MAB algorithms, including those used in contextual bandits. We also numerically evaluate the application of QuBan to widely used algorithms such as UCB and $\epsilon$-greedy.
Osama Hanna, Lin Yang, Christina Fragouli
-

We introduce a generic template for developing regret minimization algorithms in the Stochastic Shortest Path (SSP) model, which achieves minimax optimal regret as long as certain properties are ensured. The key of our analysis is a new technique called implicit finite-horizon approximation, which approximates the SSP model by a finite-horizon counterpart only in the analysis without explicit implementation. Using this template, we develop two new algorithms: the first one is model-free (the first in the literature to our knowledge) and minimax optimal under strictly positive costs; the second one is model-based and minimax optimal even with zero-cost state-action pairs, matching the best existing result from (Tarbouriech et al., 2021b). Importantly, both algorithms admit highly sparse updates, making them computationally more efficient than all existing algorithms. Moreover, both can be made completely parameter-free.

Liyu Chen, Mehdi Jafarnia, Rahul Jain, Haipeng Luo
-

We study a theory of reinforcement learning (RL) in which the learner receives binary feedback only once at the end of an episode. While this is an extreme test case for theory, it is also arguably more representative of real-world applications than the traditional requirement in RL practice that the learner receive feedback at every time step. Indeed, in many real-world applications of reinforcement learning, such as self-driving cars and robotics, it is easier to evaluate whether a learner's complete trajectory was either "good" or "bad," but harder to provide a reward signal at each step. To show that learning is possible in this more challenging setting, we study the case where trajectory labels are generated by an unknown parametric model, and provide a statistically and computationally efficient algorithm that achieves sub-linear regret.

Niladri Chatterji, Aldo Pacchiano, Peter Bartlett, Michael Jordan
-

Real world applications such as economics and policy making often involve solving multi-agent games with two unique features: (1) The agents are inherently \emph{asymmetric} and partitioned into leaders and followers; (2) The agents have different reward functions, thus the game is \emph{general-sum}. The majority of existing results in this field focuses on either symmetric solution concepts (e.g. Nash equilibrium) or zero-sum games. It remains vastly open how to learn the \emph{Stackelberg equilibrium}---an asymmetric analog of the Nash equilibrium---in general-sum games efficiently from samples.

This paper initiates the theoretical study of sample-efficient learning of the Stackelberg equilibrium, in the bandit feedback setting where we only observe noisy samples of the reward. We consider three representative two-player general-sum games: bandit games, bandit-reinforcement learning (bandit-RL) games, and linear bandit games. In all these games, we identify a fundamental gap between the exact value of the Stackelberg equilibrium and its estimated version using finitely many noisy samples, which can not be closed information-theoretically regardless of the algorithm. We then establish sharp positive results on sample-efficient learning of Stackelberg equilibrium with value optimal up to the gap identified above, with matching lower bounds in the dependency on the gap, error tolerance, and the size of the action spaces. Overall, our results unveil unique challenges in learning Stackelberg equilibria under noisy bandit feedback, which we hope could shed light on future research on this topic.

Yu Bai, Chi Jin, Huan Wang, Caiming Xiong

Author Information

Shipra Agrawal (Columbia University)
Simon Du (University of Washington)
Niao He (ETH Zurich)
Csaba Szepesvari
Lin Yang (UCLA)

More from the Same Authors