While over many years we have witnessed numerous impressive demonstrations of the power of various reinforcement learning (RL) algorithms, and while much progress was made on the theoretical side as well, the theoretical understanding of the challenges that underlie RL is still rather limited. The beststudied problem settings, such as learning and acting in finite stateaction Markov decision processes, or simple linear control systems fail to capture the essential characteristics of seemingly more practically relevant problem classes, where the size of the stateaction space is often astronomical, the planning horizon is huge, the dynamics is complex, interaction with the controlled system is not permitted, or learning has to happen based on heterogeneous offline data, etc. To tackle these diverse issues, more and more theoreticians with a wide range of backgrounds came to study RL and have proposed numerous new models along with exciting novel developments on both algorithm design and analysis. The workshop's goal is to highlight advances in theoretical RL and bring together researchers from different backgrounds to discuss RL theory from different perspectives: modeling, algorithm, analysis, etc.
Sat 9:00 a.m.  9:25 a.m.

Invited Speaker: Emilie Kaufmann: On pureexploration in Markov Decision Processes
(Presentation)
SlidesLive Video » This talk will be about some recent works on the sample complexity of pureexploration in an episodic MDP, when we sequentially collect trajectories, with or without rewards, and aim at identifying good policies. In particular, I will describe an optimal algorithm for rewardfree exploration called RFExpress, which interestingly features some exploration bonuses that scale in 1/n instead of the usual 1/\sqrt{n}. 
Emilie Kaufmann 🔗 
Sat 9:30 a.m.  9:55 a.m.

Invited Speaker: Christian Kroer: Recent Advances in Iterative Methods for LargeScale Game Solving
(Presentation)
link »
SlidesLive Video » Iterative methods for approximating zerosum Nash equilibria in extensiveform games have been a core component of recent advances in superhuman poker AIs. In this talk, I will first give an optimizationoriented description of how these methods work. Then, I will discuss and contrast two recent results: First, the development of a new entropybased regularization method for the decision spaces associated with extensiveform games, which is simultaneously simpler to analyze and has better theoretical properties than the current state of the art. Second, I will discuss new algorithms based on optimistic variants of regret matching and CFR, which lead to very strong practical performance, in spite of inferior theoretical guarantees. This talk is based on joint work with Gabriele Farina and Tuomas Sandholm. 
Christian Kroer 🔗 
Sat 10:00 a.m.  10:12 a.m.

Sparsity in the Partially Controllable LQR
(Poster & Spotlight Talk)
SlidesLive Video » A fundamental concept in control theory is that of controllability, where any system state can be reached through an appropriate choice of control inputs. Indeed, a large body of classical and modern approaches are designed for controllable linear dynamical systems. However, in practice, we often encounter systems in which a large set of state variables evolve exogenously and independently of the control inputs; such systems are only \emph{partially controllable}. The focus of this work is on a large class of partially controllable linear dynamical system, specified by an underlying sparsity pattern. Our main results establish structural conditions and finitesample guarantees for learning to control such systems. In particular, our structural results characterize those state variables which are irrelevant for optimal control, an analysis which departs from classical control techniques. Our algorithmic results adapt techniques from highdimensional statisticsspecifically softthresholding and semiparametric leastsquaresto exploit the underlying sparsity pattern in order to obtain finitesample guarantees that significantly improve over those based on certaintyequivalence. We also corroborate these theoretical improvements over certaintyequivalent control through a simulation study. 
Yonathan Efroni · Sham Kakade · Akshay Krishnamurthy · Cyril Zhang 🔗 
Sat 10:15 a.m.  10:27 a.m.

On the Theory of Reinforcement Learning with OnceperEpisode Feedback
(Poster & Spotlight Talk)
SlidesLive Video » We study a theory of reinforcement learning (RL) in which the learner receives binary feedback only once at the end of an episode. While this is an extreme test case for theory, it is also arguably more representative of realworld applications than the traditional requirement in RL practice that the learner receive feedback at every time step. Indeed, in many realworld applications of reinforcement learning, such as selfdriving cars and robotics, it is easier to evaluate whether a learner's complete trajectory was either "good" or "bad," but harder to provide a reward signal at each step. To show that learning is possible in this more challenging setting, we study the case where trajectory labels are generated by an unknown parametric model, and provide a statistically and computationally efficient algorithm that achieves sublinear regret. 
Niladri Chatterji · Aldo Pacchiano · Peter Bartlett · Michael Jordan 🔗 
Sat 10:30 a.m.  10:42 a.m.

Implicit FiniteHorizon Approximation for Stochastic Shortest Path
(Poster & Spotlight Talk)
SlidesLive Video » We introduce a generic template for developing regret minimization algorithms in the Stochastic Shortest Path (SSP) model, which achieves minimax optimal regret as long as certain properties are ensured. The key of our analysis is a new technique called implicit finitehorizon approximation, which approximates the SSP model by a finitehorizon counterpart only in the analysis without explicit implementation. Using this template, we develop two new algorithms: the first one is modelfree (the first in the literature to our knowledge) and minimax optimal under strictly positive costs; the second one is modelbased and minimax optimal even with zerocost stateaction pairs, matching the best existing result from (Tarbouriech et al., 2021b). Importantly, both algorithms admit highly sparse updates, making them computationally more efficient than all existing algorithms. Moreover, both can be made completely parameterfree. 
Liyu Chen · Mehdi Jafarnia · Rahul Jain · Haipeng Luo 🔗 
Sat 10:45 a.m.  10:57 a.m.

Provable Benefits of ActorCritic Methods for Offline Reinforcement Learning
(Poster & Spotlight Talk)
SlidesLive Video » Actorcritic methods are widely used in offline reinforcement learning practice but are understudied theoretically. In this work we show that the pessimism principle can be naturally incorporated into actorcritic formulations. We create an offline actorcritic algorithm for a linear MDP model more general than the lowrank model. The procedure is both minimax optimal and computationally tractable. 
Andrea Zanette · Martin Wainwright · Emma Brunskill 🔗 
Sat 11:00 a.m.  11:25 a.m.

Invited Speaker: Animashree Anandkumar: Stabilityaware reinforcement learning in dynamical systems
(Presentation)
SlidesLive Video » TBD 
Animashree Anandkumar 🔗 
Sat 11:30 a.m.  11:55 a.m.

Invited Speaker: Shie Mannor: Lenient Regret
(Presentation)
SlidesLive Video » We consider the MultiArmed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took. While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the agent's action, this criterion might lead to undesirable results. For example, in large problems, or when the interaction with the environment is brief, finding an optimal arm is infeasible, and regretminimizing algorithms tend to overexplore. To overcome this issue, algorithms for such settings should instead focus on playing nearoptimal arms. To this end, we suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some ϵ. We then present a variant of the Thompson Sampling (TS) algorithm, called ϵTS, and prove its asymptotic optimality in terms of the lenient regret. Importantly, we show that when the mean of the optimal arm is high enough, the lenient regret of ϵTS is bounded by a constant. Finally, we show that ϵTS can be applied to improve the performance when the agent knows a lower bound of the suboptimality gaps. 
Shie Mannor 🔗 
Sat 12:00 p.m.  12:30 p.m.

Social Session
(Discussion & Chat)

🔗 
Sat 12:30 p.m.  2:00 p.m.

Poster Session  I
(Poster Session)

🔗 
Sat 2:00 p.m.  2:25 p.m.

Invited Speaker: Bo Dai: Leveraging Nonuniformity in Policy Gradient
(Presentation)
SlidesLive Video » Policy gradient is one of the stateoftheart algorithm families in reinforcement learning, which has been proved to be globally convergent. Motivated by properties of the accumulated reward in MDP, we propose a nonuniform refinement of the smoothness (NS) and \L{}ojasiewicz condition (N\L{}). The new definitions inspire new geometryaware firstorder policy gradient that are able to converge to global optimality in linear rate while incurring less overhead than existing algorithms, e.g., natural/mirror policy gradient. Similarly, For GLM, we show that geometryaware normalized gradient descent can also achieve a linear convergence rate in fitting generalized linear models. Experimental results are used to illustrate and complement the theoretical findings. 
Bo Dai 🔗 
Sat 2:30 p.m.  2:55 p.m.

Invited Speaker: Qiaomin Xie: Reinforcement Learning for ZeroSum Markov Games Using Function Approximation and Correlated Equilibrium
(Presentation)
SlidesLive Video » We consider onpolicy reinforcement learning for twoplayer zerosum Markov games with simultaneous moves, which generalizes both matrix games and Markov decision processes. To ensure exploration, our algorithm needs to construct upper and lower confidence bounds for both players in each round. We show that this can be done by finding the Coarse Correlated Equilibrium of an appropriate generalsum matrix game, which can be solved in polynomial time. The algorithm applies to the offline setting where one aims to find the Nash Equilibria of the Markov game, and the online setting where one plays against an arbitrary opponent and aims to minimize regret. For both settings, we provide sample complexity and regret guarantees. Our results generalize to Markov games with continuous and unbounded state spaces. Using linear function approximation, we develop an optimistic version of leastsquares minimax value iteration, for which we provide performance guarantees that only depend on the linear dimension. This setting highlights the analytical challenges that arise due to the nonLipschitzness of CCEs of generalsum games. Based on https://arxiv.org/abs/2002.07066 . 
Qiaomin Xie 🔗 
Sat 3:00 p.m.  3:12 p.m.

BadPolicy Density: A Measure of ReinforcementLearning Hardness
(Poster & Spotlight Talk)
SlidesLive Video » Reinforcement learning is hard in general. Yet, in many specific environments, learning is easy. What makes learning easy in one environment, but difficult in another? We address this question by proposing a simple measure of reinforcement learning hardness called the badpolicy density. This quantity measures the fraction of the deterministic stationary policy space that is below a desired threshold in value. We prove that this simple quantity has many properties one would expect of a measure of learning hardness. Further, we prove it is NPhard to compute the measure in general, but there are paths to polynomialtime approximation. We conclude by summarizing potential directions and uses for this measure. 
David Abel · Cameron Allen · Dilip Arumugam · D Ellis Hershkowitz · Michael L. Littman · Lawson Wong 🔗 
Sat 3:15 p.m.  3:27 p.m.

SampleEfficient Learning of Stackelberg Equilibria in GeneralSum Games
(Poster & Spotlight Talk)
SlidesLive Video » Real world applications such as economics and policy making often involve solving multiagent games with two unique features: (1) The agents are inherently \emph{asymmetric} and partitioned into leaders and followers; (2) The agents have different reward functions, thus the game is \emph{generalsum}. The majority of existing results in this field focuses on either symmetric solution concepts (e.g. Nash equilibrium) or zerosum games. It remains vastly open how to learn the \emph{Stackelberg equilibrium}an asymmetric analog of the Nash equilibriumin generalsum games efficiently from samples. This paper initiates the theoretical study of sampleefficient learning of the Stackelberg equilibrium, in the bandit feedback setting where we only observe noisy samples of the reward. We consider three representative twoplayer generalsum games: bandit games, banditreinforcement learning (banditRL) games, and linear bandit games. In all these games, we identify a fundamental gap between the exact value of the Stackelberg equilibrium and its estimated version using finitely many noisy samples, which can not be closed informationtheoretically regardless of the algorithm. We then establish sharp positive results on sampleefficient learning of Stackelberg equilibrium with value optimal up to the gap identified above, with matching lower bounds in the dependency on the gap, error tolerance, and the size of the action spaces. Overall, our results unveil unique challenges in learning Stackelberg equilibria under noisy bandit feedback, which we hope could shed light on future research on this topic. 
Yu Bai · Chi Jin · Huan Wang · Caiming Xiong 🔗 
Sat 3:30 p.m.  3:42 p.m.

Solving MultiArm Bandit Using a Few Bits of Communication
(Poster & Spotlight Talk)
SlidesLive Video »
The multiarmed bandit (MAB) problem is an active learning framework that aims to select the best among a set of actions by sequentially observing rewards. Recently, it has become popular for a number of applications over wireless networks, where communication constraints can form a bottleneck. Yet existing works usually fail to address this issue and can become infeasible in certain applications.
In this paper, we propose QuBan, a generic reward quantization algorithm that applies to any (noregret) multiarmed bandit algorithm. The modified algorithm requires on average a few (as low as $3$) bits to be sent per iteration, yet preserving the same regret as the original algorithm.
Our upper bounds apply under mild assumptions on the reward distributions over all current (and future) MAB algorithms, including those used in contextual bandits.
We also numerically evaluate the application of QuBan to widely used algorithms such as UCB and $\epsilon$greedy.

Osama Hanna · Lin Yang · Christina Fragouli 🔗 
Sat 3:45 p.m.  3:57 p.m.

CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee
(Poster & Spotlight Talk)
SlidesLive Video »
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and meanwhile avoids violation of certain constraints on a number of expected total costs. In general, such SRL problems have nonconvex objective functions subject to multiple nonconvex constraints, and hence are very challenging to solve, particularly to provide a globally optimal policy. Many popular SRL algorithms adopt a primaldual structure which utilizes the updating of dual variables for satisfying the constraints. In contrast, we propose a primal approach, called constraintrectified policy optimization (CRPO), which updates the policy alternatingly between objective improvement and constraint satisfaction. CRPO provides a primaltype algorithmic framework to solve SRL problems, where each policy update can take any variant of policy optimization step. To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an $\mathcal{O}(1/\sqrt{T})$ convergence rate to the global optimal policy in the constrained policy set and an $\mathcal{O}(1/\sqrt{T})$ error bound on constraint satisfaction. This is the first finitetime analysis of primal SRL algorithms with global optimality guarantee. Our empirical results demonstrate that CRPO can outperform the existing primaldual baseline algorithms significantly.

Tengyu Xu · Yingbin LIANG · Guanghui Lan 🔗 
Sat 4:00 p.m.  4:25 p.m.

Invited Speaker: Art Owen: Empirical likelihood for reinforcement learning
(Presentation)
link »
SlidesLive Video » Empirical likelihood (EL) is a statistical method that uses a nonparametric likelihood function. It allows one to construct confidence intervals without having to specify that the data come from some parametric distribution such as a Gaussian. Operationally, EL involves a strategic reweighting of observed data to attain its goals. This makes it similar to importance sampling and self normalized importance sampling, both widely used for off policy evaluation in reinforcement learning (RL). Recently EL has been used in off policy evaluation and in distributionally robust inference. This talk gives the basic motivation and some results about EL thought to be useful for RL: (a) EL inferences can be as or more powerful than their corresponding parametric counterparts, depending on how one keeps score. (b) There is a natural way to incorporate sampling bias via reweighting. (c) One can exploit side knowledge expresed as some known expected values. (d) An empirical likelihood can be paired with a prior distribution to get Bayesian inferences on quantities of interest without having to choose a parameteric distribution. 
🔗 
Sat 4:30 p.m.  5:00 p.m.

Panel Session: Animashree Anandkumar, Christian Kroer, Art Owen, Qiaomin Xie
(Discussion Panel)
SlidesLive Video » 
🔗 
Sat 5:00 p.m.  5:30 p.m.

Social Session
(Discussion & Chat)

🔗 
Sat 5:30 p.m.  9:00 p.m.

Poster Session  II
(Poster Session)

🔗 


Finding the Near Optimal Policy via Reductive Regularization in MDPs
(Poster)
Regularized Markov Decision processes (MDPs) serve as a smooth version of ordinary MDPs to encourage exploration. Given a regularized MDP, however, the optimal policy is often biased when evaluating the value function. Rather than making the coefficient $\lambda$ of regularized term sufficiently small, we propose a scheme by reducing $\lambda$ to approximate the optimal policy of the original MDP. We prove that the iteration complexity to obtain an $\varepsilon$optimal policy could be maintained or even reduced in comparison with setting a sufficiently small $\lambda$ in both dynamic programming and policy gradient methods. In addition, there exists a strong duality connection between the reduction method and solving the original MDP directly, from which we can derive more adaptive reduction methods for certain reinforcement learning algorithms.

Wenhao Yang · Xiang Li · Guangzeng Xie · Zhihua Zhang 🔗 


Finite Sample Analysis of AverageReward TD Learning and $Q$Learning
(Poster)
The focus of this paper is on sample complexity guarantees of averagereward reinforcement learning algorithms, which are known to be more challenging to study than their discountedreward counterparts. To the best of our knowledge, we provide the first known finite sample guarantees using both constant and diminishing step sizes of (i) averagereward TD($\lambda$) with linear function approximation for policy evaluation and (ii) averagereward $Q$learning in the tabular setting to find an optimal policy. A major challenge is that since the value functions are agnostic to an additive constant, the corresponding Bellman operators are no longer contraction mappings under any norm.
We obtain the results for TD($\lambda$) by working in an appropriately defined subspace that ensures uniqueness of the solution. For $Q$learning, we exploit the span seminorm contractive property of the Bellman operator and construct a novel Lyapunov function obtained by infimal convolution of the generalized Moreau envelope and the indicator function of a set.

Sheng Zhang · Zhe Zhang · Siva Maguluri 🔗 


Sample Complexity of Offline Reinforcement Learning with Deep ReLU Networks
(Poster)
We study the statistical theory of offline reinforcement learning (RL) with deep ReLU network function approximation. We analyze a variant of fittedQ iteration (FQI) algorithm under a new dynamic condition that we call Besov dynamic closure, which encompasses the conditions from prior analyses for deep neural network function approximation. Under Besov dynamic closure, we prove that the FQItype algorithm enjoys an improved sample complexity than the prior results. Importantly, our sample complexity is obtained under the new general dynamic condition and a datadependent structure where the latter is either ignored in prior algorithms or improperly handled by prior analyses. This is the first comprehensive analysis for offline RL with deep ReLU network function approximation under a general setting. 
Thanh NguyenTang · Sunil Gupta · Hung TranThe · Svetha Venkatesh 🔗 


TripleQ: A ModelFree Algorithm for Constrained Reinforcement Learning with Sublinear Regret and Zero Constraint Violation
(Poster)
This paper presents the first {\em modelfree}, {\em simulatorfree} reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation. The algorithm is named {\em TripleQ} because it has three key components: a Qfunction (also called actionvalue function) for the cumulative reward, a Qfunction for the cumulative utility for the constraint, and a virtualQueue that (over)estimates the cumulative constraint violation. Under TripleQ, at each step, an action is chosen based on the pseudoQvalue that is a combination of the three ``Q'' values. The algorithm updates the reward and utility Qvalues with learning rates that depend on the visit counts to the corresponding (state, action) pairs and are periodically reset. In the episodic CMDP setting, TripleQ achieves $\tilde{\cal O}\left(\frac{1 }{\delta}H^4 S^{\frac{1}{2}}A^{\frac{1}{2}}K^{\frac{4}{5}} \right)$ regret, where $K$ is the total number of episodes, $H$ is the number of steps in each episode, $S$ is the number of states, $A$ is the number of actions, and $\delta$ is Slater's constant. Furthermore, {TripleQ} guarantees zero constraint violation when $K$ is sufficiently large. Finally, the computational complexity of {TripleQ} is similar to SARSA for unconstrained MDPs, and is computationally efficient.

Honghao Wei · Xin Liu · Lei Ying 🔗 


Subgaussian Importance Sampling for OffPolicy Evaluation and Learning
(Poster)
Importance Sampling (IS) is a widely used building block for a large variety of offpolicy estimation and learning algorithms. However, empirical and theoretical studies have progressively shown that vanilla IS leads to poor estimations whenever the behavioral and target policies are too dissimilar. In this paper, we analyze the theoretical properties of the IS estimator by deriving a probabilistic deviation lower bound that formalizes the intuition behind its undesired behavior. Then, we propose a class of IS transformations, based on the notion of power mean, that are able, under certain circumstances, to achieve a subgaussian concentration rate. Differently from existing methods, like weight truncation, our estimator preserves the differentiability in the target distribution. 
Alberto Maria Metelli · Alessio Russo · Marcello Restelli 🔗 


Minimax Regret for Stochastic Shortest Path
(Poster)
We study the Stochastic Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost. In the learning formulation of the problem, the agent has no prior knowledge about the costs and dynamics of the model. She repeatedly interacts with the model for $K$ episodes, and has to minimize her regret. In this work we show that the minimax regret for this setting is $\widetilde O(\sqrt{ (B_\star^2 + B_\star) S A K})$ where $B_\star$ is a bound on the expected cost of the optimal policy from any state, $S$ is the state space, and $A$ is the action space. This matches the $\Omega (\sqrt{ B_\star^2 S A K})$ lower bound of Rosenberg et al. (2020) for $B_\star \ge 1$, and improves their regret bound by a factor of $\sqrt{S}$. For $B_\star < 1$ we prove a matching lower bound of $\Omega (\sqrt{ B_\star S A K})$. Our algorithm is based on a novel reduction from SSP to finitehorizon MDPs. To that end, we provide an algorithm for the finitehorizon setting whose leading term in the regret depends polynomially on the expected cost of the optimal policy and only logarithmically on the horizon.

Alon Cohen · Yonathan Efroni · Yishay Mansour · Aviv Rosenberg 🔗 


Collision Resolution in Multiplayer Bandits Without Observing Collision Information
(Poster)
The absence of collision information in Multiplayer Multiarmed bandits (MMABs) renders arm availabilities partially observable, impeding the design of algorithms with regret guarantees that do not allow interplayer communication. In this work, we propose a collision resolution (CR) mechanism for MMABs inspired from sequential interference mechanisms employed in communication protocols. In the general case, our collision resolution mechanism assumes that players can pull multiple arms during the exploration phase. We, thus, propose a novel MMAB model that captures this while still considering strictly bandit feedback and singlepulls during the exploitation phase. We theoretically analyze the CR mechanism using tools from information theory in order to prove the existence of an upper bound on the probability of its failure that decreases at a rate exponential in the number of players. 
Eleni Nisioti · Nikolaos Thomos · Boris Bellalta · Anders Jonsson 🔗 


Marginalized Operators for OffPolicy Reinforcement Learning
(Poster)
In this work, we propose marginalized operators, a new class of offpolicy evaluation operators for reinforcement learning. Marginalized operators strictly generalize generic multistep operators, such as Retrace, as special cases. Marginalized operators also suggest a form of samplebased estimates with potential variance reduction, compared to samplebased estimates of the original multistep operators. We show that the estimates for marginalized operators can be computed in a scalable way, which also generalizes prior results on marginalized importance sampling as special cases. 
Yunhao Tang · Mark Rowland · Remi Munos · Michal Valko 🔗 


On Overconservatism in Offline Reinforcement Learning
(Poster)
Prior successes in offline learning are highlighted by its adaptability to novel scenarios. One of the key reasons behind this aspect is conservatism, the act of underestimating an agent’s expected value estimates. Recent work, on the other hand, has noted that overconservatism often cripples learning of meaningful behaviors. To that end, the paper asks the question when does overconservatism hurt offline learning? The proposed answer understands conservatism in light of conjugate space and empirical instabilities. In the case of former, agents implicitly aim at learning complex high entropic distributions. As for the latter, overconservatism arises as a consequence of provably inaccurate approximations. Based on theoretical evidence, we address overconservatism through the lens of dynamic control. A feedback controller tunes the learned value estimates by virtue of direct dynamics in the compact latent space. We validate our theoretical insights in an empirical study of aerial control tasks. 
Karush Suri · Florian Shkurti 🔗 


Nonstationary Reinforcement Learning with Linear Function Approximation
(Poster)
We consider reinforcement learning (RL) in episodic Markov decision processes (MDPs) with linear function approximation under drifting environment. Specifically, both the reward and state transition functions can evolve over time, constrained so their total variations do not exceed a given \textit{variation budget}. We first develop $\texttt{LSVIUCBRestart}$ algorithm, an optimistic modification of leastsquares value iteration combined with periodic restart, and bound its dynamic regret when variation budgets are known. We then propose a parameterfree algorithm that works without knowing the variation budgets, \texttt{AdaLSVIUCBRestart}, but with a slightly worse dynamic regret bound. We also derive the first minimax dynamic regret lower bound for nonstationary linear MDPs to show our proposed algorithms are nearoptimal. As a byproduct, we establish a minimax regret lower bound for linear MDPs, which was unsolved by~\citet{jin2020provably}.

Huozhi Zhou · Jinglin Chen · Lav Varshney · Ashish Jagmohan 🔗 


FiniteSample Analysis of OffPolicy TDLearning via Generalized Bellman Operators
(Poster)
Policy evaluation (including multistep offpolicy importance sampling) has the interpretation of solving a generalized Bellman equation. In this paper, we derive finitesample bounds for any general offpolicy TDlike stochastic approximation algorithm that solves for the fixed point of this generalized Bellman operator. Our key step is to show that the generalized Bellman operator is simultaneously a contraction mapping with respect to a weighted $\ell_p$norm for each $p$ in $[1,\infty)$, with a common contraction factor. Our results immediately imply finitesample bounds of variants of offpolicy TDlearning algorithms in the literature (e.g. $Q^\pi(\lambda)$, TreeBackup, Retrace, and $Q$trace).

Zaiwei Chen · Siva Maguluri · Sanjay Shakkottai · Karthikeyan Shanmugam 🔗 


DerivativeFree Policy Optimization for Linear RiskSensitive and Robust Control Design: Implicit Regularization and Sample Complexity
(Poster)
Policybased modelfree reinforcement learning (RL) methods have shown great promise for continuous control applications. However, their performances on risksensitive/robust control tasks have not been fully understood, which has been generally considered to be one important open problem in the seminal work (Fazel et al., 2018). We make a step toward addressing this open problem, by providing the first sample complexity results for policy gradient (PG) methods in two fundamental risksensitive/robust control settings: the linear exponential quadratic Gaussian, and the linearquadratic (LQ) disturbance attenuation problems. The optimization landscapes for these problems are by nature more challenging than that of the LQ regulator problem, due to lack of coercivity of their objective functions. To overcome this challenge, we obtain the first \emph{implicit regularization} results for modelfree PG methods, certifying that the controller \emph{remains robust} during the learning process, which further lead to the sample complexity guarantees. As a byproduct, our results also provide the first sample complexity of PG methods in twoplayer zerosum LQ dynamic games, a baseline in multiagent RL. 
Kaiqing Zhang · Xiangyuan Zhang · Bin Hu · Tamer Basar 🔗 


When Is Generalizable Reinforcement Learning Tractable?
(Poster)
Agents trained by reinforcement learning (RL) often fail to generalize beyond the environment they were trained in, even when presented with new scenarios that seem similar to the training environment. We study the query complexity required to train RL agents that generalize to multiple environments. Intuitively, tractable generalization is only possible when the environments are similar or close in some sense. To capture this, we introduce Weak Proximity, a natural structural condition that requires the environments to have highly similar transition and reward functions and share a policy providing optimal value. Despite such shared structure, we prove that tractable generalization is impossible in the worst case. This holds even when each individual environment can be efficiently solved to obtain an optimal linear policy, and when the agent possesses a generative model. Our lower bound applies to the more complex task of representation learning for efficient generalization to multiple environments. On the positive side, we introduce Strong Proximity, a strengthened condition which we prove is sufficient for efficient generalization. 
Dhruv Malik · Yuanzhi Li · Pradeep Ravikumar 🔗 


FiniteSample Analysis of OffPolicy Natural ActorCritic With Linear Function Approximation
(Poster)
In this paper, we develop a novel variant of offpolicy natural actorcritic algorithm with linear function approximation and we establish a sample complexity of $\mathcal{O}(\epsilon^{3})$, outperforming all the previously known convergence bounds of such algorithms. In order to overcome the divergence due to deadly triad in offpolicy policy evaluation under function approximation, we develop a critic that employs $n$step TDlearning algorithm with a properly chosen $n$. We derive our sample complexity bounds solely based on the assumption that the behavior policy sufficiently explores all the states and actions, which is a much lighter assumption compared to the related literature.

Zaiwei Chen · sajad khodadadian · Siva Maguluri 🔗 


The Importance of NonMarkovianity in Maximum State Entropy Exploration
(Poster)
In the maximum state entropy exploration framework, an agent interacts with a rewardfree environment to learn a policy that maximizes the entropy of the expected state visitations it is inducing. Hazan et al. (2019) noted that the class of Markovian stochastic policies is sufficient for the maximum state entropy objective, and exploiting nonMarkovianity is generally considered pointless in this setting. In this paper, we argue that nonMarkovianity is instead paramount for maximum state entropy exploration in a finitesample regime. Especially, we recast the objective to target the expected entropy of the induced state visitations in a single trial. Then, we show that the class of nonMarkovian deterministic policies is sufficient for the introduced objective, while Markovian policies suffer nonzero regret in general. However, we prove that the problem of finding an optimal nonMarkovian policy is NPhard. Despite this negative result, we discuss avenues to address the problem in a tractable way in future works. 
Mirco Mutti · Riccardo De Santi · Marcello Restelli 🔗 


Global Convergence of MultiAgent Policy Gradient in Markov Potential Games
(Poster)
Potential games are arguably one of the most important and widely studied classes of normal form games. They define the archetypal setting of multiagent coordination as all agent utilities are perfectly aligned with each other via a common potential function. Can this intuitive framework be transplanted in the setting of Markov Games? What are the similarities and differences between multiagent coordination with and without state dependence? We present a novel definition of Markov Potential Games (MPG) that generalizes prior attempts at capturing complex stateful multiagent coordination. Counterintuitively, insights from normalform potential games do not carry over as MPGs can consist of settings where stategames can be zerosum games. In the opposite direction, Markov games where every stategame is a potential game are not necessarily MPGs. Nevertheless, MPGs showcase standard desirable properties such as the existence of deterministic Nash policies. In our main technical result, we prove fast convergence of independent policy gradient to Nash policies by adapting recent gradient dominance property arguments developed for single agent MDPs to multiagent learning settings. 
Stefanos Leonardos · Will Overman · Ioannis Panageas · Georgios Piliouras 🔗 


Efficient Inverse Reinforcement Learning of Transferable Rewards
(Poster)
The reward function is widely accepted as a succinct, robust, and transferable representation of a task. Typical approaches, at the basis of Inverse Reinforcement Learning (IRL), leverage expert demonstrations to recover a reward function. In this paper, we study the theoretical properties of the class of reward functions that are compatible with the expert's behavior. We analyze how the limited knowledge of the expert's policy and the environment affects the reward reconstruction phase. Then, we examine how the error propagates to the learned policy's performance when transferring the reward function to a different environment. We employ these findings to devise a provably efficient active sampling approach, aware of the need for transferring the reward function that can be paired with a large variety of IRL algorithms. 
Giorgia Ramponi · Alberto Maria Metelli · Marcello Restelli 🔗 


Learning to Observe with Reinforcement Learning
(Poster)
We consider a reinforcement learning (RL) framework where the RL agent learns to adjust the accuracy of the observations alongside learning to perform the original task. We are interested in revealing the information structure of the observation space illustrating which type of observations are the most important (such as position versus velocity) and the dependence of this on the state of agent (such as at the bottom versus top of a hill). We approach this problem by associating a cost with collecting observations which increases with the accuracy. In contrast to the existing work that mostly focuses on sample efficiency during training, our focus is on the behaviour of the agent during the actual task. Our results quantify how the RL agent can learn to use the observation space efficiently and obtain satisfactory performance in the original task while collecting effectively smaller amount of data. 
Mehmet Koseoglu · Ece Kunduracioglu · Ayca Ozcelikkale 🔗 


Sample Efficient Reinforcement Learning In Continuous State Spaces: A Perspective Beyond Linearity
(Poster)
Reinforcement learning (RL) is empirically successful in complex nonlinear Markov decision processes (MDPs) with continuous state spaces. By contrast, the majority of theoretical RL literature requires the MDP to satisfy some form of linear structure, in order to guarantee sample efficient RL. Such efforts typically assume the transition dynamics or value function of the MDP are described by linear functions of the state features. To resolve this discrepancy between theory and practice, we introduce the Effective Planning Window (EPW) condition, a structural condition on MDPs that makes \emph{no} linearity assumptions. We demonstrate that the EPW condition permits sample efficient RL, by providing an algorithm which provably solves MDPs satisfying this condition. Our algorithm requires minimal assumptions on the policy class, which can include multilayer neural networks with nonlinear activation functions. Notably, the EPW condition is directly motivated by popular gaming benchmarks, and we show that many classic Atari games satisfy this condition. We additionally show the necessity of conditions like EPW, by demonstrating that simple MDPs with slight nonlinearities cannot be solved sample efficiently. 
Dhruv Malik · Aldo Pacchiano · Vishwak Srinivasan · Yuanzhi Li 🔗 


Bagged Critic for Continuous Control
(Poster)
Actorcritic methods have been successfully applied to several high dimensional continuous control tasks. Despite their success, they are prone to overestimation bias that leads to suboptimal policies and divergent behaviour. Algorithms like TD3 and Soft Actor Critic (SAC) address overestimation bias by employing twin Q functions and optimizing the policy with respect to the lower bound of the two Q functions. Although this resolves the issue of overestimation bias, it inadvertently introduces underestimation bias. Underestimation bias, though not as problematic as overestimation bias, affects the asymptotic performance of the algorithms. To address both overestimation and underestimation bias in actor critic methods, we propose Bagged Critic for Continuous Control (BC3). BC3 uses an ensemble of independently trained Q functions as critic to address estimation biases. We present theoretical bounds on the biases and convergence analysis of our method demonstrating its benefits. The empirical results on several challenging reinforcement learning benchmarks substantiate our theoretical analysis and demonstrate reduction in biases with overall more robust policies. 
Payal Bawa 🔗 


Reinforcement Learning in Linear MDPs: Constant Regret and Representation Selection
(Poster)
We study the role of the representation in finitehorizon Markov Decision Processes (MDPs) with linear structure. We provide a necessary condition for achieving constant regret in any MDP with linear reward representation (even with known dynamics). This result encompasses the wellknown scenario of lowrank MDPs and, more generally, zero inherent Bellman error. We demonstrate that this condition is not only necessary but also sufficient for these classes, by deriving a constant regret bound for two optimistic algorithms. As far as we know, this is the first constant regret result for MDPs. Finally, we study the problem of representation selection showing that our proposed algorithm achieves constant regret when one of the given representations is "good". Furthermore, our algorithm can combine representations and achieve constant regret also when none of the representations would. 
Matteo Papini · Andrea Tirinzoni · Aldo Pacchiano · Marcello Restelli · Alessandro Lazaric · Matteo Pirotta 🔗 


A Fully ProblemDependent Regret Lower Bound for FiniteHorizon MDPs
(Poster)
We derive a novel asymptotic problemdependent lowerbound for regret minimization in finitehorizon tabular Markov Decision Processes (MDPs). While, similar to prior work (e.g., for ergodic MDPs), the lowerbound is the solution to an optimization problem, our derivation reveals the need for an additional constraint on the visitation distribution over stateaction pairs that explicitly accounts for the dynamics of the MDP. We provide a characterization of our lowerbound through a series of examples illustrating how different MDPs may have significantly different complexity. 1) We first consider a ``difficult'' MDP instance, where the novel constraint based on the dynamics leads to a larger lowerbound (i.e., a larger regret) compared to the classical analysis. 2) We then show that our lowerbound recovers results previously derived for specific MDP instances. 3) Finally, we show that, in certain ``simple'' MDPs, the lower bound is considerably smaller than in the general case and it does not scale with the minimum action gap at all. We show that this last result is attainable (up to $poly(H)$ terms) by providing a regret upperbound based on policy gap for an optimistic algorithm.

Andrea Tirinzoni · Matteo Pirotta · Alessandro Lazaric 🔗 


Optimal and instancedependent oracle inequalities for policy evaluation
(Poster)
Linear fixed point equations in Hilbert spaces naturally arise from the policy evaluation problem in reinforcement learning. We study methods that use a collection of random observations to compute approximate solutions by searching over a known lowdimensional subspace of the Hilbert space. First, we prove an instancedependent upper bound on the meansquared error for a linear stochastic approximation scheme that exploits PolyakRuppert averaging. This bound consists of two terms: an approximation error term with an instancedependent approximation factor, and a statistical error term that captures the instancespecific complexity of the noise when projected onto the lowdimensional subspace. Using informationtheoretic methods, we also establish lower bounds showing that the approximation factor cannot be improved, again in an instancedependent sense. A concrete consequence of our characterization is that the optimal approximation factor in this problem can be much larger than a universal constant. We show how our results precisely characterize the error of a class of temporal difference learning methods for the policy evaluation problem with linear function approximation, establishing their optimality. 
Wenlong Mou · Ashwin Pananjady · Martin Wainwright 🔗 


Optimistic Exploration with Backward Bootstrapped Bonus for Deep Reinforcement Learning
(Poster)
Optimism in the face of uncertainty is a principled approach for provably efficient exploration for reinforcement learning in tabular and linear settings. However, such an approach is challenging in developing practical exploration algorithms for Deep Reinforcement Learning (DRL). To address this problem, we propose an Optimistic Exploration algorithm with Backward Bootstrapped Bonus (OEB3) for DRL. We construct an UCBbonus indicating the uncertainty of Qfunctions. The UCBbonus is further utilized to estimate an optimistic Qvalue, which encourages the agent to explore the scarcely visited states and actions to reduce uncertainty. In the estimation of Qfunction, we adopt an episodic backward update strategy to propagate the future uncertainty to the estimated Qfunction consistently. Experiments show that OEB3 outperforms several stateoftheart exploration approaches 49 Atari games. 
Chenjia Bai · Lingxiao Wang · Lei Han · Jianye Hao · Animesh Garg · Peng Liu · Zhaoran Wang 🔗 


RewardWeighted Regression Converges to a Global Optimum
(Poster)
RewardWeighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the ExpectationMaximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a returnweighted loglikelihood of actions. Although RWR is known to yield monotonic improvement of the policy under certain circumstances, whether and under which conditions RWR converges to the optimal policy have remained open questions. In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used. 
Francesco Faccio · Rupesh Kumar Srivastava · Jürgen Schmidhuber 🔗 


Comparison and Unification of Three Regularization Methods in Batch Reinforcement Learning
(Poster)
In batch reinforcement learning, there can be poorly explored stateaction pairs resulting in poorly learned, inaccurate models and poorly performing associated policies. Various regularization methods can mitigate the problem of learning overlycomplex models in Markov decision processes (MDPs), however they operate in technically and intuitively distinct ways and lack a common form in which to compare them. This paper unifies three regularization methods in a common framework a weighted average transition matrix. Considering regularization methods in this common form illuminates how the MDP structure and the stateaction pair distribution of the batch data set influence the relative performance of regularization methods. We confirm intuitions generated from the common framework by empirical evaluation across a range of MDPs and data collection policies. 
Sarah Rathnam 🔗 


OracleEfficient Regret Minimization in Factored MDPs with Unknown Structure
(Poster)
We study regret minimization in nonepisodic factored Markov decision processes (FMDPs), where all existing algorithms make the strong assumption that the factored structure of the FMDP is known to the learner in advance. In this paper, we provide the first algorithm that learns the structure of the FMDP while minimizing the regret. Our algorithm is based on the optimism in face of uncertainty principle, combined with a simple statistical method for structure learning, and can be implemented efficiently given oracleaccess to an FMDP planner. Moreover, we give a variant of our algorithm that remains efficient even when the oracle is limited to nonfactored actions, which is the case with almost all existing approximate planners. Finally, we leverage our techniques to prove a novel lower bound for the known structure case, closing the gap to the regret bound of Chen et al. [2021]. 
Aviv Rosenberg · Yishay Mansour 🔗 


Learning Adversarial Markov Decision Processes with Delayed Feedback
(Poster)
Reinforcement learning typically assumes that the agent observes feedback for its actions immediately, but in many realworld applications (like recommendation systems) the feedback is observed in delay. In this paper, we study online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs and unrestricted delayed feedback. That is, the costs and trajectory of episode $k$ are revealed to the learner only in the end of episode $k + d^k$, where the delays $d^k$ are neither identical nor bounded, and are chosen by an oblivious adversary. We present novel algorithms based on policy optimization that achieve nearoptimal highprobability regret of $\sqrt{K + D}$ under fullinformation feedback, where $K$ is the number of episodes and $D = \sum_{k} d^k$ is the total delay. Under bandit feedback, we prove similar $\sqrt{K + D}$ regret assuming the costs are stochastic, and $(K + D)^{2/3}$ regret in the general case. We are the first to consider regret minimization in the important setting of MDPs with delayed feedback.

Tal Lancewicki · Aviv Rosenberg · Yishay Mansour 🔗 


Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability
(Poster)
Generalization is a central challenge for the deployment of reinforcement learning (RL) systems in the real world. In this paper, we show that the sequential structure of the RL problem necessitates new approaches to generalization beyond the wellstudied techniques used in supervised learning. While supervised learning methods can generalize effectively without explicitly accounting for epistemic uncertainty, we describe why appropriate uncertainty handling can actually be essential in RL. We show that generalization to unseen test conditions from a limited number of training conditions induces a kind of implicit partial observability, effectively turning even fullyobserved MDPs into POMDPs. Informed by this observation, we recast the problem of generalization in RL as solving the induced partially observed Markov decision process, which we call the epistemic POMDP. We demonstrate the failure modes of algorithms that do not appropriately handle this partial observability, and suggest a simple ensemblebased technique for approximately solving the partially observed problem. Empirically, we demonstrate that our simple algorithm derived from the epistemic POMDP achieves significant gains in generalization over current methods on the Procgen benchmark suite. 
Dibya Ghosh · Jad Rahme · Aviral Kumar · Amy Zhang · Ryan P. Adams · Sergey Levine 🔗 


Statistical Inference with MEstimators on Adaptively Collected Data
(Poster)
Bandit algorithms are increasingly used in realworld sequential decisionmaking problems. Associated with this is an increased desire to be able to use the resulting datasets to answer scientific questions like: Did one type of ad lead to more purchases? In which contexts is a mobile health intervention effective? However, classical statistical approaches fail to provide valid confidence intervals when used with data collected with bandit algorithms. Alternative methods have recently been developed for simple models (e.g., comparison of means). Yet there is a lack of general methods for conducting statistical inference using more complex models on data collected with (contextual) bandit algorithms; for example, current methods cannot be used for valid inference on parameters in a logistic regression model for a binary reward. In this work, we develop theory justifying the use of Mestimatorswhich includes estimators based on empirical risk minimization as well as maximum likelihoodon data collected with adaptive algorithms, including (contextual) bandit algorithms. Specifically, we show that Mestimators, modified with particular adaptive weights, can be used to construct asymptotically valid confidence regions for a variety of inferential targets. 
Kelly Zhang · Lucas Janson · Susan Murphy 🔗 


Randomized Least Squares Policy Optimization
(Poster)
Policy Optimization (PO) methods with function approximation are one of the most popular classes of Reinforcement Learning (RL) algorithms. However, designing provably efficient policy optimization algorithms remains a challenge. Recent work in this area has focused on incorporating upper confidence bound (UCB)style bonuses to drive exploration in policy optimization. In this paper, we present Randomized Least Squares Policy Optimization (RLSPO) which is inspired by Thompson Sampling. We prove that, in an episodic linear kernel MDP setting, RLSPO achieves $\Tilde{\mathcal{O}}(d^{3/2} H^{3/2} \sqrt{T})$ worstcase (frequentist) regret, where $H$ is the number of episodes, $T$ is the total number of steps and $d$ is the feature dimension. Finally, we evaluate RLSPO empirically and show that it is competitive with existing provably efficient PO algorithms.

Haque Ishfaq · Zhuoran Yang · Andrei Lupu · Viet Nguyen · Lewis Liu · Riashat Islam · Zhaoran Wang · Doina Precup 🔗 


GapDependent Unsupervised Exploration for Reinforcement Learning
(Poster)
For the problem of taskagnostic reinforcement learning (RL), an agent first collects samples from an unknown environment without the supervision of reward signals, then is revealed with a reward and is asked to compute a corresponding nearoptimal policy. Existing approaches mainly concern the worstcase scenarios, in which no structural information of the reward/transitiondynamics is utilized. Therefore the best sample upper bound is $\propto\widetilde{\mathcal{O}}(1/\epsilon^2)$, where $\epsilon>0$ is the target accuracy of the obtained policy, and can be overly pessimistic. To tackle this issue, we provide an efficient algorithm that utilizes a gap parameter, $\rho>0$, to reduce the amount of exploration. In particular, for an unknown finitehorizon Markov decision process, the algorithm takes only $\widetilde{\mathcal{O}} (1/\epsilon \cdot (H^3SA / \rho + H^4 S^2 A) )$ episodes of exploration, and is able to obtain an $\epsilon$optimal policy for a postrevealed reward with suboptimality gap at least $\rho$, where $S$ is the number of states, $A$ is the number of actions, and $H$ is the length of the horizon, obtaining a nearly \emph{quadratic saving} in terms of $\epsilon$. We show that, informationtheoretically, this bound is nearly tight for $\rho < \Theta(1/(HS))$ and $H>1$. We further show that $\propto\widetilde{\mathcal{O}}(1)$ sample bound is possible for $H=1$ (i.e., multiarmed bandit) or with a sampling simulator, establishing a stark separation between those settings and the RL setting.

Jingfeng Wu · Vladimir Braverman · Lin Yang 🔗 


Online Learning for Stochastic Shortest Path Model via Posterior Sampling
(Poster)
We consider the problem of online reinforcement learning for the Stochastic Shortest Path (SSP) problem modeled as an unknown MDP with an absorbing state. We propose PSRLSSP, a simple posterior samplingbased reinforcement learning algorithm for the SSP problem. The algorithm operates in epochs. At the beginning of each epoch, a sample is drawn from the posterior distribution on the unknown model dynamics, and the optimal policy with respect to the drawn sample is followed during that epoch. An epoch completes if either the number of visits to the goal state in the current epoch exceeds that of the previous epoch, or the number of visits to any of the stateaction pairs is doubled. We establish a Bayesian regret bound of $O(B_* S\sqrt{AK})$, where $B_*$ is an upper bound on the expected cost of the optimal policy, $S$ is the size of the state space, $A$ is the size of the action space, and $K$ is the number of episodes. The algorithm only requires the knowledge of the prior distribution, and has no hyperparameters to tune. It is the first such posterior sampling algorithm and outperforms numerically previously proposed optimismbased algorithms.

Mehdi Jafarnia · Liyu Chen · Rahul Jain · Haipeng Luo 🔗 


Provable Modelbased Nonlinear Bandit and Reinforcement Learning: Shelve Optimism, Embrace Virtual Curvature
(Poster)
This paper studies modelbased bandit and reinforcement learning (RL) with nonlinear function approximations. We propose to study convergence to approximate local maxima because we show that global convergence is statistically intractable even for onelayer neural net bandit with a deterministic reward. For both nonlinear bandit and RL, the paper presents a modelbased algorithm, Virtual Ascent with Online Model Learner (ViOlin), which provably converges to a local maximum with sample complexity that only depends on the sequential Rademacher complexity of the model class. Our bounds imply novel results on several concrete settings such as linear bandit with finite model class or sparse models, and twolayer neural net bandit. A key algorithmic insight is that optimism may lead to overexploration even for onelayer neural net model class. On the other hand, for convergence to local maxima, it suffices to maximize the virtual return if the model can also predict the size of the gradient and Hessian of the return. 
Kefan Dong · Jiaqi Yang · Tengyu Ma 🔗 


Linear Convergence of EntropyRegularized Natural Policy Gradient with Linear Function Approximation
(Poster)
Natural policy gradient (NPG) methods with function approximation achieve impressive empirical success in reinforcement learning problems with large stateaction spaces. However, theoretical understanding of their convergence behaviors remains limited in the function approximation setting. In this paper, we perform a finitetime analysis of NPG with linear function approximation and softmax parameterization, and prove for the first time that widely used entropy regularization method, which encourages exploration, leads to linear convergence rate. We adopt a Lyapunov drift analysis to prove the convergence results and explain the effectiveness of entropy regularization in improving the convergence rates. 
Semih Cayci · Niao He · R Srikant 🔗 


Decentralized QLearning in Zerosum Markov Games
(Poster)
We study multiagent reinforcement learning (MARL) in infinitehorizon discounted zerosum Markov games. We focus on the practical but challenging setting of decentralized MARL, where agents make decisions without coordination by a centralized controller, but only based on their own payoffs and local actions executed. The agents need not observe the opponent’s actions or payoffs, possibly being even oblivious to the presence of the opponent, nor be aware of the zerosum structure of the underlying game, a setting also referred to as radically uncoupled in the literature of learning in games. In this paper, we develop for the first time a radically uncoupled Qlearning dynamics that is both rational and convergent: the learning dynamics converges to the best response to the opponent’s strategy when the opponent follows an asymptotically stationary strategy; the value function estimates converge to the payoffs at a Nash equilibrium when both agents adopt the dynamics. The key challenge in this decentralized setting is the nonstationarity of the learning environment from an agent’s perspective, since both her own payoffs and the system evolution depend on the actions of other agents, and each agent adapts their policies simultaneously and independently. To address this issue, we develop a twotimescale learning dynamics where each agent updates her local Qfunction and value function estimates concurrently, with the latter happening at a slower timescale. 
Kaiqing Zhang · David Leslie · Tamer Basar · Asuman Ozdaglar 🔗 


Modelbased Offline Reinforcement Learning with Local Misspecification
(Poster)
In this paper we propose a modelbased offline reinforcement learning algorithm that explicitly handles model misspecification and distribution mismatch. Theoretically, we prove a safe policy improvement theorem by establishing pessimism approximations to the value function. Our algorithm can output the best policy in the given policy class with interpretable error terms measuring misspecification level, distribution mismatch, and statistical deviation. In addition, as long as the model family can approximate the transitions of stateaction pairs visits by a policy, we can approximate the value of that particular policy. We visualize the effect of error terms in the LQR setting, and show that the experiment results match our theory. 
Kefan Dong · Ramtin Keramati · Emma Brunskill 🔗 


Nearly Minimax Optimal Regret for Learning Infinitehorizon Averagereward MDPs with Linear Function Approximation
(Poster)
We study reinforcement learning in an infinitehorizon averagereward setting with linear function approximation, where the transition probability function of the underlying Markov Decision Process (MDP) admits a linear form over a feature mapping of the current state, action and next state. We propose a new algorithm UCRL2VTR, which can been seen as an extension of UCRL2 algorithm with linear function approximation. We show that UCRL2VTR with Bernsteintype bonus can achieve a regret of $\tilde{O}(d\sqrt{DT})$, where $d$ is the dimension of the feature mapping, $T$ is the horizon and $D$ is the diameter of the MDP. We also prove a matching lower bound $\tilde{\Omega}(d\sqrt{DT})$, which suggests that the proposed UCRL2VTR is minimax optimal up to logarithmic factors. To the best of our knowledge, our algorithm is the first nearly minimax optimal RL algorithm with function approximation in the infinitehorizon averagereward setting.

Yue Wu · Dongruo Zhou · Quanquan Gu 🔗 


Learning from an Exploring Demonstrator: Optimal Reward Estimation for Bandits
(Poster)
We introduce the ``inverse bandit'' problem of estimating the rewards of a multiarmed bandit instance from observing the learning process of a lowregret demonstrator. Existing approaches to the related problem of inverse reinforcement learning assume the execution of an optimal policy, and thereby suffer from an identifiability issue. In contrast, our paradigm leverages the demonstrator's behavior en route to optimality, and in particular, the exploration phase, to obtain consistent reward estimates. We develop simple and efficient reward estimation procedures for demonstrations within a class of upperconfidencebased algorithms, showing that reward estimation gets progressively easier as the regret of the algorithm increases. We match these upper bounds with informationtheoretic lower bounds that apply to any demonstrator algorithm, thereby characterizing the optimal tradeoff between exploration and reward estimation. Extensive simulations on both synthetic and semisynthetic data corroborate our theoretical results. 
Wenshuo Guo · Kumar Agrawal · Aditya Grover · Vidya Muthukumar · Ashwin Pananjady 🔗 


ModelFree Approach to Evaluate Reinforcement Learning Algorithms
(Poster)
Policy evaluation is an important instrument for the comparison of different algorithms in Reinforcement Learning (RL). Yet even a precise knowledge of the value function $V^{\pi}$ corresponding to a policy $\pi$ does not provide reliable information on how far is the policy $\pi$ from the optimal one. We present a novel modelfree upper value iteration procedure ({\sf UVIP}) that allows us to estimate the suboptimality gap $V^{\star}(x)  V^{\pi}(x)$ from above and to construct confidence intervals for $V^\star$. Our approach relies on upper bounds to the solution of the Bellman optimality equation via martingale approach. We provide theoretical guarantees for {\sf UVIP} under general assumptions and illustrate its performance on a number of benchmark RL problems.

Denis Belomestny · Ilya Levin · Eric Moulines · Alexey Naumov · Sergey Samsonov · Veronika Zorina 🔗 


Provable RL with Exogenous Distractors via Multistep Inverse Dynamics
(Poster)
Many realworld applications of reinforcement learning (RL) require the agent to deal with highdimensional observations such as those generated from a megapixel camera. Prior work has addressed such problems with representation learning, through which the agent can provably extract endogenous, latent state information from raw observations and subsequently plan efficiently. However, such approaches can fail in the presence of structured noise that is temporally correlated, a phenomenon that is common in practice. We initiate the formal study of latent state discovery in the presence of such exogenous noise sources by proposing a new model, the Exogenous Block MDP, for rich observation RL. We establish structural results and provide new sample efficiency guarantees for learning in these models. For the latter, we show that the new DPCID variant of PCID, previously studied by~\citet{du2019provably}, learns a generalization of inverse dynamics and is provably sample and computationally efficient in Exogenous Block MDPs when the endogenous state dynamics are near deterministic. The sample complexity of DPCID depends polynomially on the size of a latent endogenous state space while not directly depending on the size of the observation space. We provide experiments on challenging exploration problems and show the promise of our approach. 
Yonathan Efroni · Dipendra Misra · Akshay Krishnamurthy · Alekh Agarwal · John Langford 🔗 


Learning ParetoOptimal Policies in LowRank Cooperative Markov Games
(Poster)
We study cooperative multiagent reinforcement learning in episodic Markov games with $n$ agents. While the global state and action spaces typically grow exponentially in $n$ in this setting, in this paper, we introduce a novel framework that combines function approximation and a graphical dependence structure that restricts the decision space to $o(dn)$ for each agent, where $d$ is the ambient dimensionality of the problem. We present a multiagent value iteration algorithm that, under mild assumptions, provably recovers the set of Paretooptimal policies, with finitesample guarantees on the incurred regret. Furthermore, we demonstrate that our algorithm is {\em noregret} even when there are only $\cO(1)$ episodes with communication, providing a scalable and provably noregret algorithm for multiagent reinforcement learning with function approximation. Our work provides a tractable approach to multiagent decisionmaking that is provably efficient and amenable to largescale collaborative systems.

Abhimanyu Dubey · Alex `Sandy' Pentland 🔗 


Optimal Uniform OPE and Modelbased Offline Reinforcement Learning in TimeHomogeneous, RewardFree and TaskAgnostic Settings
(Poster)
This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with modelbased methods (for episodic MDP) and provides a unified framework towards optimal learning for several wellmotivated offline tasks. We establish an $\Omega(H^2 S/d_m\epsilon^2)$ lower bound (over modelbased family) for the global uniform OPE and our main result establishes an upper bound of $\tilde{O}(H^2/d_m\epsilon^2)$ for the \emph{local} uniform convergence. The highlight in achieving the optimal rate $\tilde{O}(H^2/d_m\epsilon^2)$ is our design of \emph{singleton absorbing MDP}, which is a new sharp analysis tool that works with the modelbased approach. We generalize such a modelbased framework to the new settings: offline taskagnostic and the offline rewardfree with optimal complexity $\tilde{O}(H^2\log(K)/d_m\epsilon^2)$ ($K$ is the number of tasks) and $\tilde{O}(H^2S/d_m\epsilon^2)$ respectively. These results provide a unified solution for simultaneously solving different offline RL problems.

Ming Yin · YuXiang Wang 🔗 


Bridging The Gap between Local and Joint Differential Privacy in RL
(Poster)
In this paper, we study privacy in the context of finitehorizon Markov Decision Processes. Two notions of privacy have been investigated in this setting: joint differential privacy (JDP) and local differential privacy (LDP). We show that it is possible to achieve a smooth transition in terms of privacy and regret (i.e., utility) between JDP and LDP. By leveraging shuffling techniques, we present an algorithm that, depending on the provided parameter, is able to attain any privacy/utility value in between the pure JDP and LDP guarantee. 
Evrard Garcelon · Vianney Perchet · Ciara PikeBurke · Matteo Pirotta 🔗 


NearOptimal Offline Reinforcement Learning via Double Variance Reduction
(Poster)
We consider the problem of offline reinforcement learning (RL)  a wellmotivated setting of RL that aims at policy optimization using only historical data. In this paper, we propose \emph{OffPolicy Double Variance Reduction} (OPDVR), a new variance reduction based algorithm for offline RL. Our main result shows that OPDVR provably identifies an $\epsilon$optimal policy with $\widetilde{O}(H^2/d_m\epsilon^2)$ episodes of offline data in the finitehorizon \emph{stationary transition} setting and this improves over the best known upper bound by a factor of $H$. Moreover, we establish an informationtheoretic lower bound of $\Omega(H^2/d_m\epsilon^2)$ which certifies that OPDVR is optimal up to logarithmic factors.
Lastly, we show that OPDVR also achieves rateoptimal sample complexity under alternative settings such as the finitehorizon MDPs with nonstationary transitions and the infinite horizon MDPs with discounted rewards.

Ming Yin · Yu Bai · YuXiang Wang 🔗 


Mixture of Step Returns in Bootstrapped DQN
(Poster)
The concept of utilizing multistep returns for updating value functions has long been adopted in the reinforcement learning domain. Conventional methods such as TDlambda further extend this concept and use a single target value equivalent to an exponential average of different step returns. Nevertheless, different backup lengths provide diverse advantages in terms of bias and variance of value estimates, convergence speeds, and learning behaviors of the agent. Integrating step returns into a single target sacrifices the advantages offered by different step return targets. In order to address this issue, we propose Mixture Bootstrapped DQN (MBDQN) and employ different backup lengths for different bootstrapped heads. MBDQN enables diversity of the target values that is unavailable in approaches relying only on a single target value. In this paper, we first comprehensively highlight the motivational insights through a simple maze environment. Then, in order to validate the effectiveness of MBDQN, we perform experiments on various Atari 2600 benchmark environments, and demonstrate the performance improvement of MBDQN over the baseline methods. Finally, we provide ablation analyses to verify MBDQN in a set of analytical cases. 
PoHan Chiang · HsuanKung Yang · ZhangWei Hong · ChunYi Lee 🔗 


Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation
(Poster)
Learning Markov decision processes (MDPs) in the presence of the adversary is a challenging problem in reinforcement learning (RL). In this paper, we study RL in episodic MDPs with adversarial reward and full information feedback, where the unknown transition probability function is a linear function of a given feature mapping, and the reward function can change arbitrarily episode by episode. We propose an optimistic policy optimization algorithm POWER and show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode, $T$ is the number of interaction with the MDP, and $d$ is the dimension of the feature mapping. Furthermore, we also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors. Our key technical contributions are twofold: (1) a new value function estimator based on importance weighting; and (2) a tighter confidence set for the transition kernel. They together lead to the nearly minimax optimal regret.

Jiafan He · Dongruo Zhou · Quanquan Gu 🔗 


Provably efficient explorationfree transfer RL for neardeterministic latent dynamics
(Poster)
Sample complexity and robustness are critical for applying reinforcement learning (RL) algorithms in realworld applications. We study the sample saving opportunities via transferring experience when the source domain is implemented by a simulator. Many realworld domains are well approximated by 
Yao Liu · Dipendra Misra · Miroslav Dudik · Robert Schapire 🔗 


Stochastic Shortest Path: Minimax, ParameterFree and Towards HorizonFree Regret
(Poster)
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel modelbased algorithm EBSSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to induce an optimistic SSP problem whose associated value iteration scheme is guaranteed to converge. We prove that EBSSP achieves the minimax regret rate $\widetilde{O}(B_{\star} \sqrt{S A K})$, where $K$ is the number of episodes, $S$ is the number of states, $A$ is the number of actions, and $B_{\star}$ bounds the expected cumulative cost of the optimal policy from any state, thus closing the gap with the lower bound. Interestingly, EBSSP obtains this result while being parameterfree, i.e., it does not require any prior knowledge of $B_{\star}$, nor of $T_{\star}$, which bounds the expected timetogoal of the optimal policy from any state. We also show various cases (e.g., positive costs, or general costs when an orderaccurate estimate of $T_{\star}$ is available) where the regret only contains a logarithmic dependence on $T_{\star}$, thus yielding the first (nearly) horizonfree regret bound beyond the finitehorizon MDP setting.

Jean Tarbouriech · Jean Tarbouriech · Simon Du · Matteo Pirotta · Michal Valko · Alessandro Lazaric 🔗 


A Spectral Approach to OffPolicy Evaluation for POMDPs
(Poster)
We consider the offpolicy evaluation problem in POMDPs. Prior work on this problem uses a causal identification strategy based on onestep observable proxies of the hidden state (Tennenholtz et al., 2020a). In this work, we relax the assumptions made in the prior work by using spectral methods. We further relax these assumptions by extending onestep proxies into the past. Finally, we derive an importance sampling algorithm which assumes rank, distinctness, and positivity conditions on certain probability matrices, and not on sufficiency conditions of observable trajectories with respect to the reward and hidden state structure required in the prior work. 
Yash Nair · Nan Jiang 🔗 


Mind the Gap: Safely Bridging Offline and Online Reinforcement Learning
(Poster)
A key challenge to deploying reinforcement learning in practice is exploring safely. We propose a natural safety propertyuniformly outperforming a conservative policy (adaptively estimated from all data observed thus far), up to a perepisode exploration budget. This property formalizes the idea that we should spread out exploration to avoid taking actions significantly worse than the ones that are currently known to be good. We then design an algorithm that uses a UCB reinforcement learning policy for exploration, but overrides it as needed to ensure safety with high probability. To ensure exploration across the entire state space, it adaptively determines when to explore (at different points of time across different episodes) in a way that allows "stitching" subepisodes together to obtain a metaepisode that is equivalent to using UCB for the entire episode. Then, we establish reasonable assumptions about the underlying MDP under which our algorithm is guaranteed to achieve sublinear regret while ensuring safety; under these assumptions, the cost of imposing safety is only a constant factor. 
Wanqiao Xu · Kan Xu · Hamsa Bastani · Osbert Bastani 🔗 


Learning Nash Equilibria in ZeroSum Stochastic Games via EntropyRegularized Policy Approximation
(Poster)
We explore the use of policy approximations to reduce the computational cost of learning Nash equilibria in zerosum stochastic games. We propose a new Qlearning type algorithm that uses a sequence of entropyregularized soft policies to approximate the Nash policy during the Qfunction updates. We prove that under certain conditions, by updating the entropy regularization, the algorithm converges to a Nash equilibrium. Numerical simulations verify that the proposed algorithm converges to Nash with a major speedup over existing algorithms. 
Yue Guan · Qifan Zhang · Panagiotis Tsiotras 🔗 


Invariant Policy Learning: A Causal Perspective
(Poster)
In the past decade, contextual bandit and reinforcement learning algorithms have been successfully used in various interactive learning systems such as online advertising, recommender systems, and dynamic pricing. However, they have yet to be widely adopted in highstakes application domains, such as healthcare. One reason may be that existing approaches assume that the underlying mechanisms are static in the sense that they do not change over different environments. In many real world systems, however, the mechanisms are subject to shifts across environments which may invalidate the static environment assumption. In this paper, we tackle the problem of environmental shifts under the framework of offline contextual bandits. We view the environmental shift problem through the lens of causality and propose multienvironment contextual bandits that allow for changes in the underlying mechanisms. We adopt the concept of invariance from the causality literature and introduce the notion of policy invariance. We argue that policy invariance is only relevant if unobserved confounders are present and show that, in that case, an optimal invariant policy is guaranteed to generalize across environments under suitable assumptions. Our results do not only provide a solution to the environmental shift problem but also establish concrete connections among causality, invariance and contextual bandits. 
Sorawit Saengkyongam · Nikolaj Thams · Jonas Peters · Niklas Pfister 🔗 


A functional mirror ascent view of policy gradient methods with function approximation
(Poster)
We use functional mirror ascent to propose a general framework (referred to as FMAPG) for designing policy gradient methods. The functional perspective distinguishes between a policy's functional representation (what are its sufficient statistics) and its parameterization (how are these statistics represented), and naturally results in computationally efficient offpolicy updates. For simple policy parameterizations, the FMAPG framework ensures that the optimal policy is a fixed point of the updates. It also allows us to handle complex policy parameterizations (e.g., neural networks) while guaranteeing policy improvement. Our framework unifies several PG methods and opens the way for designing sampleefficient variants of existing methods. Moreover, it recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) in a principled way. With a softmax functional representation, FMAPG results in a variant of TRPO with additional desirable properties. It also suggests an improved variant of PPO, whose robustness we empirically demonstrate on MuJoCo. 
Sharan Vaswani · Olivier Bachem · Simone Totaro · Matthieu Geist · Marlos C. Machado · Pablo Samuel Castro · Nicolas Le Roux 🔗 


Policy Finetuning: Bridging SampleEfficient Offline and Online Reinforcement Learning
(Poster)
This paper initiates the theoretical study of \emph{policy finetuning} towards bridging the gap between (sampleefficient) online RL and offline RL. In this problem, the online RL learner has additional access to a ``reference policy'' $\mu$ close to the optimal policy $\pi_\star$ in a certain sense. We first design a sharp \emph{offline reduction} algorithmwhich simply executes $\mu$ and runs offline policy optimization on the collected datasetthat finds an $\varepsilon$ nearoptimal policy within $\widetilde{O}(H^3SC^\star/\varepsilon^2)$ episodes, where $C^\star$ is the singlepolicy concentrability coefficient between $\mu$ and $\pi_\star$. This offline result is the first that matches the sample complexity lower bound in this setting, and resolves a recent open question in offline RL. We then establish an $\Omega(H^3S\min\{C^\star, A\}/\varepsilon^2)$ sample complexity lower bound for \emph{any} policy finetuning algorithm, including those that can adaptively explore the environment. This implies thatperhaps surprisinglythe optimal policy finetuning algorithm is either offline reduction or a purely online RL algorithm that does not use $\mu$. Finally, we design a new hybrid offline/online algorithm for policy finetuning that achieves better sample complexity than both vanilla offline reduction and purely online RL algorithms, in a relaxed setting where $\mu$ only satisfies concentrability partially up to a certain time step. Overall, our results offer a quantitative understanding on the benefit of a good reference policy, and make a step towards bridging offline and online RL.

Tengyang Xie · Nan Jiang · Huan Wang · Caiming Xiong · Yu Bai 🔗 


Robust online control with model misspecification
(Poster)
We study online control of an unknown nonlinear dynamical system that is approximated by a timeinvariant linear system with model misspecification. Our study focuses on {\bf robustness}, which measures how much deviation from the assumed linear approximation can be tolerated while maintaining a bounded L2gain.
Some models cannot be stabilized even with perfect knowledge of their coefficients: the robustness is limited by the minimal distance between the assumed dynamics and the set of unstabilizable dynamics. Therefore it is necessary to assume a lower bound on this distance. Under this assumption, and with full observation of the $d$ dimensional state, we describe an efficient controller that attains $\Omega(\frac{1}{\sqrt{d}})$ robustness together with an L2gain whose dimension dependence is near optimal. We also give an inefficient algorithm that attains constant robustness independent of the dimension, with a finite but suboptimal L2gain.

Xinyi Chen · Udaya Ghai · Elad Hazan · Alexandre Megretsky 🔗 


Online SubSampling for Reinforcement Learning with General Function Approximation
(Poster)
Designing provably efficient algorithms with general function approximation is an important open problem in reinforcement learning. Recently, Wang et al.~[2020c] establish a valuebased algorithm with general function approximation that enjoys $\widetilde{O}(\mathrm{poly}(dH)\sqrt{K})$\footnote{Throughout the paper, we use $\widetilde{O}(\cdot)$ to suppress logarithm factors. } regret bound, where $d$ depends on the complexity of the function class, $H$ is the planning horizon, and $K$ is the total number of episodes. However, their algorithm requires $\Omega(K)$ computation time per round, rendering the algorithm inefficient for practical use. In this paper, by applying online subsampling techniques, we develop an algorithm that takes $\widetilde{O}(\mathrm{poly}(dH))$ computation time per round on average, and enjoys nearly the same regret bound. Furthermore, the algorithm achieves low switching cost, i.e., it changes the policy only $\widetilde{O}(\mathrm{poly}(dH))$ times during its execution, making it appealing to be implemented in reallife scenarios. Moreover, by using an upperconfidence based explorationdriven reward function, the algorithm provably explores the environment in the rewardfree setting. In particular, after $\widetilde{O}(\mathrm{poly}(dH))/\epsilon^2$ rounds of exploration, the algorithm outputs an $\epsilon$optimal policy for any given reward function.

Dingwen Kong · Ruslan Salakhutdinov · Ruosong Wang · Lin Yang 🔗 


Is Pessimism Provably Efficient for Offline RL?
(Poster)
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori. Due to the lack of further interactions with the environment, offline RL suffers from the insufficient coverage of the dataset, which eludes most existing theoretical analysis. In this paper, we propose a \underline{pe}ssimistic variant of the \underline{v}alue \underline{i}teration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function. Such a penalty function simply flips the sign of the bonus function for promoting exploration in online RL, which makes it easily implementable and compatible with general function approximators. Without assuming the sufficient coverage of the dataset, we establish a datadependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs). When specialized to linear MDPs, it matches the informationtheoretic lower bound up to multiplicative factors of the dimension and horizon. In other words, pessimism is not only provably efficient but also minimax optimal. In particular, given the dataset, the learned policy serves as the 
Ying Jin · Zhuoran Yang · Zhaoran Wang 🔗 


Topological Experience Replay for Fast QLearning
(Poster)
Stateoftheart deep \Qlearning methods update \Qvalues using state transition tuples sampled from the experience replay buffer. Often this strategy is to randomly sample or prioritize data sampling based on measures such as the temporal difference (TD) error. Such sampling strategies are agnostic to the structure of the Markov decision process (MDP) and can therefore be data inefficient at propagating reward signals from goal states to the initial state. To accelerate reward propagation, we make use of the MDP structure by organizing the agent's experience into a graph. Each edge in the graph represents a transition between two connected states. We perform value backups via a breadthfirst search that expands vertices in the graph starting from the set of terminal states successively moving backward. We empirically show that our method is substantially more dataefficient than several baselines on sparse reward tasks. 
ZhangWei Hong · Tao Chen · YenChen Lin · Joni Pajarinen · Pulkit Agrawal 🔗 


Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs
(Poster)
We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a modelbased algorithm named UCBVI$\gamma$, which is based on the \emph{optimism in the face of uncertainty principle} and the Bernsteintype bonus. We show that UCBVI$\gamma$ achieves an $\tilde{O}\big({\sqrt{SAT}}/{(1\gamma)^{1.5}}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tilde{\Omega}\big({\sqrt{SAT}}/{(1\gamma)^{1.5}}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI$\gamma$ is nearly minimax optimal for discounted MDPs.

Jiafan He · Dongruo Zhou · Quanquan Gu 🔗 


A general sample complexity analysis of vanilla policy gradient
(Poster)
The policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems. However, a solid theoretical understanding of even the ``vanilla'' PG has remained elusive for long time. In this paper, we apply recent tools developed for the analysis of SGD in nonconvex optimization to obtain convergence guarantees for both REINFORCE and GPOMDP under smoothness assumption on the objective function and weak conditions on the second moment of the norm of the estimated gradient. When instantiated under common assumptions on the policy space, our general result immediately recovers existing $O(\epsilon^{4})$ sample complexity guarantees, but for wider ranges of parameters (e.g., step size and batch size m) w.r.t. previous literature. Notably, our result includes the single trajectory case (i.e., m=1) and it provides a more accurate analysis of the dependency on problemspecific parameters by fixing previous results available in the literature. We believe that the integration of stateoftheart tools from nonconvex optimization may lead to identify a much broader range of problems where PG methods enjoy strong theoretical guarantees.

Rui Yuan · Robert Gower · Alessandro Lazaric 🔗 


The Power of Exploiter: Provable MultiAgent RL in Large State Spaces
(Poster)
Modern reinforcement learning (RL) commonly engages practical problems with large state spaces, where function approximation must be deployed to approximate either the value function or the policy. While recent progresses in RL theory address a rich set of RL problems with general function approximation, such successes are mostly restricted to the singleagent setting. It remains elusive how to extend these results to multiagent RL, especially due to the new challenges arising from its gametheoretical nature. This paper considers twoplayer zerosum Markov Games (MGs). We propose a new algorithm that can provably find the Nash equilibrium policy using a polynomial number of samples, for any MG with low multiagent BellmanEluder dimensiona new complexity measure adapted from its singleagent version (Jin et al., 2021). A key component of our new algorithm is the exploiter, which facilitates the learning of the main player by deliberately exploiting her weakness. Our theoretical framework is generic, which applies to a wide range of models including but not limited to tabular MGs, MGs with linear or kernel function approximation, and MGs with rich observations. 
Chi Jin · Qinghua Liu · Tiancheng Yu 🔗 


Bellman Eluder Dimension: New Rich Classes of RL Problems, and SampleEfficient Algorithms
(Poster)
Finding the minimal structural assumptions that empower sampleefficient learning is one of the most important research directions in Reinforcement Learning (RL). This paper advances our understanding of this fundamental question by introducing a new complexity measure—Bellman Eluder (BE) dimension. We show that the family of RL problems of low BE dimension is remarkably rich, which subsumes a vast majority of existing tractable RL problems including but not limited to tabular MDPs, linear MDPs, reactive POMDPs, low Bellman rank problems as well as low Eluder dimension problems. This paper further designs a new optimizationbased algorithm— GOLF, and reanalyzes a hypothesis eliminationbased algorithm—OLIVE (proposed in Jiang et al., 2017). We prove that both algorithms learn the nearoptimal policies of low BE dimension problems in a number of samples that is polynomial in all relevant parameters, but independent of the size of stateaction space. Our regret and sample complexity results match or improve the best existing results for several wellknown subclasses of low BE dimension problems. 
Chi Jin · Qinghua Liu · Sobhan Miryoosefi 🔗 


Estimating Optimal Policy Value in Linear Contextual Bandits beyond Gaussianity
(Poster)
While much of bandit learning focuses on minimizing regret while learning an optimal policy, it is often of interest to estimate the maximum value achievable before learning the optimal policy, which can be of use as an input to downstream tasks like model selection. Prior work in contextual bandits has considered this in the Gaussian setting. Here, we study the problem of approximating the optimal policy value in the more general linear contextual bandit problem, and we focus on whether it is possible to do so with less data than what is needed to learn the optimal policy. We consider two objectives: (1) estimating upper bounds on the value and (2) estimating the value directly. For the first, we present an adaptive upper bound that is at most logarithmic factor larger than the value and tight when the data is Gaussian and show that it is possible to estimate this upper bound in $\widetilde{\mathcal{O}}( \sqrt{d} )$ samples where $d$ is the number of parameters. As a consequence of this bound, we show improved regret bounds for model selection. For the second objective, we present a momentbased algorithm for estimating the optimal policy value with sample complexity $\widetilde{ \mathcal{O}}( \sqrt{d} )$ for subGaussian context distributions whose low order moments are known.

Jonathan Lee · Weihao Kong · Aldo Pacchiano · Vidya Muthukumar · Emma Brunskill 🔗 


A Short Note on the Relationship of Information Gain and Eluder Dimension
(Poster)
Eluder dimension and information gain are two widely used methods of complexity measures in bandit and reinforcement learning. Eluder dimension was originally proposed as a general complexity measure of function classes, but the common examples of where it is known to be small are function spaces (vector spaces). In these cases, the primary tool to upper bound the eluder dimension is the elliptic potential lemma. Interestingly, the elliptic potential lemma also features prominently in the analysis of linear bandits/reinforcement learning and their nonparametric generalization, the information gain. We show that this is not a coincidence  eluder dimension and information gain are equivalent in a precise sense for reproducing kernel Hilbert spaces. 
Kaixuan Huang · Sham Kakade · Jason Lee · Qi Lei 🔗 


Convergence and Optimality of Policy Gradient Methods in Weakly Smooth Settings
(Poster)
Policy gradient methods have been frequently applied to problems in control and reinforcement learning with great success, yet existing convergence analysis still relies on nonintuitive, impractical and often opaque conditions. In particular, existing rates are achieved in limited settings, under strict smoothness and ergodicity conditions. In this work, we establish explicit convergence rates of policy gradient methods without relying on these conditions, instead extending the convergence regime to weakly smooth policy classes with $L_2$ integrable gradient. We provide intuitive examples to illustrate the insight behind these new conditions. We also characterize the sufficiency conditions for the ergodicity of nearlinear MDPs, which represent an important class of problems. Notably, our analysis also shows that fast convergence rates are achievable for both the standard policy gradient and the natural policy gradient algorithms under these assumptions. Lastly we provide conditions and analysis for optimality of the converged policies.

Shunshi Zhang · Murat Erdogdu · Animesh Garg 🔗 


Almost Optimal Algorithms for Twoplayer Markov Games with Linear Function Approximation
(Poster)
We study reinforcement learning for twoplayer zerosum Markov games with simultaneous moves in the finitehorizon setting, where the transition kernel of the underlying Markov games can be parameterized by a linear function over the current state, both players' actions and the next state. In particular, we assume that we can control both players and aim to find the Nash Equilibrium by minimizing the duality gap. We propose an algorithm NashUCRL based on the principle ``OptimisminFaceofUncertainty''. Our algorithm only needs to find a Coarse Correlated Equilibrium (CCE), which is computationally very efficient. Specifically, we show that NashUCRL can provably achieve an $\tilde{O}(\sqrt{d^2 H^2 T})$ regret, where $d$ is the linear function dimension, $H$ is the length of the game and $T$ is the total number of steps in the game. To assess the optimality of our algorithm, we also prove an $\tilde{\Omega}( \sqrt{d^2 H^2 T})$ lower bound on the regret. Our upper bound matches the lower bound up to logarithmic factors, which suggests the optimality of our algorithm.

Zixiang Chen · Dongruo Zhou · Quanquan Gu 🔗 


Improved Estimator Selection for OffPolicy Evaluation
(Poster)
Offpolicy policy evaluation is a fundamental problem in reinforcement learning. As a result, many estimators with different tradeoffs have been developed, however, selecting the best estimator is challenging with limited data and without additional interactive data collection. Recently, Su et al. (2020) developed a datadependent selection procedure that competes with the oracle selection up to a constant and demonstrate its practicality. We refine the analysis to remove an extraneous assumption and improve the procedure. The improved procedure results in a tighter oracle bound and stronger empirical results on a contextual bandit task. 
George Tucker 🔗 


A Boosting Approach to Reinforcement Learning
(Poster)
We study efficient algorithms for reinforcement learning in Markov decision processes, whose complexity is independent of the number of states. This formulation succinctly captures large scale problems, but is also known to be computationally hard in its general form. We consider the methodology of boosting, borrowed from supervised learning, for converting weak learners into an accurate policy. The notion of weak learning we study is that of sampledbased approximate optimization of linear functions over policies. We give an efficient algorithm that is capable of improving the accuracy of such weak learning methods, till global optimality is reached. We prove sample complexity and running time bounds on our method, that are polynomial in the natural parameters of the problem: approximation guarantee, discount factor, distribution mismatch and number of actions. In particular, our bound does not explicitly depend on the number of states. A technical difficulty in applying previous boosting results, is that the value function over policy space is not convex. We show how to use a nonconvex variant of the FrankWolfe method, coupled with recent advances in gradient boosting that allow incorporating a weak learner with multiplicative approximation guarantee, to overcome the nonconvexity and attain global convergence. 
Nataly Brukhim · Elad Hazan · Karan Singh 🔗 


Learning Stackelberg Equilibria in Sequential Price Mechanisms
(Poster)
We study the problem of the design of simple economic mechanisms for assigning items to selfinterested agents that combine a messaging round with a sequentialpricing stage. The rules of the sequentialpricing stage and in particular the way these rules use messages determines the way the messaging stage is used. This is a Stackelberg game where the designer is the leader and fixes the mechanism rules, inducing an equilibrium amongst agents (the followers). We model the followers through equilibrium play coming from noregret learning, and introduce a novel singleagent Stackelberg MDP formulation, where the leader learns to effect a follower equilibrium that optimizes its objective. We solve this MDP using actorcritic methods, where the critic is given access to the joint information of all the agents. 
Gianluca Brero 🔗 


Refined Policy Improvement Bounds for MDPs
(Poster)
The policy improvement bound on the difference of the discounted returns plays a crucial role in the theoretical justification of the trustregion policy optimization (TRPO) algorithm. The existing bound leads to a degenerate bound when the discount factor approaches one, making the applicability of TRPO and related algorithms questionable when the discount factor is close to one. We refine the results in (Schulman et al., 2015; Achiam et al., 2017) and propose a novel bound that is ``continuous'' in the discount factor. In particular, our bound is applicable for MDPs with the longrun average rewards as well. 
Mark Gluzman 🔗 


Meta Learning MDPs with linear transition models
(Poster)
We study metalearning in Markov Decision Processes (MDP) with linear transition models in the undiscounted episodic setting. Under a task sharedness metric based on model proximity, we propose an algorithm that can meaningfully leverage learning in a set of sampled training tasks to quickly adapt to test tasks sampled from the same task distribution. We propose a biased version of the UCMatrixRL algorithm~\cite{yang2019reinforcement}. The analysis leverages and extends results in the learning to learn linear regression and linear bandit setting to the more general case of MDP's with linear transition models. We study the effect of the bias on single task regret and expected regret over the task distribution. We prove that our algorithm provides significant improvements in the transfer regret for task distributions of low variance and high bias compared to learning the tasks in isolation. We outline and analyse two approaches to learn the bias. 
Robert Müller · Aldo Pacchiano · Jack ParkerHolder 🔗 


The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition
(Poster)
We consider the bestofbothworlds problem for learning an episodic Markov Decision Process through $T$ episodes, with the goal of achieving $O(\sqrt{T})$ regret when the losses are adversarial and simultaneously $O(\log T)$ regret when the losses are (almost) stochastic. Recent work by [Jin and Luo, 2020] achieves this goal when the fixed transition is known, and leaves the case of unknown transition as a major open question. In this work, we resolve this open problem by using the same FollowtheRegularizedLeader (FTRL) framework together with a set of new techniques. Specifically, we first propose a lossshifting trick in the FTRL analysis, which greatly simplifies the approach of [Jin and Luo, 2020] and already improves their results for the known transition case. Then, we extend this idea to the unknown transition case and develop a novel analysis which upper bounds the transition estimation error by the regret itself in the stochastic setting, a key property to ensure $O(\log T)$ regret.

Tiancheng Jin · Longbo Huang · Haipeng Luo 🔗 


Identification and Adaptive Control of Markov Jump Systems: Sample Complexity and Regret Bounds
(Poster)
Learning how to effectively control unknown dynamical systems is crucial for autonomous systems. This task becomes more challenging when the underlying dynamics are changing with time. Motivated by this challenge, this paper considers the problem of controlling an unknown Markov jump linear system (MJS) to optimize a quadratic objective. By taking a modelbased perspective, we consider identificationbased adaptive control for MJSs. We first provide a system identification algorithm for MJS to learn the dynamics in each mode as well as the Markov transition matrix, underlying the evolution of the mode switches, from a single trajectory of the system states, inputs, and modes. Through mixingtime arguments, sample complexity of this algorithm is shown to be $\tilde{\mathcal{O}}(1/\sqrt{T})$. We then propose an adaptive control scheme that performs system identification together with certainty equivalent control to adapt the controllers in an episodic fashion. Combining our sample complexity results with recent perturbation results for certainty equivalent control, we prove that the proposed adaptive control scheme achieves $\tilde{\mathcal{O}}(\sqrt{T})$ regret, which can be improved to $\hat{\mathcal{O}}(\log(T))$ with partial knowledge of the system. Our analysis introduces innovations to handle MJS specific challenges (e.g. Markovian jumps) and provides insights into system theoretic quantities that affect learning accuracy and control performance.

Yahya Sattar · Zhe Du · Davoud Ataee Tarzanagh · Necmiye Ozay · Laura Balzano · Samet Oymak 🔗 


NonStationary Representation Learning in Sequential MultiArmed Bandits
(Poster)
Most of the existing theoretical studies on representation learning are focused on batch tasks. However, in practical decisionmaking scenarios, the learner often observes tasks in a sequential fashion. In such sequential problems, learning good representations becomes more challenging as the underlying task representation may change over time. In this paper, we address nonstationary representation learning in sequential multiarmed linear bandits. We introduce an online algorithm that is able to detect task switches and learn and transfer a nonstationary representation in an adaptive fashion. We derive a regret upper bound for our algorithm, which significantly outperforms the existing ones that do not learn the representation. Our bound provides theoretical insights into problemdependent quantities and reveals the excess regret incurred by representation learning, nonstationarity, and task switch detection. 
Qin Yuzhen · Tommaso Menara · Samet Oymak · ShiNung Ching · Fabio Pasqualetti 🔗 


ValueBased Deep Reinforcement Learning Requires Explicit Regularization
(Poster)
While deep reinforcement learning (RL) methods present an appealing approach to sequential decisionmaking, such methods are often unstable in practice. What accounts for this instability? Recent theoretical analysis of overparameterized supervised learning with stochastic gradient descent shows that learning is driven by an implicit regularizer once it reaches the zero training error regime, which results in “simpler” functions that generalize. However, in this paper, we show that in the case of deep RL, implicit regularization can instead lead to degenerate features by characterizing the induced implicit regularizer in the semigradient TD learning setting in RL with a fixeddataset (i.e., offline RL). The regularize recapitulates recent empirical findings regarding the rank collapse of learned features and provides an understanding for its cause. To address the adverse impacts of this implicit regularization, we propose a simple and effective explicit regularizer for TD learning, DR3. DR3 minimizes the similarity of learned features of the Qnetwork at consecutive stateaction tuples in the TD update. Empirically, when combined with existing offline RL methods, DR3 substantially improves both performance and stability on Atari 2600 games, D4RL domains, and robotic manipulation from images. 
Aviral Kumar · Rishabh Agarwal · Aaron Courville · Tengyu Ma · George Tucker · Sergey Levine 🔗 


Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses
(Poster)
Policy optimization is a widelyused method in reinforcement learning. Due to its localsearch nature, however, theoretical guarantees on global optimality often rely on extra assumptions on the Markov Decision Processes (MDPs) that bypass the challenge of global exploration. To eliminate the need of such assumptions, in this work, we develop a general solution that adds \textit{dilated bonuses} to the policy update to facilitate global exploration. To showcase the power and generality of this technique, we apply it to several episodic MDP settings with adversarial losses and bandit feedback, improving and generalizing the stateoftheart. Specifically, in the tabular case, we obtain $O(\sqrt{T})$ regret where $T$ is the number of episodes, improving the $O({T}^{\frac{2}{3}})$ regret bound by Shani et al. (2020). When the number of states is infinite, under the assumption that the stateaction values are linear in some lowdimensional features, we obtain $O({T}^{\frac{2}{3}})$ regret with the help of a simulator, matching the result of Neu and Olkhovskaya (2020) while importantly removing the need of an exploratory policy that their algorithm requires.
When a simulator is unavailable, we further consider a linear MDP setting and obtain $O({T}^{\frac{14}{15}})$ regret, which is the first result for linear MDPs with adversarial losses and bandit feedback.

Haipeng Luo · ChenYu Wei · ChungWei Lee 🔗 


On the Sample Complexity of Averagereward MDPs
(Poster)
In this work we study the sample complexity for solving averagereward Markov decision processes (AMDPs), under a generative model access and mixing time bound on all stationary policies. Given an AMDP with mixing time bound $t_{mix}$ and $A_{tot}$ total stateaction pairs, we present two methods for finding approximatelyoptimal stationary policy, altogether obtaining an upper bound of $\widetilde{O}(A_{tot}t_{mix}/\epsilon^2\cdot\min(t_{mix}, 1/\epsilon))$ in sample complexity. We also provide a sample complexity lower bound of $\Omega(A_{tot}t_{mix}/\epsilon^2)$ oblivious samples. This work makes progress toward designing new algorithms with better sample complexity for solving AMDPs and points to the final open problem of closing the gap with the lower bound.

Yujia Jin 🔗 


Finite time analysis of temporal difference learning with linear function approximation: the tail averaged case
(Poster)
In this paper, we study the finitetime behaviour of temporal difference (TD) learning algorithms when combined with tailaveraging, and present instance dependent bounds on the parameter error of the tailaveraged TD iterate. Our error bounds hold in expectation as well as with high probability, exhibit a sharper rate of decay for the initial error (bias), and are comparable with existing bounds in the literature. 
Gandharv Patil · Prashanth L.A. · Doina Precup 🔗 


MultiTask Offline Reinforcement Learning with Conservative Data Sharing
(Poster)
Offline reinforcement learning (RL) algorithms have shown promising results in domains where abundant precollected data is available. However, prior methods focus on solving individual problems from scratch with an offline dataset without considering how an offline RL agent can acquire multiple skills. We argue that a natural use case of offline RL is in settings where we can pool large amounts of data collected in a number of different scenarios for solving various tasks, and utilize all this data to learn strategies for all the tasks more effectively rather than training each one in isolation. To this end, we study the offline multitask RL problem, with the goal of devising datasharing strategies for effectively learning behaviors across all of the tasks. While it is possible to share all data across all tasks, we find that this simple strategy can actually exacerbate the distributional shift between the learned policy and the dataset, which in turn can lead to very poor performance. To address this challenge, we develop a simple technique for datasharing in multitask offline RL that routes data based on the improvement over the taskspecific data. We call this approach conservative data sharing (CDS), and it can be applied with any singletask offline RL method. On a range of challenging multitask locomotion, navigation, and imagebased robotic manipulation problems, CDS achieves the best or comparable performance compared to prior offline multitask RL methods and previously proposed online multitask data sharing approaches. 
Tianhe (Kevin) Yu · Aviral Kumar · Yevgen Chebotar · Karol Hausman · Sergey Levine · Chelsea Finn 🔗 


Provably Efficient MultiTask Reinforcement Learning with Model Transfer
(Poster)
We study multitask reinforcement learning (RL) in tabular episodic Markov decision processes(MDPs). We formulate a heterogeneous multiplayer RL problem, in which a group of players concurrently face similar but not necessarily identical MDPs, with a goal of improving their collective performance through interplayer information sharing. We design and analyze a modelbased algorithm, and provide gapdependent and gapindependent upper and lower bounds that characterize the intrinsic complexity of the problem. 
Chicheng Zhang · Zhi Wang 🔗 


BadPolicy Density: A Measure of ReinforcementLearning Hardness
(Poster)
Reinforcement learning is hard in general. Yet, in many specific environments, learning is easy. What makes learning easy in one environment, but difficult in another? We address this question by proposing a simple measure of reinforcement learning hardness called the badpolicy density. This quantity measures the fraction of the deterministic stationary policy space that is below a desired threshold in value. We prove that this simple quantity has many properties one would expect of a measure of learning hardness. Further, we prove it is NPhard to compute the measure in general, but there are paths to polynomialtime approximation. We conclude by summarizing potential directions and uses for this measure. 
David Abel · Cameron Allen · Dilip Arumugam · D Ellis Hershkowitz · Michael L. Littman · Lawson Wong 🔗 


CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee
(Poster)
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and meanwhile avoids violation of certain constraints on a number of expected total costs. In general, such SRL problems have nonconvex objective functions subject to multiple nonconvex constraints, and hence are very challenging to solve, particularly to provide a globally optimal policy. Many popular SRL algorithms adopt a primaldual structure which utilizes the updating of dual variables for satisfying the constraints. In contrast, we propose a primal approach, called constraintrectified policy optimization (CRPO), which updates the policy alternatingly between objective improvement and constraint satisfaction. CRPO provides a primaltype algorithmic framework to solve SRL problems, where each policy update can take any variant of policy optimization step. To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an $\mathcal{O}(1/\sqrt{T})$ convergence rate to the global optimal policy in the constrained policy set and an $\mathcal{O}(1/\sqrt{T})$ error bound on constraint satisfaction. This is the first finitetime analysis of primal SRL algorithms with global optimality guarantee. Our empirical results demonstrate that CRPO can outperform the existing primaldual baseline algorithms significantly.

Tengyu Xu · Yingbin LIANG · Guanghui Lan 🔗 


Sparsity in the Partially Controllable LQR
(Poster)
A fundamental concept in control theory is that of controllability, where any system state can be reached through an appropriate choice of control inputs. Indeed, a large body of classical and modern approaches are designed for controllable linear dynamical systems. However, in practice, we often encounter systems in which a large set of state variables evolve exogenously and independently of the control inputs; such systems are only \emph{partially controllable}. The focus of this work is on a large class of partially controllable linear dynamical system, specified by an underlying sparsity pattern. Our main results establish structural conditions and finitesample guarantees for learning to control such systems. In particular, our structural results characterize those state variables which are irrelevant for optimal control, an analysis which departs from classical control techniques. Our algorithmic results adapt techniques from highdimensional statisticsspecifically softthresholding and semiparametric leastsquaresto exploit the underlying sparsity pattern in order to obtain finitesample guarantees that significantly improve over those based on certaintyequivalence. We also corroborate these theoretical improvements over certaintyequivalent control through a simulation study. 
Yonathan Efroni · Sham Kakade · Akshay Krishnamurthy · Cyril Zhang 🔗 


Provable Benefits of ActorCritic Methods for Offline Reinforcement Learning
(Poster)
Actorcritic methods are widely used in offline reinforcement learning practice but are understudied theoretically. In this work we show that the pessimism principle can be naturally incorporated into actorcritic formulations. We create an offline actorcritic algorithm for a linear MDP model more general than the lowrank model. The procedure is both minimax optimal and computationally tractable. 
Andrea Zanette · Martin Wainwright · Emma Brunskill 🔗 


Solving MultiArm Bandit Using a Few Bits of Communication
(Poster)
The multiarmed bandit (MAB) problem is an active learning framework that aims to select the best among a set of actions by sequentially observing rewards. Recently, it has become popular for a number of applications over wireless networks, where communication constraints can form a bottleneck. Yet existing works usually fail to address this issue and can become infeasible in certain applications.
In this paper, we propose QuBan, a generic reward quantization algorithm that applies to any (noregret) multiarmed bandit algorithm. The modified algorithm requires on average a few (as low as $3$) bits to be sent per iteration, yet preserving the same regret as the original algorithm.
Our upper bounds apply under mild assumptions on the reward distributions over all current (and future) MAB algorithms, including those used in contextual bandits.
We also numerically evaluate the application of QuBan to widely used algorithms such as UCB and $\epsilon$greedy.

Osama Hanna · Lin Yang · Christina Fragouli 🔗 


Implicit FiniteHorizon Approximation for Stochastic Shortest Path
(Poster)
We introduce a generic template for developing regret minimization algorithms in the Stochastic Shortest Path (SSP) model, which achieves minimax optimal regret as long as certain properties are ensured. The key of our analysis is a new technique called implicit finitehorizon approximation, which approximates the SSP model by a finitehorizon counterpart only in the analysis without explicit implementation. Using this template, we develop two new algorithms: the first one is modelfree (the first in the literature to our knowledge) and minimax optimal under strictly positive costs; the second one is modelbased and minimax optimal even with zerocost stateaction pairs, matching the best existing result from (Tarbouriech et al., 2021b). Importantly, both algorithms admit highly sparse updates, making them computationally more efficient than all existing algorithms. Moreover, both can be made completely parameterfree. 
Liyu Chen · Mehdi Jafarnia · Rahul Jain · Haipeng Luo 🔗 


On the Theory of Reinforcement Learning with OnceperEpisode Feedback
(Poster)
We study a theory of reinforcement learning (RL) in which the learner receives binary feedback only once at the end of an episode. While this is an extreme test case for theory, it is also arguably more representative of realworld applications than the traditional requirement in RL practice that the learner receive feedback at every time step. Indeed, in many realworld applications of reinforcement learning, such as selfdriving cars and robotics, it is easier to evaluate whether a learner's complete trajectory was either "good" or "bad," but harder to provide a reward signal at each step. To show that learning is possible in this more challenging setting, we study the case where trajectory labels are generated by an unknown parametric model, and provide a statistically and computationally efficient algorithm that achieves sublinear regret. 
Niladri Chatterji · Aldo Pacchiano · Peter Bartlett · Michael Jordan 🔗 


SampleEfficient Learning of Stackelberg Equilibria in GeneralSum Games
(Poster)
Real world applications such as economics and policy making often involve solving multiagent games with two unique features: (1) The agents are inherently \emph{asymmetric} and partitioned into leaders and followers; (2) The agents have different reward functions, thus the game is \emph{generalsum}. The majority of existing results in this field focuses on either symmetric solution concepts (e.g. Nash equilibrium) or zerosum games. It remains vastly open how to learn the \emph{Stackelberg equilibrium}an asymmetric analog of the Nash equilibriumin generalsum games efficiently from samples. This paper initiates the theoretical study of sampleefficient learning of the Stackelberg equilibrium, in the bandit feedback setting where we only observe noisy samples of the reward. We consider three representative twoplayer generalsum games: bandit games, banditreinforcement learning (banditRL) games, and linear bandit games. In all these games, we identify a fundamental gap between the exact value of the Stackelberg equilibrium and its estimated version using finitely many noisy samples, which can not be closed informationtheoretically regardless of the algorithm. We then establish sharp positive results on sampleefficient learning of Stackelberg equilibrium with value optimal up to the gap identified above, with matching lower bounds in the dependency on the gap, error tolerance, and the size of the action spaces. Overall, our results unveil unique challenges in learning Stackelberg equilibria under noisy bandit feedback, which we hope could shed light on future research on this topic. 
Yu Bai · Chi Jin · Huan Wang · Caiming Xiong 🔗 