Timezone: »

When is Agnostic Reinforcement Learning Statistically Tractable?
Gene Li · Zeyu Jia · Alexander Rakhlin · Ayush Sekhari · Nati Srebro
Event URL: https://openreview.net/forum?id=WZE4c4hPgH »
We study the problem of agnostic PAC reinforcement learning (RL): given a policy class $\Pi$, how many rounds of interaction with an unknown MDP (with a potentially large state and action space) are required to learn an $\epsilon$-suboptimal policy with respect to (\Pi)? Towards that end, we introduce a new complexity measure, called the spanning capacity, that depends solely on the set (\Pi) and is independent of the MDP dynamics. With a generative model, we show that the spanning capacity characterizes PAC learnability for every policy class $\Pi$. However, for online RL, the situation is more subtle. We show there exists a policy class $\Pi$ with a bounded spanning capacity that requires a superpolynomial number of samples to learn. This reveals a surprising separation for agnostic learnability between generative access and online access models (as well as between deterministic/stochastic MDPs under online access). On the positive side, we identify an additional sunflower structure which in conjunction with bounded spanning capacity enables statistically efficient online RL via a new algorithm called POPLER, which takes inspiration from classical importance sampling methods as well as recent developments for reachable-state identification and policy evaluation in reward-free exploration.

Author Information

Gene Li (TTIC)
Zeyu Jia (Massachusetts Institute of Technology)
Alexander Rakhlin (MIT)
Ayush Sekhari (Cornell University)
Nati Srebro (Toyota Technological Institute at Chicago)

More from the Same Authors