Finite-Sample Analysis of Off-Policy TD-Learning via Generalized Bellman Operators
Zaiwei Chen · Siva Maguluri · Sanjay Shakkottai · Karthikeyan Shanmugam
Policy evaluation (including multi-step off-policy importance sampling) can be interpreted as solving a generalized Bellman equation. In this paper, we derive finite-sample bounds for any general off-policy TD-like stochastic approximation algorithm that solves for the fixed point of this generalized Bellman operator. Our key step is to show that the generalized Bellman operator is simultaneously a contraction mapping with respect to a weighted $\ell_p$-norm for each $p$ in $[1,\infty)$, with a common contraction factor. Our results immediately imply finite-sample bounds for variants of off-policy TD-learning algorithms in the literature (e.g., $Q^\pi(\lambda)$, Tree-Backup, Retrace, and $Q$-trace).
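The algorithm family the abstract refers to can be sketched as a single multi-step TD target parameterized by per-step trace coefficients $c_t$; the named variants differ only in how $c_t$ is chosen. A minimal illustrative sketch follows, using the standard coefficient choices from the off-policy TD literature (the function names and tabular setup here are illustrative assumptions, not code from the paper):

```python
def trace_coefficient(variant, lam, pi_prob, mu_prob):
    """Per-step trace coefficient c_t for common off-policy TD variants.

    pi_prob / mu_prob are the target / behavior policy probabilities of the
    action taken, so rho = pi_prob / mu_prob is the importance-sampling ratio.
    Coefficient choices follow the standard literature definitions.
    """
    rho = pi_prob / mu_prob
    if variant == "qpi_lambda":   # Q^pi(lambda): c_t = lambda (no IS correction)
        return lam
    if variant == "tree_backup":  # Tree-Backup: c_t = lambda * pi(a_t | s_t)
        return lam * pi_prob
    if variant == "retrace":      # Retrace: c_t = lambda * min(1, rho) (clipped IS)
        return lam * min(1.0, rho)
    raise ValueError(f"unknown variant: {variant}")


def multistep_target(q_init, deltas, coeffs, gamma):
    """Generalized multi-step target:
        Q(s_0, a_0) + sum_t gamma^t (c_1 * ... * c_t) * delta_t,
    where delta_t is the one-step TD error at step t (the empty product
    at t = 0 is 1, so the first TD error always enters with weight 1).
    """
    target, discount, trace = q_init, 1.0, 1.0
    for delta, c in zip(deltas, coeffs):
        target += discount * trace * delta
        discount *= gamma
        trace *= c  # c_t scales all *subsequent* TD errors
    return target
```

Fixing a variant fixes the trace coefficients and hence the generalized Bellman operator whose fixed point the corresponding TD algorithm solves for; the paper's contraction result covers all such choices at once.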

Author Information

Karthikeyan Shanmugam (IBM Research NY)

I have been a Research Staff Member with the IBM Research AI group in NY since 2017. Previously, I was a Herman Goldstine Postdoctoral Fellow in the Math Sciences Division at IBM Research, NY. I obtained my Ph.D. in Electrical and Computer Engineering from UT Austin in summer 2016; my advisor at UT was Alex Dimakis. I obtained my MS degree in Electrical Engineering (2010-2012) from the University of Southern California, and B.Tech and M.Tech degrees in Electrical Engineering from IIT Madras in 2010. My research interests broadly lie in graph algorithms, machine learning, optimization, coding theory, and information theory. In machine learning, my recent focus is on graphical model learning, causal inference, and explainability. I also work on problems relating to information flow, storage, and caching over networks.