Skip to yearly menu bar Skip to main content

Workshop: Workshop on Reinforcement Learning Theory

Finite-Sample Analysis of Off-Policy TD-Learning via Generalized Bellman Operators

Zaiwei Chen · Siva Maguluri · Sanjay Shakkottai · Karthikeyan Shanmugam

Abstract: Policy evaluation (including multi-step off-policy importance sampling) has the interpretation of solving a generalized Bellman equation. In this paper, we derive finite-sample bounds for any general off-policy TD-like stochastic approximation algorithm that solves for the fixed point of this generalized Bellman operator. Our key step is to show that the generalized Bellman operator is simultaneously a contraction mapping with respect to a weighted $\ell_p$-norm for each $p$ in $[1,\infty)$, with a common contraction factor. Our results immediately imply finite-sample bounds of variants of off-policy TD-learning algorithms in the literature (e.g. $Q^\pi(\lambda)$, Tree-Backup, Retrace, and $Q$-trace).

Chat is not available.