Optimal and instance-dependent oracle inequalities for policy evaluation
Wenlong Mou · Ashwin Pananjady · Martin Wainwright

Linear fixed point equations in Hilbert spaces naturally arise from the policy evaluation problem in reinforcement learning. We study methods that use a collection of random observations to compute approximate solutions by searching over a known low-dimensional subspace of the Hilbert space. First, we prove an instance-dependent upper bound on the mean-squared error for a linear stochastic approximation scheme that exploits Polyak-Ruppert averaging. This bound consists of two terms: an approximation error term with an instance-dependent approximation factor, and a statistical error term that captures the instance-specific complexity of the noise when projected onto the low-dimensional subspace. Using information-theoretic methods, we also establish lower bounds showing that the approximation factor cannot be improved, again in an instance-dependent sense. A concrete consequence of our characterization is that the optimal approximation factor in this problem can be much larger than a universal constant. We show how our results precisely characterize the error of a class of temporal difference learning methods for the policy evaluation problem with linear function approximation, establishing their optimality.
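
To make the setting concrete, here is a minimal sketch of temporal difference learning with linear function approximation and Polyak-Ruppert averaging on a small Markov reward process. It is an illustration under stated assumptions, not the authors' implementation. The three-state chain, the rewards, the feature matrix, the constant step size, and the function names `averaged_td0` and `projected_fixed_point` are all hypothetical choices made for this example, and transitions are assumed to be drawn i.i.d. with the current state sampled from the stationary distribution.

```python
import numpy as np


def averaged_td0(P, r, Phi, mu, gamma, n_steps, step_size, seed=0):
    """TD(0) with linear features, returning the Polyak-Ruppert average.

    Assumption: each observation is an i.i.d. transition (s, s') with
    s ~ mu (stationary distribution) and s' ~ P[s].
    """
    rng = np.random.default_rng(seed)
    n_states, d = Phi.shape
    theta = np.zeros(d)      # current stochastic approximation iterate
    theta_bar = np.zeros(d)  # running average of the iterates
    states = rng.choice(n_states, size=n_steps, p=mu)
    for t, s in enumerate(states, start=1):
        s_next = rng.choice(n_states, p=P[s])
        # Temporal difference error under the current linear value estimate.
        td_error = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
        theta = theta + step_size * td_error * Phi[s]
        # Incremental update of the average of theta_1, ..., theta_t.
        theta_bar += (theta - theta_bar) / t
    return theta_bar


def projected_fixed_point(P, r, Phi, mu, gamma):
    """Solve Phi^T D (Phi - gamma P Phi) theta = Phi^T D r with D = diag(mu)."""
    D = np.diag(mu)
    A = Phi.T @ D @ (Phi - gamma * P @ Phi)
    b = Phi.T @ D @ r
    return np.linalg.solve(A, b)


# Hypothetical three-state chain with a two-dimensional feature subspace.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
mu = np.array([0.25, 0.50, 0.25])   # stationary distribution of P
r = np.array([1.0, 0.0, 2.0])       # reward is not in the span of Phi
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])
gamma = 0.5

theta_star = projected_fixed_point(P, r, Phi, mu, gamma)
theta_bar = averaged_td0(P, r, Phi, mu, gamma, n_steps=20000, step_size=0.05)

# Exact value function, for comparison with the subspace approximation.
v_true = np.linalg.solve(np.eye(3) - gamma * P, r)

print("projected fixed point:", theta_star)
print("averaged TD estimate: ", theta_bar)
print("approximation gap:    ", np.linalg.norm(Phi @ theta_star - v_true))
```

In this sketch the distance between `theta_bar` and `theta_star` plays the role of the statistical error, while the gap between `Phi @ theta_star` and `v_true` reflects the approximation error incurred by restricting the search to the feature subspace.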

Author Information

Wenlong Mou (UC Berkeley)
Ashwin Pananjady (Georgia Institute of Technology)
Martin Wainwright (UC Berkeley / Voleon)