Skip to yearly menu bar Skip to main content


Oral

POLITEX: Regret Bounds for Policy Iteration using Expert Prediction

Yasin Abbasi-Yadkori · Peter Bartlett · Kush Bhatia · Nevena Lazic · Csaba Szepesvari · Gellért Weisz

[ ] [ Visit Online Learning ]
[ Slides [ Oral

Abstract:

We present POLITEX (POLicy ITeration using EXperts), a model-free reinforcement learning (RL) algorithm that uses linear function approximation for continuing RL problems. POLITEX can be thought of as a “soft” variant of policy iteration, where the policy in each iteration corresponds to a Boltzmann distribution over the sum of previous action-value functions. We show that in uniformly mixing Markov Decision Processes (MDPs), for a time-horizon of T and a worst-case value function approximation error ε where linear function approximation is used with d-dimensional features, the regret of POLITEX scales as O˜(d^(1/2)T^(3/4) + εT). Under a uniform mixing assumption, we provide the first regret result for a practical model-free method that uses function approximation and where the regret does not scale with the size of the underlying MDP. We also provide a new finite sample analysis of the LSPE algorithm, used by POLITEX to estimate the value functions, which may be of independent interest. Experimental results on a queuing problem confirm that POLITEX is competitive with some of its alternatives, while preliminary results on Ms Pacman (one of the standard Atari benchmark problems) confirm the viability of POLITEX beyond linear function approximation.

Chat is not available.