POLITEX: Regret Bounds for Policy Iteration using Expert Prediction
Yasin Abbasi-Yadkori · Peter Bartlett · Kush Bhatia · Nevena Lazic · Csaba Szepesvari · Gellért Weisz

Thu Jun 13th 11:35 -- 11:40 AM @ Room 102

We present POLITEX (POLicy ITeration using EXperts), a model-free reinforcement learning (RL) algorithm that uses linear function approximation for continuing RL problems. POLITEX can be thought of as a "soft" variant of policy iteration, where the policy in each iteration corresponds to a Boltzmann distribution over the sum of the action-value functions of all previous iterations. We show that in uniformly mixing Markov Decision Processes (MDPs), for a time horizon T, where value functions are approximated with d-dimensional linear features and the worst-case approximation error is ε, the regret of POLITEX scales as Õ(d^(1/2) T^(3/4) + εT). This is the first regret result for a practical model-free method that uses function approximation in which the regret does not scale with the size of the underlying MDP. We also provide a new finite-sample analysis of the LSPE algorithm, which POLITEX uses to estimate the value functions and which may be of independent interest. Experimental results on a queuing problem confirm that POLITEX is competitive with some of its alternatives, while preliminary results on Ms. Pac-Man (one of the standard Atari benchmark problems) confirm the viability of POLITEX beyond linear function approximation.
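The policy update described in the abstract — a Boltzmann (softmax) distribution over the sum of previous action-value estimates — can be sketched in a few lines. This is a minimal illustrative sketch for the tabular case, assuming a fixed inverse-temperature parameter `eta`; the function name and the array-based representation are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def politex_policy(q_sum, eta):
    """Boltzmann policy over summed Q-estimates (illustrative sketch).

    q_sum : array of shape (n_states, n_actions), the sum of the
            action-value estimates from all previous iterations.
    eta   : inverse-temperature parameter (an assumed hyperparameter).
    Returns an array of shape (n_states, n_actions) of action probabilities.
    """
    logits = eta * q_sum
    # Subtract the per-state max for numerical stability before exponentiating.
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```

With all-zero Q-sums the policy is uniform over actions; as the summed Q-values of one action grow across iterations, the policy concentrates on it, which is the "soft" policy-improvement step.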

Author Information

Yasin Abbasi-Yadkori (Adobe Research)
Peter Bartlett (University of California, Berkeley)
Kush Bhatia (UC Berkeley)
Nevena Lazic (Google)
Csaba Szepesvari (DeepMind/University of Alberta)
Gellért Weisz (DeepMind)
