Oral

Optimistic Policy Optimization via Multiple Importance Sampling

Matteo Papini · Alberto Maria Metelli · Lorenzo Lupo · Marcello Restelli

Abstract: Policy Search (PS) is an effective approach to Reinforcement Learning for solving control tasks with continuous state-action spaces. In this paper, we address the exploration-exploitation trade-off in PS by proposing an approach based on Optimism in the Face of Uncertainty. We cast the PS problem as a suitable Multi-Armed Bandit (MAB) problem, defined over the policy parameter space, and we propose a class of algorithms that effectively exploit the problem structure by leveraging Multiple Importance Sampling to perform an off-policy estimation of the expected return. We show that the regret of the proposed approach is bounded by $\widetilde{\mathcal{O}}(\sqrt{T})$ for both discrete and continuous parameter spaces. Finally, we evaluate our algorithms on tasks of varying difficulty, comparing them with existing MAB and RL algorithms.
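To illustrate the idea of treating policy parameters as bandit arms and reusing all past samples through Multiple Importance Sampling, here is a minimal sketch. It is not the paper's exact algorithm: the problem setup (a scalar outcome standing in for a trajectory, a Gaussian sampling model, and a generic UCB-style exploration bonus in place of the paper's divergence-based one) is an illustrative assumption. The MIS estimate uses the standard balance heuristic.

```python
import numpy as np

# Illustrative sketch only: arms are candidate policy parameters; each arm
# induces a Gaussian over a scalar outcome x (standing in for a trajectory),
# and r(x) is the return of that outcome. All names here are assumptions.

rng = np.random.default_rng(0)
thetas = np.linspace(-2.0, 2.0, 9)     # candidate policy parameters (arms)
sigma = 0.5                            # known sampling noise
r = lambda x: np.exp(-(x - 1.0) ** 2)  # return landscape, unknown to the agent

def gauss_pdf(x, mean):
    return np.exp(-(x - mean) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

xs, arms_played = [], []  # all observed outcomes and the arm that produced each

for t in range(1, 201):
    if xs:
        x_arr = np.array(xs)
        counts = np.bincount(arms_played, minlength=len(thetas))
        # Balance-heuristic mixture density of all behavioral policies so far
        mix = sum(counts[k] / len(xs) * gauss_pdf(x_arr, thetas[k])
                  for k in range(len(thetas)))
        returns = r(x_arr)
        # Off-policy MIS estimate of the expected return J(theta) for every arm,
        # reusing every sample regardless of which arm generated it
        j_hat = np.array([np.mean(gauss_pdf(x_arr, th) / mix * returns)
                          for th in thetas])
        # Generic UCB-style bonus as a stand-in for the paper's exploration term
        bonus = np.sqrt(np.log(t) / np.maximum(counts, 1))
        k = int(np.argmax(j_hat + bonus))  # optimism in the face of uncertainty
    else:
        k = int(rng.integers(len(thetas)))
    x = rng.normal(thetas[k], sigma)  # "run the policy", observe an outcome
    xs.append(x)
    arms_played.append(k)

print("best estimated arm:", thetas[int(np.argmax(j_hat))])
```

Under this toy model, the optimistic index concentrates play near the parameter whose induced distribution yields the highest expected return, while the MIS estimator keeps every past sample informative for every arm.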
