Oral
Optimistic Policy Optimization via Multiple Importance Sampling
Matteo Papini · Alberto Maria Metelli · Lorenzo Lupo · Marcello Restelli
Abstract:
Policy Search (PS) is an effective approach to Reinforcement Learning for solving
control tasks with continuous state-action spaces. In this paper, we address the exploration-exploitation trade-off in PS by proposing an approach based on Optimism in the Face of Uncertainty. We cast the PS problem as a suitable Multi-Armed Bandit (MAB) problem, defined over the policy parameter space, and we propose a class of algorithms that effectively exploit the problem structure by leveraging Multiple Importance Sampling to perform an off-policy estimation of the expected return.
We show that the regret of the proposed approach is bounded by $\widetilde{\mathcal{O}}(\sqrt{T})$ for both discrete and continuous parameter spaces. Finally, we evaluate our algorithms on tasks of varying difficulty, comparing them with existing MAB and RL algorithms.
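Below is a minimal sketch of the kind of optimistic, MIS-based loop the abstract describes, in a toy setting. It is not the paper's algorithm: the discrete candidate grid, the Gaussian sampling policies, the toy return function `true_return`, and the UCB-style exploration bonus (standing in for the paper's divergence-based confidence term) are all illustrative assumptions.

```python
import numpy as np

# Illustrative sketch only: optimistic selection over a discrete set of
# policy parameters, with past samples reused via a balance-heuristic
# Multiple Importance Sampling (MIS) estimate of the expected return.

rng = np.random.default_rng(0)

candidates = np.linspace(-2.0, 2.0, 21)   # "arms": candidate policy means
sigma = 0.5                                # fixed std of each sampling policy

def gauss_pdf(x, mean):
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def true_return(x):
    # Hypothetical noisy return of the policy with sampled parameter x.
    return np.exp(-x ** 2) + 0.1 * rng.standard_normal()

samples, returns, chosen_means = [], [], []

for t in range(1, 201):
    if not samples:
        mean_t = candidates[rng.integers(len(candidates))]
    else:
        xs = np.array(samples)
        rs = np.array(returns)
        n = len(xs)
        # Balance-heuristic mixture density of all past sampling policies.
        mix = np.mean([gauss_pdf(xs, m) for m in chosen_means], axis=0)
        scores = []
        for m in candidates:
            w = gauss_pdf(xs, m) / mix            # MIS weights for target m
            est = np.mean(w * rs)                 # off-policy return estimate
            # Illustrative optimism bonus; the paper instead bounds the
            # estimator's concentration via a Renyi-divergence term.
            bonus = np.sqrt(np.mean(w ** 2) * np.log(t) / n)
            scores.append(est + bonus)
        mean_t = candidates[int(np.argmax(scores))]
    # Pull the chosen arm: sample a parameter and observe its return.
    x_t = mean_t + sigma * rng.standard_normal()
    samples.append(x_t)
    returns.append(true_return(x_t))
    chosen_means.append(mean_t)

print("final choice:", chosen_means[-1])
```

Note how all past samples contribute to the estimate for every candidate arm through the mixture density; this sample reuse is what makes the off-policy, optimistic selection practical.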