

Scalable Safe Policy Improvement via Monte Carlo Tree Search

Alberto Castellini · Federico Bianchi · Edoardo Zorzi · Thiago Simão · Alessandro Farinelli · Matthijs T. J. Spaan

Exhibit Hall 1 #644


Algorithms for safely improving policies are important for deploying reinforcement learning approaches in real-world scenarios. In this work, we propose an algorithm, called MCTS-SPIBB, that computes safe policy improvement online using a Monte Carlo Tree Search-based strategy. We theoretically prove that the policy generated by MCTS-SPIBB converges, as the number of simulations grows, to the optimal safely improved policy generated by Safe Policy Improvement with Baseline Bootstrapping (SPIBB), a popular algorithm based on policy iteration. Moreover, our empirical analysis performed on three standard benchmark domains shows that MCTS-SPIBB scales to significantly larger problems than SPIBB because it computes the policy online and locally, i.e., only in the states actually visited by the agent.
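To make the baseline-bootstrapping idea concrete, below is a minimal, hedged sketch of a SPIBB-style policy-improvement step for a single state. All names (`spibb_policy`, `n_wedge`) are illustrative assumptions, not taken from the paper: actions whose dataset count falls below a threshold keep the baseline policy's probability, and the freed probability mass goes to the best well-sampled action. MCTS-SPIBB applies this kind of constrained improvement online, only at the states reached during tree-search simulations.

```python
def spibb_policy(q, baseline, counts, n_wedge):
    """Sketch of a SPIBB-style improvement step for one state.

    q        : list of Q-value estimates, one per action
    baseline : list of baseline-policy probabilities, one per action
    counts   : list of dataset visit counts, one per action
    n_wedge  : count threshold below which only the baseline is trusted
               (hypothetical parameter name)
    """
    n = len(q)
    policy = [0.0] * n
    free_mass = 0.0
    well_sampled = []
    for a in range(n):
        if counts[a] < n_wedge:
            # "Bootstrapped" action: too little data, keep baseline probability.
            policy[a] = baseline[a]
        else:
            # Enough data: its baseline mass may be reallocated.
            free_mass += baseline[a]
            well_sampled.append(a)
    if well_sampled:
        # Give all reallocatable mass to the best well-sampled action.
        best = max(well_sampled, key=lambda a: q[a])
        policy[best] += free_mass
    else:
        # Every action is uncertain: fall back to the baseline entirely.
        policy = list(baseline)
    return policy
```

For example, with three actions where action 1 is under-sampled, the baseline probability of action 1 is preserved while the rest of the mass concentrates on the highest-value well-sampled action. This per-state, local computation is what allows the MCTS variant to avoid sweeping the full state space.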
