Paper ID: 628
Title: On the Analysis of Complex Backup Strategies in Monte Carlo Tree Search

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper empirically investigates which kind of value propagation is most appropriate in MCTS under bounded computational resources: back-propagating a weighted combination of instantaneous rewards and values along the path followed, as opposed to the cumulative reward gathered along the path. It conjectures (and confirms via complementary experiments) that for "needle in the haystack" problems, it is more beneficial to back-propagate the max rather than the average cumulative value, and that intermediate weights (moderate lambda) are beneficial.

Clarity - Justification: see below

Significance - Justification: see below

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Reading notes.

Expansion: l. 142 et seq. are a bit misleading: in the standard setting of large problems (with no shortage of planning simulations), a child node is added only after its parent node has been visited a few times.

UCB1 is the most popular, but not the optimal, criterion for action selection; see "The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond", A. Garivier and O. Cappé, COLT 2011. I cannot find where you explain how C is adjusted in the UCB criterion.

The scientific question considered in the paper is very interesting; it has been discussed in the RL framework, e.g. suggesting that one might be better off considering a gamma value lower than the one specified when learning the optimal policy: see "Biasing Approximate Dynamic Programming with a Lower Discount Factor", M. Petrik and B. Scherrer, NIPS 2008.

The experiments in Fig. 2 are very interesting. I understand that MaxMCTS is nearly always better in the considered application domain, with a small horizon and a low computational budget. It is interesting to see that in a similar setting, maximizing the min also yields excellent results, even better than UCT (except for an optimally calibrated C): see "Exploration vs Exploitation vs Safety: Risk-Aware Multi-Armed Bandits", N. Galichet, M. Sebag, and O. Teytaud, ACML 2013. The reason seems to be that the empirical convergence of the max or the min is faster than that of the average (e.g. exponential, as opposed to the Hoeffding rate). The fact that MaxMCTS(lambda) is even much better -- when lambda is optimally tuned -- is over-interpreted, in my opinion.

Suggestion: it might be interesting to consider a nested optimization problem, achieving the self-adjustment of lambda; a sketch follows below.
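A minimal sketch of what such self-adjustment could look like, treating lambda itself as the arm of an outer bandit over planning episodes. Everything here is hypothetical and only illustrates the suggestion: run_mcts_episode is an assumed stand-in for one MaxMCTS(lambda) planning episode, faked with a noisy synthetic return.

    # Hypothetical sketch: self-adjust lambda by treating each candidate
    # value as an arm of an outer UCB1 bandit over planning episodes.
    import math
    import random

    def run_mcts_episode(lam):
        # Stand-in for one MaxMCTS(lambda) planning episode with backup
        # parameter `lam`; faked here as a noisy return peaking at 0.6.
        return -abs(lam - 0.6) + random.gauss(0.0, 0.1)

    def self_adjusting_lambda(n_episodes=500,
                              grid=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0), c=1.0):
        counts = [0] * len(grid)   # episodes run with each lambda
        means = [0.0] * len(grid)  # running mean return per lambda
        for t in range(1, n_episodes + 1):
            if t <= len(grid):     # play each arm once first
                i = t - 1
            else:                  # then choose optimistically (UCB1)
                i = max(range(len(grid)), key=lambda k:
                        means[k] + c * math.sqrt(math.log(t) / counts[k]))
            ret = run_mcts_episode(grid[i])
            counts[i] += 1
            means[i] += (ret - means[i]) / counts[i]  # incremental mean
        return grid[max(range(len(grid)), key=lambda k: means[k])]

    print("selected lambda:", self_adjusting_lambda())

Whether the outer bandit converges quickly enough relative to the planning budget is of course the empirical question; the sketch is only meant to make the suggestion concrete.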
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
Monte Carlo Tree Search (MCTS) relies on backups with the full return to estimate the quality of state-action combinations. From the reinforcement learning literature it is well known that update targets other than the full return can often result in faster learning. In this paper, the authors investigate whether using alternative update targets can also benefit MCTS. The authors propose four different variations of the update target and perform an extensive empirical comparison using planning domains from the IPC.

Clarity - Justification: The paper is very well written and easy to follow.

Significance - Justification: Looking at different backup strategies in MCTS is a relevant topic and makes a lot of sense. One thing missing from the paper is a comparison of the computational cost of the different backup strategies. A computational comparison is important for a proper evaluation, because a more expensive backup strategy might mean that fewer simulations/trajectories can be performed within the same time budget.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Planning methods are typically evaluated by the solution quality they obtain within a certain time limit. Hence, one of the main trade-offs to consider is the quality of a heuristic versus how expensive that heuristic is to compute. In the context of MCTS, having better-quality backups does not necessarily mean having better overall performance: if the backups are more expensive, fewer trajectory roll-outs can be performed in the same amount of time. That the computational cost is not discussed or taken into account at all is, in my view, a serious flaw of the paper. It could be the case that the computational cost of the backups is negligible with respect to the cost of generating the simulations, so that the cost of the backups can be ignored; however, if this is the case, it should be clearly stated.

Furthermore, the conclusion that MaxMCTS(lambda) outperforms all other proposed and existing MCTS variants is only true when lambda is optimized for each domain. When lambda is not optimized, performance can be much worse than that of regular MCTS. This makes MaxMCTS(lambda) less interesting for a general-purpose planner.

Overall, this is a great paper. However, the evaluation of the methods is incomplete without a discussion of the computational cost.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper studies complex backup strategies for MCTS in finite-horizon probabilistic planning problems. It develops several MCTS backup variants that correspond to popular RL backup schemes. The paper shows that some of these backup schemes do better than the standard averaging.

Clarity - Justification: The paper is clear for the most part but assumes a rather large degree of familiarity with both areas, RL and MCTS.

Significance - Justification: It is nice to see this combination of RL backups and MCTS explored in systematic experimental work. There were a few earlier papers (referenced here) that tried a variety of backup schemes, but this one seems less ad hoc, since it builds on standard RL ideas.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Is getNode the same as getOrInitNode?
Line 199: I did not understand the conditions; why not just 1/K?
I am not convinced that uniform random action selection is the best choice for comparing the algorithms, and I am not convinced the insights carry over directly to more informed action selection, since it changes the shape of the tree. Consider, e.g., the limit of perfect action selection.
Lots of typos should be fixed.
Citation style varies widely and should be unified.

=====