We thank the reviewers for their constructive comments.

Reviewers 1 & 3 write that our work may seem incremental with respect to earlier work on single-shot problems. We would like to note that results from single-shot problems do not necessarily translate to sequential decision problems, which are substantially more difficult, with complex dependencies between the policy, the return obtained, and the decision problems experienced along the trajectory. No earlier work has discussed the potential utility of these regularities for solving sequential decision problems; ours is the first.

Reviewer 1
The lower scores are due to two implementation differences. First, we use only 16 rows of the board for placement (following Bertsekas 1996), while the original BCTS work used all 20 rows, effectively using an outside area for displaying/rotating/translating the pieces. Second, we do not filter out actions that immediately end the game.

Reviewer 3
We would state the takeaway message of our paper as follows: sequential decision problems exhibit regularities that no learning/planning algorithm yet deliberately takes advantage of, and these regularities are fundamentally different from those that are already known. This finding is significant because solving sequential decision problems quickly (using few samples and little computation) remains a challenge. Our paper introduces novel directions for algorithm development and a rich set of questions for further investigating these regularities.

Action pruning is one possible use of these regularities, and one we explore in the paper. In both games, a human player would immediately eliminate most alternatives as inferior. But that is different from being able to articulate a computational mechanism for doing so effectively and consistently, which is what we do here. Our finding is intuitive, but we believe this is so only in hindsight; we did not expect it when we started this work.

Another possible use is in prioritizing learning samples. These regularities may help select samples that are more useful than others, and thus help develop algorithms with reduced sample/computational complexity (and support active learning). The features used for exploiting these regularities need not be those used for learning a value function/policy. This introduces a rich set of research directions to explore, including how such feature sets may be learned from experience. These are just a few examples of possible research directions; we expect that different researchers will see different ways of moving forward. These regularities truly introduce a new perspective.

Reviewer 4
1) The regularities we discuss are fundamentally different from those that are widely studied in machine learning. There is, however, some synergy, and we would be happy to discuss it explicitly in the paper.
-- Margin may be arbitrarily small or large, regardless of the dominance relationship between the two alternatives. But we may generally expect the following: when one alternative dominates the other, each feature contributes positively to the margin, resulting in a relatively high margin (see the sketch after this list). How widely this happens is an empirical question and will vary between problems. In Tetris, we found a strong effect: the mean margin is 24.5 when one action dominates another and 14.7 otherwise.
-- Action gap may be arbitrarily small or large, regardless of the dominance relationship. But if the evaluation function is a good estimate of the action values, we may expect a higher action gap when one action dominates another. We could not analyze this in Tetris because we do not know the action values (the evaluation function gives us a good policy but does not estimate action values well).
-- Sparsity generally supports all three regularities because it automatically satisfies some of the constraints that must hold.
-- We assume a linear decision surface. Simple dominance may be extended to more complex decision surfaces; in that case, we expect smoothness to be irrelevant.
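As a concrete illustration of the margin point, here is a minimal Python sketch (with hypothetical feature names and made-up weights, not the weights used in the paper): under a linear evaluation function, the margin between two placements is a sum of per-feature terms, and when one placement simply dominates the other, every term is nonnegative.

    def margin(weights, feats_a, feats_b):
        """Signed margin V(a) - V(b) under a linear evaluation function."""
        return sum(w * (fa - fb) for w, fa, fb in zip(weights, feats_a, feats_b))

    def dominates(weights, feats_a, feats_b):
        """a simply dominates b if a is at least as good on every feature,
        where 'at least as good' is given by the sign of that feature's weight."""
        return all(w * (fa - fb) >= 0 for w, fa, fb in zip(weights, feats_a, feats_b))

    # Hypothetical example with three features (e.g., holes, landing height, eroded cells).
    weights = [-4.0, -1.5, 2.0]   # signs encode whether more of a feature is good or bad
    feats_a = [0, 3, 4]           # placement a is at least as good on every feature...
    feats_b = [2, 5, 1]           # ...as placement b, so a simply dominates b
    assert dominates(weights, feats_a, feats_b)
    print(margin(weights, feats_a, feats_b))   # 17.0: every feature pushes the margin up

When neither placement dominates the other, positive and negative terms can cancel, which is consistent with the lower mean margin (14.7 vs. 24.5) reported above.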
2) We only need simple properties of the evaluation function (the signs and the order of the weights), not the full set of weights. One possibility is for the human designer to use domain knowledge to identify these properties; for instance, in Tetris, any human player would correctly identify the signs and (at least partially) the order of the features. But it should also be possible to learn them from experience, and far more quickly than the full evaluation function.

3) The observed behavior is not a universal property of linear functions. One property that supports such behavior is high correlation among the features. If the features are independent, simple dominance and noncompensatoriness are very unlikely to hold. For cumulative dominance, a detailed set of results is available in Baucells et al. (2008).
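As a rough illustration of the role of correlation (our own sketch, not an experiment from the paper), the following Python simulation shows that when the feature values of two alternatives are drawn independently, simple dominance between them almost never holds, whereas inducing strong correlation through a shared component makes it common.

    import random

    def dominance_rate(correlation, n_features=8, trials=10000):
        """Fraction of random alternative pairs in which one simply dominates the other."""
        count = 0
        for _ in range(trials):
            # A shared component per alternative induces correlation among its features.
            shared_a, shared_b = random.gauss(0, 1), random.gauss(0, 1)
            a = [correlation * shared_a + (1 - correlation) * random.gauss(0, 1)
                 for _ in range(n_features)]
            b = [correlation * shared_b + (1 - correlation) * random.gauss(0, 1)
                 for _ in range(n_features)]
            if all(x >= y for x, y in zip(a, b)) or all(x <= y for x, y in zip(a, b)):
                count += 1
        return count / trials

    print(dominance_rate(0.0))   # independent features: dominance is rare (about 2 * 0.5**8)
    print(dominance_rate(0.9))   # strongly correlated features: dominance is frequent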