Paper ID: 809
Title: Why Most Decisions Are Easy in Tetris---And Perhaps in Other Sequential Decision Problems, As Well

Review #1
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents an analysis of the solutions of decision problems. While previous related work considered one-step decision problems, this article focuses on sequential decision problems. The article looks at several "conditions" that allow one to compare different candidate solutions (in this case usually called policies) by aggregating the features they are based on. The main contribution is the empirical observation that good policies can almost be discriminated (or at least that most bad actions can be eliminated) by looking at a small number of such "conditions". This suggests that the problems we want to apply machine/reinforcement learning to may have a specific structure, which may be exploited algorithmically to improve current optimization algorithms. The paper focuses on the Tetris domain.

Clarity - Justification:
I did not know anything about the state of the art in the analysis of decision rules through "conditions". The authors do a very good job of explaining this to a neophyte like me.

Significance - Justification:
This paper is special in the sense that most of it does not contain any new analysis or any really new algorithm. It essentially presents an empirical analysis of some approximate solutions to some sequential decision problems. In this respect, it is a bit different from usual ICML submissions. This being said, on the positive side:
- the authors do a good job of making their ideas accessible to the machine/reinforcement learning communities;
- the ideas presented may open some interesting new directions in reinforcement learning.
On the negative side:
- the work may seem a little incremental with respect to previous related work on single-shot problems (Simsek et al., NIPS 2015).
Overall, I like the ideas presented, and I think that applying this analysis to a popular benchmark problem like Tetris highlights some interesting facts.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper and its contributions are clear. I don't have any major comments. Here are some minor ones.
1) I was a bit surprised by the performance of the Tetris controllers. In the literature, rather old works report average scores in the tens of thousands, and the best (BCTS) is claimed to clear millions of lines. This is probably not fundamental to the message of your paper, but how do you explain the difference with your own implementation (median of 30,474)?
2) A few words on the meaning of prevalence may be useful to the reader (it at least prompted me to look up a definition).
Line 38: different than -> different from
Lines 50 and 242: "Tetris, one of the most well-known and liked video games of all time". You probably don't need to oversell Tetris. As far as ICML is concerned, it has been a benchmark for many algorithms and several competitions.

=====
Review #2
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper studies the question of whether certain simple heuristics can accurately lead to the same decisions as the much stronger BCTS Tetris-playing agent does. These heuristics are simple rules that potentially make "correct" decisions without using all the information that is available to the "optimal" decision maker. For example, in the case of the "comparison problem", which can be seen as a classification problem, the ground truth is a linear function of some features with known weights. Can we still make the same decision as this ground truth without actually knowing the weights and the features exactly? Depending on what we know and what we do not, different notions such as simple/cumulative dominance and noncompensation may be defined. For instance, if a decision problem has the property of simple dominance, we do not need to know the weights or the exact values of the features to make the correct decision. It is enough to know whether all the features of one alternative are greater than or equal to those of the other (with at least one of them strictly greater). This is a property of the decision problem.
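For concreteness, this is how I understand the two dominance tests; the sketch below is my own paraphrase of the notions, not code from the paper, and it assumes the features have already been oriented so that larger values are better and (for the cumulative test) ordered by decreasing weight magnitude:

```python
import numpy as np

def simple_dominance(xa, xb):
    """A simply dominates B: every feature of A is at least as good as
    the corresponding feature of B, and at least one is strictly better.
    No knowledge of the weight values is needed, only their signs
    (assumed already folded into the features)."""
    xa, xb = np.asarray(xa, float), np.asarray(xb, float)
    return bool(np.all(xa >= xb) and np.any(xa > xb))

def cumulative_dominance(xa, xb):
    """A cumulatively dominates B: every prefix sum of A's features is
    at least the corresponding prefix sum of B's, with at least one
    strictly greater. With features ordered by decreasing weight, this
    guarantees A's linear score is at least B's for any non-negative,
    non-increasing weight sequence."""
    ca = np.cumsum(np.asarray(xa, float))
    cb = np.cumsum(np.asarray(xb, float))
    return bool(np.all(ca >= cb) and np.any(ca > cb))
```

Note that simple dominance implies cumulative dominance, so the cumulative test can only prune more alternatives.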
In some previous work (e.g., Simsek, "Linear Decision Rule as Aspiration for Simple Decision Heuristics," NIPS, 2013), this was studied for the comparison problem. Here the same question is studied for a sequential decision-making problem, with a focus on Tetris. The way it is done is based on a certain Tetris-playing agent's evaluation function (BCTS, by Thiery and Scherrer, 2009). I am not familiar with that work, but I believe this evaluation function can be thought of as a close-to-optimal action-value function of the game. So all the previous concepts from the comparison problem, which were based on a linear function of features, can be brought to bear on the structure of the value function too. Not surprisingly, the paper shows that these structures actually exist in that evaluation function, especially if the state distribution comes from an agent that plays the game well (either BCTS or humans).

Clarity - Justification:
The paper is clearly written, but the connections to the rest of machine learning, especially to various regularities that are already known, are not clarified at all.

Significance - Justification:
The idea behind the paper is interesting, but I am not sure whether it will lead to any significant insight or new possibilities for designing new algorithms. Please read my detailed comments.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This is an interesting paper. It approaches the study of the structure or regularity of decision problems from an angle that might be new to many in the ICML community. So in this sense, the paper seems to be novel, as far as I am aware. But I have several concerns and questions:
1) Since ICML is a machine learning venue, and not a cognitive science one, I hoped to see a bit more connection with what we already know in this community about the regularities of a problem. So my first issue is why we do not see much of that connection here. These questions are all related to this issue:
- What is the connection between the dominance and non-compensation types of regularity and regularities such as the margin in classification, sparsity of the weight vector, and smoothness of the decision surface (of either the classifier or the value function), as well as the action-gap regularity of sequential decision-making problems?
- It appears that both concepts of dominance are essentially about the margin of the classification problem. Is that the case?
- Non-compensation seems like a property of how fast the weights decay towards zero (after being sorted). This seems like an approximate sparsity property of the weights (or, maybe more accurately, the weights being a weak lp sequence); see the sketch after this list.
- We have the concept of action-gap regularity for sequential decision-making problems. It describes how close the action-values of different actions are, and what effect this has on decision making. It is also related to the low-noise margin condition in classification. The concept of action-gap might be very relevant to this work. What are the connections?
I believe the paper should make these connections explicit. These are regularities that machine learning researchers are more familiar with. Making an attempt to connect with these concepts would be very valuable; without that, this work remains somewhat disconnected from the rest of the field.
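To be concrete about the decay property I have in mind, one standard way to formalize noncompensation (my own sketch, not the paper's code) is to require that, once sorted, each weight outweighs all smaller weights combined; with binary features, a linear rule with such weights makes the same decisions as a lexicographic comparison:

```python
import numpy as np

def is_noncompensatory(w):
    """Check whether, with weights sorted by decreasing magnitude, each
    weight exceeds the combined magnitude of all smaller weights,
    i.e., |w_i| > |w_{i+1}| + ... + |w_n| for every i."""
    a = np.sort(np.abs(np.asarray(w, float)))[::-1]  # decreasing magnitudes
    tails = np.cumsum(a[::-1])[::-1]                 # tails[i] = a[i] + a[i+1] + ...
    return bool(np.all(a[:-1] > tails[1:]))
```

For example, weights decaying geometrically like 1, 1/2, 1/4, ... satisfy the condition, while slowly decaying weights do not.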
2) The discussion of how to actually benefit from these properties (Section 6.2) is limited. If I understand correctly, these properties (e.g., Pareto-simple and Pareto-cumulative) are used to obtain new samples. But then, if we do not actually know the evaluation function (which is the action-value function), how can we generate samples according to these Pareto-* properties? Is it possible to design an estimator that can actually benefit from these properties when they exist? For instance, if a decision problem has the cumulative dominance property, it is enough to know the order in which the feature weights decrease. Can an estimator be designed to explicitly find only the order of the weights and not their exact values?
3) Is the observed behaviour a universal property of any linear function with weights and features coming from a vast range of probability distributions? What is special about the distributions generating the weights and features that leads to these properties? To be more precise, suppose that instead of the weights in Table 1, we choose weights from an exponential (or Gaussian) distribution. We then sort them based on their magnitude. Afterwards, we sample x's from another distribution and verify, for example, the non-compensation property. Is it possible that we would actually observe similar behaviour?
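A minimal sketch of the experiment I am proposing follows; the exponential weights, binary features, and sample sizes are arbitrary choices for illustration, and agreement with a lexicographic decision stands in for checking the behavioural consequence of non-compensation:

```python
import numpy as np

rng = np.random.default_rng(0)

def lexicographic_agreement(n_features=8, n_pairs=100_000):
    """Draw weights from an exponential distribution, sort them in
    decreasing order, draw random binary alternatives, and measure how
    often deciding by the first (most heavily weighted) differing
    feature agrees with the full linear rule."""
    w = np.sort(rng.exponential(size=n_features))[::-1]
    xa = rng.integers(0, 2, size=(n_pairs, n_features))
    xb = rng.integers(0, 2, size=(n_pairs, n_features))
    diff = (xa - xb)[np.any(xa != xb, axis=1)]   # drop identical pairs
    full = np.sign(diff @ w)                     # decision of the linear rule
    first = np.argmax(diff != 0, axis=1)         # first differing feature
    lex = np.sign(diff[np.arange(len(diff)), first])
    return float(np.mean(lex == full))

print(lexicographic_agreement())
```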
=====
Review #3
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper asks whether problems in sequential decision-making are easy. The introduction defines "easy" as "choosing well among alternative actions without knowing an evaluation function that scores well in the game", which is somewhat confusing to me. From the paper, I gather that "easy" is more appropriately defined as the optimal (or, because this is unknown for the large problems, a "good") action being among a set pruned by the simple, cumulative, or lexicographic (implicitly assuming a noncompensatory property for correctness) definitions of dominance. Here the dominance relationships are defined over problem-specific feature vectors. The conclusion is that for Tetris and Backgammon (and specialized feature vectors for each), the discussed pruning methods are highly effective at pruning out suboptimal actions. Hence many sequential decision problems may be easy in the sense that one does not need a full weighting of all features to prune by the various methods (one needs only the signs of the weights for simple dominance, and an additional ordering for cumulative and noncompensatory dominance).

Clarity - Justification:
Exceptionally well-written.

Significance - Justification:
Significant for asking an important and thought-provoking question that *may* lead to follow-on research that more directly automates some of the ideas explored.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is exceptionally well-written and creative in its approach and evaluation w.r.t. expert data... I thoroughly enjoyed reading it, and it made me think seriously about questions that had not come to mind before (simply because I was unaware of the background literature). However, given that the one-shot case had already been studied in the literature, I am left wondering what exactly I should take away from this paper's extension to the sequential case. My major concern is that, having played both Tetris and Backgammon, I can look at the feature vectors and immediately see why the dominance relationships in the paper likely imply good action pruning. I worry that the take-home point of the paper is that with good feature engineering, sequential decision-making is easy. This would not be surprising. The paper raises deeper fundamental questions:
- What are the properties of the features themselves that support effective dominance-based pruning? Can such features be automatically extracted from (a) domain analysis, (b) traces of expert play, or (c) during learning (as a way to speed up learning)?
- Which problem domains are likely to admit such obvious feature vectors that facilitate effective dominance-based pruning under the different definitions?
- Can we learn the signs of the weights (and their order) for pruning more quickly than we can learn decent weights for action selection? How would we quickly learn just the signs?
Some of these are already discussed in Future Directions. That said, answering any of these latter questions would help advance the paper from description to prescription and provide a clear path forward for research. For now, my take-away point is that good feature engineering allows efficient pruning techniques for (nearly) optimal sequential decision-making, but the paper does not actually inform reinforcement learning beyond that somewhat intuitive point. I am somewhat on the border for acceptance at this point, and I would appreciate hearing from the authors why (and how) they feel this paper will advance state-of-the-art automated algorithms in sequential decision-making / reinforcement learning.

=====