Paper ID: 703
Title: Importance Sampling Tree for Large-scale Empirical Expectation

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The authors consider a subsampling technique for calculating empirical expectations in the context of supervised learning problems. The method relies on the idea that not all regions of the feature space are typically equally informative when training a classifier, and uses importance sampling to preferentially select more informative points. The proposed technique uses a binary tree, manually designed in a problem-specific manner, to partition the feature space into regions in which points are likely to have similar weights in the empirical expectation. Adaptive empirical estimates of the average weight under each internal node can then be used to define a proposal distribution that assigns higher probability to points that will more strongly influence the empirical expectation. The technique is evaluated in four case studies covering boosting, variance reduction in SGD when training neural networks, and non-linear SVM prediction.

Clarity - Justification:

While the writing is generally good and the notation is reasonably precise, there are a number of issues of clarity, particularly in the introduction of the method.

- The introduction could do a better job of explaining the core ideas behind the approach (a minimal sketch of my reading follows these comments):
  1. Come up with some way of manually partitioning a set of weighted points using a binary tree.
  2. Use this tree to approximate a sum over these data points in settings where not all points contribute equally.
- The introduction and related work section point out lots of connections, such as to MCTS, but because it is not clear at this stage of the paper what the authors will be doing, these connections are more confusing than helpful.
- In section 3.1, the notion of casting various empirical expectations as weighted sums is not very clearly explained. In particular, the authors should make it clear that the terms in equations (2)-(4) are factorized in such a manner as to ensure that f(n) can be either positive or negative, but has a magnitude of order 1.
- A somewhat confusing aspect of the set-up is that the importance sampling weights w_n / μ(n) are themselves defined in terms of a set of weights w_n. This makes it difficult to parse phrases such as
  > where s(x) is the sum of the weight estimates w_n / μ_Θ(n; x)
- In section 3.3.1, it is not clear what is meant by
  > We define Φ(x), a function of the upper bound u(x) and lower bound l(x) at node x and we set θ_x to be Φ(c_1(x)) / ...
  We later learn that Φ(x) is defined in terms of u(x) and l(x), but is not a function of it.
- Section 3 would benefit from some up-front discussion of how to design the tree splits. The proposed approach implicitly relies on one being able to design a tree in which co-located points are likely to have similar weights.
- In section 4.1 it would be helpful to define the linear stumps mathematically. Similarly, I can only partially understand what is meant by
  > Each stump is trained by sampling uniformly 100 directions in the 2D plan [sic], and optimizing exactly its bias, and its weight in the strong predictor.
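To be concrete about the two-step reading above, here is a minimal sketch of how I understand the estimator. The names, the array-based tree layout, and the proposal taken exactly proportional to the stored weight estimates are my own illustration, not the authors' implementation:

```python
import random

# Sketch: a binary "sum tree" over N weighted points. Each internal
# node stores the total estimated weight mass of its subtree, so a
# leaf can be sampled proportionally to its estimate in O(log N).
class SumTree:
    def __init__(self, estimates):
        self.n = len(estimates)
        self.tree = [0.0] * (2 * self.n)
        self.tree[self.n:] = list(estimates)      # leaves: per-point estimates
        for i in range(self.n - 1, 0, -1):        # internal nodes: subtree sums
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def sample(self):
        """Draw a leaf n with probability mu(n) proportional to its estimate."""
        i, u = 1, random.random() * self.tree[1]  # root holds the total mass
        while i < self.n:                         # descend: go left if u falls
            left = self.tree[2 * i]               # within the left subtree mass
            if u < left:
                i = 2 * i
            else:
                u -= left
                i = 2 * i + 1
        return i - self.n, self.tree[i] / self.tree[1]

def is_estimate(tree, w, f, num_samples):
    """Unbiased estimate of sum_n w_n * f(n): average the terms w_n f(n) / mu(n)."""
    acc = 0.0
    for _ in range(num_samples):
        n, mu = tree.sample()
        acc += (w[n] / mu) * f(n)
    return acc / num_samples

# Toy usage: a few points dominate the sum, so a proposal that follows
# the weights concentrates the samples where they matter.
w = [0.01] * 95 + [2.0] * 5
f = lambda n: 1.0 if n % 2 == 0 else -1.0         # bounded, order-1 terms
print(is_estimate(SumTree(w), w, f, 1000))
```

If the stored estimates equal the true weights, every term (w_n / μ(n)) f(n) reduces to (Σ_m w_m) f(n), so the residual variance comes only from f; this is the sense in which a tree whose regions group points of similar weight should help.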
Significance - Justification:

I quite like the main idea presented in this paper. That said, there is some question in my mind as to how general the proposed approach is. On the one hand, the authors do a good job of demonstrating that the method can in principle be applied to a range of problems. On the other hand, the design of the IST can require a fairly large amount of problem-specific engineering, particularly in the case of the SVM study.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

- A slightly odd and somewhat confusing property of this method is that while data points are indeed selected according to an importance sampling method at each stage in the training/learning process, the proposal distribution is adapted relative to a non-stationary target. For example, in the case of boosting, different subsets of points will have high weights at each iteration, but the empirical estimates in the IST are computed across iterations. I can't immediately see much that could go wrong here, though I suspect the proposal may in some cases overfit to certain regions of the space based on early results. Some discussion here would be warranted.
- It would be helpful to see a bit more discussion of the complexity of the IST. In principle, sampling and updating the tree would seem to be O(log N) operations (see the sketch after these comments). Is the computational overhead also insignificant in practice? How expensive/cheap is it to cluster the examples in the CIFAR and SVM experiments?
- It would be nice to see evaluation on more datasets in the SVM example. The Covertype dataset in particular seems almost hand-picked to be particularly well-suited to this approach, given that only ~100 out of 100k points actually influence the decision surface.
- Typo: "in the 2D plan" -> "in the 2D plane"
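On the complexity point above: with an array-based layout like the one in my sketch under the clarity comments, refreshing the proposal after observing a new weight for a sampled point only touches the leaf-to-root path, so both sampling and updating are O(log N). Again a hypothetical sketch under my assumed layout, not the authors' code:

```python
def update_estimate(tree, num_leaves, n, new_estimate):
    """Replace the stored estimate for point n and repair ancestor sums.

    `tree` is the array of an array-based sum tree (root at index 1,
    leaves at indices num_leaves .. 2*num_leaves - 1). Only the
    leaf-to-root path changes, so the cost is O(log N).
    """
    i = num_leaves + n
    delta = new_estimate - tree[i]
    while i >= 1:
        tree[i] += delta   # each ancestor's subtree mass shifts by delta
        i //= 2
```

In the boosting case, for instance, the weight observed for the sampled point at the current iteration would be folded in here (perhaps as a running average rather than a hard replacement), which is exactly where the non-stationarity concern above enters: estimates in regions that stop being sampled are only refreshed when those regions are revisited.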
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper describes a tree-based algorithm for calculating weighted sums using importance sampling. It aims to preferentially sample "more important" elements of the sum, which in turn leads to more accurate, yet still unbiased, estimators of the sum. The general idea is very standard; the novelty is in the specifics of the algorithm. The algorithm assumes a tree structure is available, and two different strategies are defined for sampling from the tree.

Clarity - Justification:

Some of the English could be improved. As an example, in the 4th paragraph of Section 2 we have "has gain much", "up the the root", and "the next samplings". Also, the nature of this sort of contribution is that it produces an algorithm with many things the user has to choose (how to generate the tree, how to choose the functions that define the sampling mechanism, etc.). As such, it is hard to know how much the "good" results presented are due to "good" choices of these being made (and whether someone taking up this method would be able to find such good choices, e.g. know how to choose the tree, or know when the method would work well).

Significance - Justification:

This seems to be a new algorithm implementing a standard "importance sampling" idea. There are good empirical results, but it is not clear how much these depend on choices made in implementing the algorithm (there are no theoretical results guiding implementation). As such, it feels like a somewhat incremental contribution.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

My main reservations about this paper are contained in the earlier answers. It is a reasonable contribution, but does not feel to quite have the novelty, theoretical justification, or compelling empirical evidence expected of an ICML paper.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The authors devise a tree-based procedure for constructing proposal distributions within importance sampling on large discrete state spaces, with the aim of efficient sub-sampling when computing weighted sums, which is relevant in many machine learning settings (as detailed in the paper with several relevant examples). It is intuitive that the devised procedure focuses on the most important samples for the given problem, and this is demonstrated very clearly in the figures, which I liked. A concern I have is how to construct the tree in general. The authors detail ways of doing this in each example, but these seem fairly arbitrary. The authors comment on this in the conclusion, but I wonder whether they have thoughts on how much tree construction can affect the quality of sampling in general?

Clarity - Justification:

I like the fact that the authors discuss the 3 examples in great detail, and as I've already stated, I think the figures tell a nice story.

Significance - Justification:

Efficient inference using a sub-sample of the data is a relevant problem at the moment, and it looks to me as if the authors have devised a strategy that works better than state-of-the-art approaches in some scenarios, so I think this is a sensible contribution.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

Comments by line number:
86. Monter-Carlo -> Monte-Carlo

=====