Paper ID: 1043
Title: Stratified Sampling Meets Machine Learning

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper introduces an approach to learn the query-optimal subsampling probabilities for a database based on previous queries. The learning problem takes the form of an unusual regression problem, with a regularization term related to the sampling budget. The regression is solved with a very simple algorithm. The authors perform convincing experiments on 3 different databases with a set of training queries.

Clarity - Justification:
I found the paper very easy to follow. In a sense, the discussion of the machine learning concepts is sometimes too didactic, given that this is ICML; maybe the database-specific problems could be made more didactic instead.

Significance - Justification:
The learning framework is simple, but, as far as I know, it touches a problem that has not been studied much by the learning community and that is of interest to that community.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I do not have much negative criticism to give. I liked the paper.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
Given a database of records, the goal is to find a sampling mechanism over records that answers queries accurately using only a budgeted subset of records.
- The sampling probabilities are not query-dependent.
- The assumption is that the distribution of queries is the same for the training set and the test set.
The paper proposes to find the sampling probabilities by minimizing the empirical regret of using a sampled subset to estimate the query output versus using the whole record database. To avoid the trivial solution that takes all samples, a first constraint is added: an average budget constraint. To avoid over-fitting to the training queries, the proposed approach uses defensive sampling, i.e. ensuring that every item has a minimal probability of being sampled. The authors propose an algorithm to solve this minimization and test the method on several datasets.

Clarity - Justification:
I am not familiar enough with the setting to judge the relevance / exhaustiveness of the related work, but the problem is well presented and motivated.
From the theoretical point of view, I would have appreciated a discussion of the links with importance sampling. If I am not mistaken, for a single query the method boils down to finding the importance sampling distribution that minimizes the variance of the estimate under the budget constraint. In some sense, the method aims at finding the optimal distribution of a multi-criteria importance sampling problem (under constraints).
In terms of experimental justification, it would have been more convincing to have stronger baselines, such as the cited related work of Joshi and Jermaine (2008).

Significance - Justification:
I think the problem is impactful and the proposed solution is simple and efficient. However, I would have appreciated a discussion of the multi-task aspect of the problem (each query can be seen as a different task) and of the possibility of using an aggregation criterion other than the average reconstruction error. For instance, one could want to optimize for the worst reconstruction over the queries: would the optimal algorithm fundamentally differ from the one presented here?
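To make this question concrete (in my own notation, which may not match the paper's): writing $p \in [p_{\min}, 1]^n$ for the per-item sampling probabilities, $L_q(p)$ for the expected squared reconstruction error of training query $q$ under $p$, and $B$ for the budget, my understanding is that the paper solves the average-case problem
$$\min_{p \in [p_{\min},1]^n} \; \frac{1}{m}\sum_{q=1}^{m} L_q(p) \quad \text{s.t.} \quad \sum_{i=1}^{n} p_i \le B,$$
whereas the worst-case variant I have in mind would be
$$\min_{p \in [p_{\min},1]^n} \; \max_{1 \le q \le m} L_q(p) \quad \text{s.t.} \quad \sum_{i=1}^{n} p_i \le B.$$
The max couples the items across queries, so it is not obvious to me that the closed-form, one-pass solution carries over directly.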
A discussion or some experiments about how the sampling distribution that is optimal for a set of queries differs from the pointwise-optimal distribution of each individual query, depending on the similarity of the queries, would also be welcome.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, I liked the paper, but I think it could have a more complete discussion of the setting and a better experimental comparison with existing work (see the paragraphs above).
As a remark, for step 6 of the algorithm (the line search): isn't it possible to do better than a brute-force line search by reformulating it as a constrained minimization? It looks like it could be solved with some iterative projection scheme.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
See below.

Clarity - Justification:
See below.

Significance - Justification:
See below.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper proposes a sampling approach that minimizes the empirical risk of queries. The work tries to improve query efficiency via sampling. Previous work showed that uniform sampling achieves the smallest error in the worst case. This work assumes that the future queries and the historical queries have the same distribution. Why is this assumption justified in these settings? Is it realistic, and does it reflect workloads from a real-world scenario? Given this assumption, the work tries to find a sampling distribution that minimizes the squared error of the queries. The problem is formulated as an optimization problem where the objective function is the squared loss and the regularizer is the sampling/storage cost. The authors give a closed-form solution for the optimizer by solving the Lagrangian dual, and they provide an upper bound on the generalization risk. They develop a one-pass algorithm to obtain the sampling probabilities on the training set. They compare their algorithm with three baselines on several datasets. Experiments show that their algorithm achieves the smallest errors on the test set. The authors also define "hard" and "easy" queries in terms of the cardinality of the query result, and they show that their algorithm performs better than uniform sampling for "easy" queries.

It would have been nice if they had looked at, and compared with, more recent work in the database literature. I list several here (some of which are cited but not compared with, as noted previously). Specifically for cube queries, I would encourage them to compare with some of the ideas in BlinkDB (which they cite), in DICE (Kamat, Nandi et al., ICDE 2014), and in the work by Jin et al. (ICDE 2006). Also see the work by Wang et al. (ICDE 2013) on the role of stratified data placement, including optimal allocation -- a generalization of Neyman allocation (the classical formula is recalled below). More broadly, comparisons with the work by Jermaine (Rice) et al. (one work is cited and discussed in the rebuttal, but there are others), with Haas et al. (IBM Almaden), as well as with generalizations of Chaudhuri et al.'s work from MSR, are warranted.

It would have been nice if they had looked at a realistic workload instead of manufacturing the data and/or queries (CUBE and DBLP). They have results on one dataset from Yahoo; it is not clear whether this dataset is publicly available for reproducibility (a stated goal of the authors in the rebuttal).
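For reference, the classical Neyman allocation I allude to above is the following (standard textbook form, not taken from the paper): with strata $h = 1, \dots, H$ of sizes $N_h$ and within-stratum standard deviations $\sigma_h$, and a total sample budget of $n$, the variance-minimizing allocation is
$$n_h = n \cdot \frac{N_h \sigma_h}{\sum_{k=1}^{H} N_k \sigma_k}.$$
The Wang et al. (ICDE 2013) work mentioned above generalizes this kind of optimal allocation to the data placement setting; comparing against such stratified allocations would strengthen the experiments.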
Comments:
1. More details on all the formulas/equations are expected, perhaps as an appendix or supplementary material. Many of them are not well defended.
2. It seems that the training time is linear in the size of the raw dataset. The comparison among baselines could be shrunk and the time cost in seconds/hours added.
3. The work does not seem to consider data dependency. Hence parallelized querying seems to be a natural way to improve efficiency; how does this work compare with a simple parallel solution?

=====