Paper ID: 181
Title: Adaptive Sampling for SGD by Exploiting Side Information

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes an adaptive scheme for sampling training examples for SGD-based methods. Instead of maintaining a distribution over all training examples, which is typically computationally prohibitive, the authors suggest estimating a distribution over a small number of bins of instances that share the same side information/context (e.g., the same class label). Extensive numerical experiments on multiple datasets show that the proposed method yields faster convergence than existing schemes.

Clarity - Justification:
The paper is easy to follow. The main results are stated clearly and previous work is summarized adequately. The experiments conducted are extensive and documented appropriately.

Significance - Justification:
The paper proposes a very simple yet efficient solution for maintaining adaptive sampling distributions that accelerate the convergence of SGD-type methods without increasing computational costs. While the theoretical justification/analysis does not imply a measurable improvement in convergence, the empirical performance of the method is impressive.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The idea is simple but seems very natural and quite effective in practice, even when used in conjunction with other variance-reduction techniques such as mini-batching and SVRG. The paper does not contain a theoretical contribution---the analysis is based on a straightforward adaptation of standard importance-sampling results and does not imply a measurable improvement. Nevertheless, the simplicity of the method is appealing, and its empirical performance is impressive and substantial enough to warrant acceptance.

Additional comments:
* Line 382: I do not see why the inequality holds; please fix the math or provide more details.
* Algorithm 1, lines 12-13: this should be executed for all k=1..K.
* Line 433: "true distribution" => "uniform distribution"?
* When citing a series of papers (e.g., in line 145), please use a single \cite command with multiple arguments.
* Eqs (1), (4), (6)-(10), ...: end each sentence with a period.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes an adaptive sampling scheme for speeding up the stochastic gradient descent algorithm. The key idea is to adaptively sample points so as to reduce the variance of the SGD gradient estimate. In summary, the algorithm is:
1. Divide the dataset into bins based on class labels.
2. Sample examples with probability proportional to the L2 norm of the gradient for the class (the strength of the gradient). Since this cannot be computed exactly, use cheap sampling-based approximations.
3. Update the bin probabilities periodically.
The authors show that this scheme converges faster with epochs.

Clarity - Justification:
The paper is an easy read.

Significance - Justification:
The authors completely ignored the active learning literature. The idea that choosing points with high gradients (or close to the hyperplane) leads to faster convergence is well known. The setting of active learning may be different, but it can be naturally used to speed up stochastic gradient descent. See the following literature (not cited):
1) S. Tong and D. Koller. Support Vector Machine Active Learning with Applications to Text Classification. In Proceedings of the International Conference on Machine Learning, 2000.
2) G. Schohn and D. Cohn. Less is More: Active Learning with Support Vector Machines. In Proceedings of the International Conference on Machine Learning, 2000.
3) C. Campbell, N. Cristianini, and A. Smola. Query Learning with Large Margin Classifiers. In Proceedings of the International Conference on Machine Learning, 2000.
4) Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning. In Advances in Neural Information Processing Systems (NIPS), 2010.
The idea that choosing data points with high gradients leads to faster convergence is not new at all.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, the idea of partition-based sampling is nice and seems reasonable, and it is exciting to see that it beats AdaGrad. However, four things need to be improved to bring the paper up to the mark of ICML:
1. First of all, since the method does a little extra work (computing the p_i's, etc.), the accuracy-per-epoch comparison is not fair (otherwise one could just compute the Hessian). The comparison should be on time versus accuracy to really see whether the method is useful (see other literature).
2. Compare it with other active-learning-based ideas, for example "Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning" (NIPS 2010).
3. Cite and place this work properly with respect to existing ideas in the active learning literature.
4. Provide some estimate of how much the cheap way of updating the p_i's hurts. Maybe compare it with the exact computation (even if it is slow).

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper describes a simple adaptive sampling method for stochastic gradient algorithms. The main idea is to partition the data using side information (for instance, the target classes). Each round picks one example by first selecting one partition component with adaptive probabilities, then picking one example inside that partition component with uniform probabilities. The partition sampling is compensated by a corresponding importance weight to ensure the scheme is unbiased. The optimal partition sampling scheme minimizes the variance of the stochastic gradient estimates.
However, it is more efficient to devise an adaptive scheme that estimates the variance of the gradient within each partition and derives the corresponding sampling scheme. This simple idea does not need much theoretical justification (more precisely, the theory is pretty trivial). Experiments show good results on three datasets. More impressive results would have been obtained on unbalanced classification problems (but this would not have been new by any means). The authors do a good job of explaining that little should be expected from this scheme if the side information is not a good predictor of the gradient sizes.

Clarity - Justification:
The paper is very clear. The related work section is misleading. The references in the "momentum methods" section are not references to momentum methods but references to variance-reduction methods, which are completely different. The authors claim that they are not aware of methods using adaptive sampling, yet they have a whole section on selective sampling (which is indeed adaptive, with a finer-grained adaptation scheme). Adaptive sampling techniques have been proved to be very efficient (e.g., Bordes et al., JMLR 2005, among others).

Significance - Justification:
The effect that the authors describe is certainly real. One example is the case of unbalanced classification problems, where it is well known that one should downsample the majority class and correct with importance weights (although this correction almost perfectly balances the asymmetric loss that is often implied in unbalanced classification). Given this known technique, what the authors describe is pretty minor.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
+ clarity
- questionable novelty
- related work section that overstates novelty
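For concreteness, the partition-based importance sampling that Reviews #2 and #3 summarize can be sketched roughly as follows. This is a minimal illustration under the reviewers' description, not the authors' implementation; all function and parameter names are hypothetical, and the mixing of the estimated bin probabilities with the uniform distribution is a stabilizer assumed here, not taken from the paper.

```python
import numpy as np

def adaptive_bin_sgd(X, y, grad_fn, w0, n_steps=2000, lr=0.5, refresh=50, seed=0):
    """Sketch of the reviewed scheme: bin examples by class label, sample a
    bin with probability proportional to a cheaply estimated per-bin gradient
    norm, sample an example uniformly within the bin, and apply an importance
    weight so the stochastic gradient remains an unbiased estimate of the
    full-data gradient."""
    rng = np.random.default_rng(seed)
    n = len(y)
    bins = [np.flatnonzero(y == c) for c in np.unique(y)]
    K = len(bins)
    p = np.full(K, 1.0 / K)              # bin-sampling probabilities
    w = np.array(w0, dtype=float)

    for t in range(n_steps):
        if t % refresh == 0:
            # Periodic cheap refresh of p: estimate each bin's gradient
            # strength from a single sampled example (an assumption here;
            # the paper's exact estimator may differ), then mix with the
            # uniform distribution to keep importance weights bounded.
            est = np.empty(K)
            for k, b in enumerate(bins):
                i = rng.choice(b)
                est[k] = np.linalg.norm(grad_fn(w, X[i], y[i]))
            if est.sum() > 0:
                p = 0.5 * est / est.sum() + 0.5 / K
        k = rng.choice(K, p=p)           # sample a bin
        i = rng.choice(bins[k])          # sample uniformly inside the bin
        # weight n_k / (n * p_k) makes E[weight * grad] the average gradient
        weight = len(bins[k]) / (n * p[k])
        w -= lr * weight * grad_fn(w, X[i], y[i])
    return w
```

With uniform bin probabilities the weight is 1 and this reduces to plain SGD; a per-example `grad_fn` for logistic regression, say, would be `lambda w, x, yi: (1 / (1 + np.exp(-x @ w)) - yi) * x`.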
===== Review #4 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a variation of stochastic gradient descent based on non-uniform sampling of training examples. The algorithm centers on adaptive sampling rates for different classes of data, where the class of each example is defined either by the associated classification label or by side information. Experimental results show small improvements over uniform and non-adaptive sampling distributions.

Clarity - Justification:
The paper is well written and relatively clear.

Significance - Justification:
The main idea of interest in the paper is the exploitation of latent structure, based on class labels or side information, to design adaptive sampling distributions for stochastic gradient descent. The idea appears novel; however, there is little evidence to justify the underlying assumption.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The authors present an interesting concept for improving stochastic gradient descent. However, the assumption of latent structure based on class labels or side information is not truly demonstrated on the experimental data: there is marginal improvement on the ALOI dataset, minimal improvement on the CIFAR100 dataset, approximately equivalent performance on IPC and news20, and worse performance on the covertype dataset compared to uniform sampling. Additionally, the use and demonstrated utility of side information are not included in the experimental section.

=====