Paper ID: 744
Title: Simultaneous Safe Screening of Features and Samples in Doubly Sparse Modeling

Review #1
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes an approach to perform safe feature screening and safe sample screening during optimization. Essentially, it is a proposal to find the optimal feature subset and sample subset on which to perform learning, rather than using all the features and samples. The paper combines previously developed methods of safe feature screening and safe sample screening to develop a simultaneous safe screening model. Most of the contributions can be considered incremental work, but overall the paper proposes an important direction in safe screening. The paper gives a theoretical analysis of safe feature and safe sample screening and also provides sufficient experiments.

Clarity - Justification:
The paper is well written. Figures 2 and 3 could be a little bigger for easier reading, but this is not mandatory.

Significance - Justification:
The main contribution of the paper is to propose a method to perform safe feature screening and safe sample screening simultaneously. The proposed method is incremental work, but it is an important contribution to safe screening methodology.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper proposes a simultaneous safe feature screening and safe sample screening method. The proposed method is incremental work combining previous approaches that have been developed for safe screening. The paper is well written and organized, and it contains complete proofs of the methods that the authors propose. The paper could be further improved; some suggestions are listed below.

Though the safe keeping approach described in the paper is understandable, it would be more helpful to the reader if the proposed method were given in the format of an algorithm. Having the main algorithms (safe keeping) summarized in algorithmic format would help the reader in implementing the proposed method. What are the issues related to the implementation of safe keeping?

The overall clarity is very high. I found a few minor typos, as follows:
- Section 4, Theorem 5: "solutionx" should be "solution"
- Section 6.2, 3rd line: "letf" should be "left"
- Appendix A, Section A.3: "the the" should be "the"

=====
Review #2
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The seminal paper by El Ghaoui et al. (2012) introduced the so-called screening rules to reduce the computational complexity of some machine learning tasks with sparse solutions. The main idea was to use the KKT conditions to identify the inactive features or samples. This leads to a dimension reduction and to time savings, since one only has to optimize over the active variables. Many improvements have been proposed since then, and in this paper the authors propose to simultaneously screen inactive features (in the primal) and samples (in the dual). They illustrate the practical speed-ups on sparse SVM problems (e.g., the smoothed SVM and the epsilon-insensitive SVM) with elastic net regularization.

The main contributions of the paper are:
- Building safe sphere regions in the primal and dual spaces based on the duality gap (see Fercoq et al. 2015, Ndiaye et al. 2015), allowing both irrelevant features and samples to be discarded; a generic form of such a gap-based rule is sketched after this list. A reduction of the radius of the GAP spheres is also given. This can (potentially) lead to time savings, though this should be compared/quantified.
- Introduction of a "safe keeping" approach that makes it possible to detect variables that can never be removed by the safe rules. This leads to an interesting stopping criterion for the screening rules, allowing one to stop testing variables that can never be eliminated. This might also spare some calculations. Again, this should be compared/quantified.
- Safe regions (based on the monotonicity of the sub-gradient) for the SVM with an l1 regularization are also presented.
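For readers unfamiliar with these rules, here is the generic form of a gap-based feature test, written for the plain Lasso $\min_w \frac{1}{2}\|y - Xw\|_2^2 + \lambda\|w\|_1$ with a dual-feasible point $\theta$, where $P$ and $D$ denote the primal and dual objectives (this follows Fercoq et al. 2015; the elastic-net setting of the paper uses a different radius):

\[
r_{\mathrm{gap}} = \frac{\sqrt{2\,\bigl(P(w) - D(\theta)\bigr)}}{\lambda},
\qquad
|x_j^\top \theta| + r_{\mathrm{gap}}\,\|x_j\|_2 < 1
\;\Longrightarrow\;
\hat{w}_j = 0 .
\]

Any primal/dual feasible pair yields a valid sphere, and the smaller the duality gap, the more features the test eliminates; the analogous sphere in the primal space is what allows samples to be discarded.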
Clarity - Justification:
This is a well-written paper. The results are clearly stated and fully proved, and numerical simulations illustrate the effectiveness of the proposed method. However, the authors could clearly display in the theorems the hypotheses used to build the safe regions, since these change depending on whether the regions are built in the primal or in the dual space, and on the regularity of the data-fitting term. The hypotheses used to build the primal safe region are quite strong: the strong convexity of the data-fitting term does not hold if beta = 0. Hence the safe region used in the primal cannot be applied without the regularization term.

The practical efficiency of the safe keeping is not that convincing without further information. The only benefit I see is that it provides a stopping criterion for the screening procedure, which can be useful (it allows one to deactivate the screening when it starts to be useless); see the sketch at the end of this review.

Note that the safe regions (in either the primal or the dual space) proposed for the vanilla hinge loss are not converging in the sense introduced in Fercoq et al. 2015. This limitation could be mentioned.

The authors do not compare the method with existing screening rules for the SVM.

Significance - Justification:
This paper introduces two novel ideas that can be useful to other researchers: the simultaneous screening, and the safe keeping that allows an interesting stopping criterion. The contribution is new and interesting. The potential speed-ups also seem encouraging (a factor of 2 to 5 w.r.t. no-screening strategies).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- l234: I would recommend using $\lambda \psi(w) = \lambda \|w\|_1 + \beta \|w\|^2/2$ and keeping $\beta$ in all the rules/derivations. Indeed, it is a bit awkward to fix it to 1.
- l412: I think it is $\|X_{\mathcal{U}_s j}\|_2$ instead of $\|X_{:j \mathcal{U}_s}\|_2$.
- l415 (and other rules): Isn't it a strict inequality in the safe feature/sample screening rules? A short proof would be welcome.
- l474: I believe it should be $\sqrt{2\,\mathrm{gap}/\lambda}$ instead of $\sqrt{2\,\mathrm{gap}}$.
- l490: "solutionx" -> "solution".
- l530: "simiilar" -> "similar".
- In Fig. 2 and 7: "screeing" -> "screening".
- l605: the speed-up benefit of the "safe keeping" should be more detailed/illustrated in Fig. 2.
- l754: "letf" -> "left".
- l872: "radily" -> "readily".
- l1022: "the the dual optimal": remove one "the".
- l1032: same comment as before: it should be $X_{\mathcal{U}_s j}$ instead of $X_{:j \mathcal{U}_s}$.
- l1057: "the the primal optimal": remove one of the "the"s.
- l1055 and following: more explanation of the idea behind the proof of Theorem 6 would be a good help for the reader.
- l1237: I do not understand why there is $\frac{\mu}{n}$. Why did you divide by $n$?

Missing references: ideas similar to the ones used for the GAP safe screening rules have also been investigated, either in primal formulations:
Tyler Johnson and Carlos Guestrin. "Blitz: A Principled Meta-Algorithm for Scaling Sparse Optimization." ICML, 2015.
or in dual formulations:
Daniel Vainsencher, Han Liu, and Tong Zhang. "Local Smoothness in Variance Reduced Optimization." NIPS, 2015.
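To make the stopping-criterion point concrete, here is a minimal sketch (in Python, with hypothetical helper names; the actual tests are the paper's gap-sphere rules, not reproduced here) of how safe keeping could drive the deactivation of dynamic screening:

    # Minimal sketch of one dynamic-screening pass with "safe keeping".
    # `can_screen(j)` / `can_keep(j)` stand in for the paper's gap-sphere
    # tests (hypothetical callables, not the authors' actual API).
    def screening_pass(undecided, screened, kept, can_screen, can_keep):
        for j in list(undecided):
            if can_screen(j):        # provably inactive at the optimum
                screened.add(j)
                undecided.remove(j)
            elif can_keep(j):        # provably active: can never be screened,
                kept.add(j)          # so stop testing it in later passes
                undecided.remove(j)
        # Safe keeping enables this stopping criterion: once every variable
        # is either screened or kept, further screening passes are useless.
        return len(undecided) == 0

A solver would call such a pass every few epochs and deactivate screening as soon as it returns True, which is the deactivation criterion discussed above.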
=====
Review #3
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
Main contributions of the paper:
- an algorithm for simultaneous safe screening of features and "samples" in linear SVM-type methods for classification and regression. To this end, the paper considers an elastic net penalty (instead of the classical $\ell_2$-norm) for the parameter vector, together with smoothed large-margin loss functions
- safe keeping of active features and "samples"
- derivation of simultaneous screening schemes for the L1-SVM
- empirical evaluation of the benefit of the double screening

Clarity - Justification:
Most of the material in the paper is clearly exposed: the mathematical formulation of the screening problems, the main theorems defining the proposed screening schemes, and a clear empirical evaluation.

Significance - Justification:
Even though the paper is built upon existing screening ideas, the simultaneous double screening is sufficiently novel and may help in solving problems involving the hinge loss function and a feature-selection penalty.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- The paper proposes a methodology for simultaneous safe screening of features and "samples" in linear SVM-type methods with an elastic net sparsity-inducing penalty. To this end, smoothed large-margin loss functions are used instead of the classical hinge or SVR losses. Based on primal and dual sequences, the method screens out inactive features and samples and allows one to safely identify active features and samples. The paper also shows that combining the two screenings narrows the regions of the primal and dual parameters. As a side result, a similar screening procedure is given for the well-known LP-SVM.
- Empirical evaluations on high-dimensional, large-scale datasets show a convincing benefit of simultaneous screening over screening of features or samples alone.
- The notations $X_{:j \mathcal{U}}$ and $x_{:i \mathcal{U}_f}$ are unclear.
- The legends of Figure 2 are barely readable. The authors should find a better way to highlight their empirical evidence.
- What do the results of simultaneous screening look like for the LP-SVM problem? Beyond that, how does the proposed screening approach generalize to problems with an L1-type loss function and an L1-type penalty (such as the One-Class SVM with feature selection)?

=====