Paper ID: 1190
Title: Efficient Learning with a Family of Nonconvex Regularizers by Redistributing Nonconvexity

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a new way to handle nonconvex regularizers and demonstrates its usefulness through numerical experiments.

Clarity - Justification:
This paper is overall clearly written and easy to follow, although the title/abstract/introduction sound somewhat more ambitious than what is actually proposed in the paper (see the detailed comments below).

Significance - Justification:
The use of nonconvex regularizers has been gathering attention recently, and developing efficient optimization procedures for them is a significant topic for ICML. However, I have a reservation regarding the generality of the approach.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The key idea is to move a nonconvex component of the regularizer into the loss function so that the regularizer becomes convex, while the transformed problem shares the same critical points as the original one. Standard methods are then used to solve the transformed problem. The Frank-Wolfe algorithm is also extended to handle optimization problems with a nonconvex loss and a convex regularizer.

This paper is overall well written and interesting. From the title/abstract/introduction, I had the impression that the proposed method could handle arbitrary nonconvex regularizers. However, at the beginning of Section 2, a particular form (Condition A3) is assumed. I understand that this form includes many previously proposed regularizers as special cases, but the title/abstract/introduction should be revised to make the restriction clear. Indeed, the proposed methodology is not fully general; rather, for each nonconvex regularizer an optimization procedure is hand-crafted in Sections 3.2 and 3.3 (in Section 3.3 the authors say this is only an example, but the speed-up techniques described there are quite specific to it). For this reason, although the paper potentially makes an interesting contribution, I have kept my rating rather moderate. Any feedback is appreciated during the rebuttal phase.

Minor comments:
- "In this Section" --> "In this section"
- Hyphenation is broken in several places: "algorith-m", "section-s", "s-mooth", "s-parse", "recen-t", etc.

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a method for optimizing composite objective functions involving a smooth "loss" component and a nonconvex, nonsmooth "regularizer" component. The assumption on the regularizer is that the difference between the regularization function and the norm of a linear transformation of the argument is smooth; this is satisfied, for instance, by popular nonconvex regularizers such as SCAD, MCP, LSP, and GP. By subtracting an appropriate function from the loss, the authors rewrite the objective function as a sum of a new smooth loss function and a corresponding convex regularizer. They then show that the new objective has the same critical points as the original (with respect to a decomposition as a difference of convex functions), and develop proximal and Frank-Wolfe-type optimization algorithms for obtaining critical points of the new objective.
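For concreteness, here is how I read the construction on a toy instance (my own sketch, not code from the paper: I take the linear transformation to be the identity, the norm to be the l1-norm, and use the LSP penalty; the toy data, step size, and iteration count are arbitrary choices of mine):

    import numpy as np

    # Toy problem: least-squares loss with the LSP penalty,
    #   min_x  f(x) + r(x),   f(x) = 0.5*||Ax - b||^2,
    #   r(x) = lam * sum_i log(1 + |x_i|/theta)          (nonconvex, nonsmooth).
    # Redistribution, as I understand it: r(x) - kappa*||x||_1 is smooth for
    # kappa = lam/theta, so that remainder is moved into the loss,
    #   F(x) = f(x) + r(x) - kappa*||x||_1               (nonconvex, smooth),
    #   G(x) = kappa*||x||_1                             (convex, nonsmooth),
    # and each proximal-gradient step on F + G only needs soft-thresholding.

    rng = np.random.default_rng(0)
    n, d = 50, 20
    A, b = rng.standard_normal((n, d)), rng.standard_normal(n)
    lam, theta = 0.1, 0.5
    kappa = lam / theta          # slope of the LSP penalty at the origin

    def grad_F(x):
        g_loss = A.T @ (A @ x - b)                        # gradient of f
        g_rem = -lam * x / (theta * (theta + np.abs(x)))  # gradient of r - kappa*||.||_1
        return g_loss + g_rem

    def prox_l1(v, t):
        # proximal operator of t*||.||_1, i.e. soft-thresholding
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    # Plain proximal-gradient iterations; the step size only accounts for the
    # least-squares part and is not tuned.
    x, eta = np.zeros(d), 1.0 / np.linalg.norm(A, 2) ** 2
    for _ in range(200):
        x = prox_l1(x - eta * grad_F(x), eta * kappa)

The benefit of the rewriting is that the nonsmooth piece handed to the proximal step is the convex kappa*||x||_1, whose proximal operator is available in closed form.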
Clarity - Justification:
The ideas in the paper are presented quite clearly, and it is easy to understand the problem setting and the new method proposed by the authors. The background and references are also reasonably thorough.

Significance - Justification:
Although the details of the mathematical derivations appear correct and rigorous, the overall idea of "nonconvexity redistribution" seems very simple and natural. It is surprising that this has not been suggested previously in the literature on nonconvex regularizers -- perhaps it has, but not in full generality? The key theoretical result seems to be Proposition 3.3, which is fairly straightforward to prove; beyond that, the remaining sections simply demonstrate applications of the nonconvexity-redistribution method.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is well written, but the ideas should be developed a bit further. Currently, the content of the paper does not seem substantial enough to constitute an ICML paper.

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper targets a class of machine learning problems that minimize an objective of the form f(x) + g(x), where
- f is nonconvex and smooth, and
- g is nonconvex and nonsmooth.
The paper presents a scheme to rewrite (a certain subset of) problems of this form equivalently as F(x) + G(x), where
- F is nonconvex and smooth, and
- G is convex and nonsmooth.
Note that the only difference is that the regularizer is turned from nonconvex (g) into convex (G). This reformulation allows proximal optimization and Frank-Wolfe algorithms to be applied to minimize F + G, because the proximal operator associated with G can be computed (efficiently) thanks to the convexity of G; a similar property allows G to be used by the Frank-Wolfe method. The authors show that some well-known regularization schemes (Table 1) are amenable to transformation into the form F + G, and are thus applicable to proximal algorithms. The paper experiments with a few applications involving such regularization schemes, namely:
- Nonconvex Sparse Group Lasso,
- Nonconvex Tree-Structured Group Lasso, and
- Nonconvex Low-Rank Matrix Completion.
The empirical results in the paper show speedups compared to the state of the art.

Clarity - Justification:
The paper is well written and well organized, the notation is consistent, and the experiments are accompanied by visual graphs.

Significance - Justification:
The problem addressed in the paper is of great significance. Many machine learning problems involve minimizing a nonconvex cost function, which is generally more challenging than convex optimization. In particular, nonconvex problems may suffer from slow convergence more often than convex problems (e.g., due to the prevalence of saddle points in high dimensions). Thus, developing algorithms that can speed up nonconvex optimization is crucial, and this paper studies a particular setting in that direction. Specifically, the paper introduces a scheme to transform certain cost functions into an equivalent optimization problem that is amenable to more efficient optimization algorithms, such as proximal methods and the Frank-Wolfe algorithm.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, I find this paper very interesting and the empirical results are encouraging. However, I have two concerns which I hope the authors can clarify.

1. CHOICE OF THE DECOMPOSITION: As the authors also state in the paper, transforming f + g into F + G is not unique, and the paper presents one particular way to achieve this transformation, as prescribed in Section 3.1. My question is: why is this particular choice made? Is there a specific advantage of your proposal over other possibilities for obtaining F + G? For example, I can come up with an alternative way to make g convex, e.g. F2 + G2, where F2 = f - c||x||^2 and G2 = g + c||x||^2 for some large enough positive c.
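To spell out the two candidates on the same kind of toy instance (again my own sketch, not from the paper: split (i) is the paper's construction as I read Section 3.1, taking the linear transformation to be the identity and the norm to be the l1-norm; split (ii) is the alternative above; the LSP penalty and the constant c are arbitrary illustrative choices):

    import numpy as np

    # Toy objective:
    #   f(x) = 0.5*||Ax - b||^2,  g(x) = lam * sum_i log(1 + |x_i|/theta)   (LSP).
    # Split (i), the paper's scheme as I read it:
    #   F(x)  = f(x) + g(x) - kappa*||x||_1,  G(x)  = kappa*||x||_1,  kappa = lam/theta.
    # Split (ii), the alternative raised above:
    #   F2(x) = f(x) - c*||x||^2,             G2(x) = g(x) + c*||x||^2.
    # Both reproduce f + g exactly; they differ only in which piece the
    # proximal (or Frank-Wolfe) step has to handle.

    rng = np.random.default_rng(1)
    n, d = 30, 10
    A, b = rng.standard_normal((n, d)), rng.standard_normal(n)
    lam, theta, c = 0.1, 0.5, 5.0
    kappa = lam / theta

    f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
    g = lambda x: lam * np.sum(np.log1p(np.abs(x) / theta))

    F = lambda x: f(x) + g(x) - kappa * np.sum(np.abs(x))   # smooth, nonconvex
    G = lambda x: kappa * np.sum(np.abs(x))                 # convex, nonsmooth
    F2 = lambda x: f(x) - c * np.sum(x ** 2)                # smooth, nonconvex
    G2 = lambda x: g(x) + c * np.sum(x ** 2)                # convex for large enough c

    x = rng.standard_normal(d)
    assert np.isclose(F(x) + G(x), f(x) + g(x))
    assert np.isclose(F2(x) + G2(x), f(x) + g(x))

Since both splits give back the original objective, my question is really about which split the downstream proximal/Frank-Wolfe machinery prefers -- for instance, whether the proximal operator of G2 remains as easy to compute as that of G.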
2. THE NONCONVEXITY OF F MAY BE MORE PRONOUNCED THAN THAT OF f. DOES THAT CREATE A PROBLEM FOR YOUR METHOD? As stated by the authors, the proposed scheme moves nonconvexity from g into f, so the nonconvexity of F may be much more severe than that of the original f. This may not always play in favor of your algorithm because, after all, you use a local optimization technique (e.g., you compute the gradient of F) to update the minimizer. The more nonconvex F is, the more problems it creates for local methods such as gradient descent. How is it that your algorithm is never hurt by this and generally outperforms the state of the art? Any thoughts or reasoning about this would be illuminating.

=====