= Reviewer 4$ "By subtracting an appropriate function from the loss, the authors rewrite the objective function as a sum of a new smooth loss function and the corresponding nonconvex regularizer." -There are some misunderstandings by the reviewer: 1) a function is *added* to the loss 2) the new objective is a sum of the transformed loss and transformed *convex* regularizer "surprising that this has not been suggested previously" -As far as we know, this has never been mentioned/suggested, and we are the first to propose this. -The most closely related is "difference of convex" procedure, which is still very different from the proposed method (sec3.1). "Prop 3.3 is fairly straightforward to prove" -We do not claim that it is difficult to prove. Instead, the main novelty is on making the observation in proposition 3.1, which then leads to the proposed transformation and proposition 3.3. "remaining sections simply demonstrate applications of the nonconvexity redistribution method" -The paper will be incomplete without demonstrating how the proposed transformation can be used in various nonlinear optimization scenarios. -Besides demonstration, in sec3.3, we also modified the FW algorithm and provided a convergence proof for this nonconvex setting. -In sec3.4, we explained why the proposed method is superior to a recent NIPS15 paper. "overall idea of nonconvexity redistribution seems very simple and natural" -We thank the reviewer for pointing out this important advantage. "ideas should be developed a bit further" -This paper is on solving nonconvex problems of the form (1). As in other machine learning optimization papers, we (i) showed how this can done and (ii) provided convergence guarantees. The next step would be to study convergence rate. However, this is an open issue for nonconvex optimization in general. = Reviewer 5 "why this particular choice" -Recall that our main idea is to reuse existing convex solvers. 1) Sec3.2: For proximal algorithms to be efficient, the proximal step needs to be easily computed (in closed form). This is possible only when g is a mixture of norms, and consequently g_i has to be of the form (3). This rules out the reviewer's proposal. 2) Sec3.3: For the key step (9) in FW algorithm to be efficient, it should involve only rank-1 SVD. This is possible only for the nuclear norm, which again requires g_i be of form (3). 3) Sec3.4: The efficient mOWL-QN solver can only handle l_1 regularizer, and again, requires g_i be of the form (3). We will mention these in the final version. "nonconvexity of F may be more pronounced than that of f... How come your algorithm is never hit by that and it generally outperforms?" -We are not aware of theoretical results comparing nonconvexities of different functions -From a theoretical point of view, it is more important to guarantee that the transformed problem has the same critical points as the original problem. We provide this guarantee in Proposition 3.3. -From a practical point of view, we do NOT claim that we can get better prediction performance (empirically, the proposed method is always at least as good as state-of-the-art nonconvex solvers). The key advantage of the proposed method is that it is much faster. = Reviewer 6 "(Condition A3) is assumed...title/abstract/introduction should be properly modified" -These will be modified in the final version. 
"proposed methodology is not really general...for each nonconvex regularizer, an optimization procedure is hand­crafted" -The proposal is on convexifying nonconvex regularizers. This procedure is general. However, after this transformation, one still has to select an efficient solver for that particular convexified regularizer. In sec3.2, we pick the algorithm in [Yuan et al, 2011]; in sec3.3, we pick the FW algorithm. -To further illustrate its generality, we list some more convexified regularizers and recommended solver: lasso/elastic net: mOWL-QN; overlapping group lasso: proximal gradient [Yuan et al, NIPS-11]; fused lasso: proximal gradient [Liu et al, KDD-10] or ADMM [Yang et al, KDD-13]; network lasso: ADMM [Hallac et al, KDD-15]. More will be listed in the final version. -As mentioned in sec2, the proposed method can be used with any smooth (loss) f and (regularizer) g satisfying condition A3. Note that this covers nonconvex formulations (based on GP,LSP,MCP,Laplace,SCAD) of most regularizers (lasso,sparse group lasso,overlapping group lasso,fused lasso,graphical lasso,low-rank,etc). "Sec3.3, authors said this is an example, but the speeding­up techniques described in this section are quite specific to this example" -Sec3.3.2 is only for matrix completion (as mentioned in the 1st sentence there, and is also used by all baselines in experiment). However, one can use any f in the rest of sec3.3 (e.g., multiclass learning: f(X)=\phi(Y-DX); regression: f(X)=|Y-DX|F_^2; link prediction, etc)