Paper ID: 370
Title: Primal-Dual Rates and Certificates

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper provides a general method, along with approximation guarantees, for obtaining an approximate primal solution from an approximate dual solution of regularized empirical risk minimization problems.

Clarity - Justification:
The paper is clearly written and self-contained.

Significance - Justification:
Convex optimization problems of the form min_\alpha f(A\alpha) + g(\alpha) are considered, where g(\alpha) is the regularization term and f(A\alpha) is the loss term; many empirical risk minimization problems fit this form (the primal-dual setup is sketched at the end of this review). The results are basically the following. When the optimization algorithm has a linear convergence rate and g is either strongly convex or has bounded support, this translates into a linear convergence rate for the duality gap. When the objective function is Lipschitz continuous, an O(epsilon)-approximate solution to the dual translates into an O(sqrt(epsilon))-approximate solution for the duality gap. In some cases it can even be shown that an O(epsilon)-approximate solution to the dual translates into an O(epsilon)-approximate solution for the duality gap. When the Fenchel conjugate g* of the regularization term g is not Lipschitz, the authors provide a method for dealing with that case as well. I find this contribution very useful. The whole idea is based on standard duality theory.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I find the idea and the results quite appealing. It makes sense and is useful. Often, ERM problems are solved in the dual, providing convergence guarantees only for the dual solution, not for the duality gap. Here, a generic framework is given for providing guarantees for the duality gap.

A few typos need to be corrected:
- in general: "e.g." -> ", e.g.,"
- line 106: -> "without the necessity of computing an average over the last samples"
- line 363: -> "for the duality gap"
- line 382: -> "to the dual of an SVM"
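For reference, the primal-dual setup referred to above, written in the standard Fenchel duality form (the paper's exact notation and sign conventions may differ slightly):

  primal (A):    O_A(\alpha) = f(A\alpha) + g(\alpha)
  dual (B):      O_B(w)      = f^*(w) + g^*(-A^T w)
  duality gap:   G(\alpha, w) = O_A(\alpha) + O_B(w) >= 0,

with the candidate dual point taken as w(\alpha) = \nabla f(A\alpha) (the usual choice when f is smooth). By weak duality, G(\alpha, w(\alpha)) upper-bounds both the primal and the dual suboptimality, which is why a convergence rate for the gap certifies a rate for the objective.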
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes using a duality gap to measure convergence of first-order methods for some standard problems, with a particular interest in coordinate descent algorithms. The authors use the standard duality gap, but under the assumption that only a primal (or dual) algorithm is run, they generate the corresponding dual (or primal) point used in the gap. This is not a new idea (the authors do not claim it is, but should give more credit); rather, their main contribution is that under certain settings (f smooth and g Lipschitz) the gap is finite, and furthermore this bound on the duality gap converges at the correct rate, so that it is reasonably sharp. Related to this, the authors show a simple technique to make g Lipschitz if one can estimate a ball containing all the iterates.

Clarity - Justification:
The paper is mainly well written, and most parts are locally reasonable to understand, though there are a lot of theorems/lemmas and, perhaps due to the numbering scheme, they do not seem that organized. There is a clear structure, but it is a hard paper to fully comprehend.

Significance - Justification:
I think this is a good contribution, as it is a duality gap that actually should reflect performance, and it can be used in many situations. It can be agnostic to the algorithm.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is strong and I like the idea. The analysis requires that f^* is strongly convex, which should be stated early in the paper; this is an assumption, but a mild one, since many losses f have Lipschitz continuous gradients. In particular, at line 161 the authors point out that SDCA and its extensions require g to be strongly convex, but given the ability to switch perspectives from (A) to (B) (lines 210, 212), this is no different from requiring f^* to be strongly convex, which the current paper requires. You could say that SDCA requires smoothness and strong convexity.

I have seen other papers that use duality gap approximations as a tool to certify optimality, but with no results showing that the gap itself must go to zero, so I am pleased that this paper shows the relevant results. The experiments show convergence of the duality gap, but it would be nice to also show the true error in the objective, so we can see whether they converge at the same rate.

Minor comments:
- Line 120, "depending" should be "dependent".
- Line 184, "- By" is ungrammatical.
- Line 188, another approach here is to use the proximal point algorithm to systematically reduce the effect of the L2 term.
- Table 1 seems to have issues, mainly with the loss. I do not see why you define l_alpha and l_w separately, since if you just define l and then evaluate it at either alpha or A^T w, you have the appropriate form. I would change notation and give the Loss table only one column. Also, the Loss sub-table has a logistic loss. In the left column the summation is j = 1, ..., n, and I think the "n" should be "d" (that difference is the only reason you might want to keep two columns). Furthermore, if you use the logistic regression loss (to be concrete, for Elastic Net (A), f = l_alpha), then f^* in the upper table (Elastic Net (B)) is not valid; it seems the duals of the losses were derived assuming only a least-squares loss.
- Equations (1a)(1b)(2a)(2b) are bizarre; you claim these are derived by applying standard duality results twice (once to the primal, once to the dual), which I believe but think is silly. I think you are confusing things. Equations (1a) and (1b) are redundant, as are (2a) and (2b), because it is well-known that \partial f = (\partial f^*)^{-1} for f in \Gamma_0 (see the identity spelled out at the end of this review). I am not sure what you are getting at here. If you want to keep both, you could write (1a) and (2a) and then comment that (1b) and (2b) follow.
- Footnote 3, line 273: if your function is not lsc, I would think you need a "sup", not a "max", here. Furthermore, in Assumption 1, could you not assume that g is globally lsc and change cont(g) to dom(g), which is more standard? [If g is not lsc, then the minimization problem may not be well defined.]
- Line 549, "save" should be "safe".
- The references should capitalize "SDCA" and "SVM" in the titles.
- The appendix cites about four books for facts that are all contained in any one of those books, e.g., Bauschke and Combettes.
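The identity behind the redundancy remark above is the standard conjugate subgradient theorem: for f in \Gamma_0 (proper, convex, lsc),

  w \in \partial f(x)   <=>   x \in \partial f^*(w)   <=>   f(x) + f^*(w) = <w, x>,

so the primal and dual forms of the optimality conditions carry the same information, and one member of each pair of equations follows from the other in a single line.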
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors consider composite convex optimization problems under strong duality assumptions. The main question is to obtain primal-dual guarantees, i.e., estimates of the number of iterations of a given algorithm required to obtain a (duality-gap) suboptimality certificate below a given threshold. First, the authors work with strongly convex or bounded-support regularizers. In this situation, they show how classical convergence rate estimates translate into primal-dual estimates. Then the authors show that for general convex regularizers it is possible to build a primal-dual pair of problems such that classical convergence rate estimates also translate into (modified) primal-dual estimates. They detail the application of this idea to norm-regularized problems. Finally, the authors propose a new analysis of the coordinate descent method when the regularizer is separable with strongly convex or bounded-support components.

Clarity - Justification:
Overall, the writing of the paper is acceptable. Major remarks/questions:
- The assumption on f should be presented in a fair way.
- Section 5.1 should be improved.
- Why is there a specific analysis for coordinate descent?

Significance - Justification:
I think that the paper contains interesting ideas and is relevant for the machine learning community, specifically on the application side. The topic is relevant from a practical point of view, and the way the authors address the question contains a fair amount of novelty.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

~~~ Assumptions on f
The first three pages of the paper state that f is an arbitrary convex function, while at the end of Section 3 and for the rest of the paper it is taken to be smooth. I do not think this is a fair presentation.

~~~ Lemma 1
The case \mu = 0 is not correctly handled in Lemma 1. Take an example in dimension 1:
- A = 1
- f: x -> x^2/2
- g: x -> |x|
- f* = f
- g*(z) = 0 if |z| <= 1, and +infinity otherwise.
Take x = 2. The expression in (3) is not dual feasible. Hence inequality (5) does not make sense and u is not well defined. The reason the proof does not work is the use of the Fenchel inequality in its equality case. That equality holds only if the subdifferential is non-empty (we need to pick one of its elements), and the subdifferential of a proper convex function is empty outside the domain of the function. Hence the application of the Fenchel inequality in (32) is not valid: it should first be checked that A^T w(\alpha) is in the domain of g* (or that the subdifferential is non-empty). Everything works fine when the domain of g* is the whole space, i.e., when g is strongly convex or has bounded domain. This remark also applies to Lemma 14. Furthermore, the proof of Lemma 1 could be more detailed.
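To make the objection fully explicit, take the candidate dual point to be w(x) = \nabla f(Ax) (the standard choice for smooth f, which I take (3) to amount to). Then, for this example,

  w(2) = \nabla f(A * 2) = 2,    A^T w(2) = 2 \notin dom(g*) = [-1, 1],

so g*(A^T w(2)) = +infinity: the candidate dual point is infeasible and the duality gap is infinite. Moreover, the equality case of the Fenchel inequality, g(x) + g*(u) = u x <=> u \in \partial g(x), cannot be invoked with u = 2, since \partial g(2) = {1}.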
~~~ Section 5.1
Section 5.1 can be improved. The link with the original problem should be made more specific. The chain "original primal-dual structure" => "bound on the sequence" => "new primal-dual structure" => "bound on the sub-optimality of the original problem" does not come through clearly enough in my opinion. In particular, it is not stated anywhere that computing the duality gap requires solving a (potentially expensive) problem. Theorem 6 is a bit vague. Applying Theorems 3 and 4 only provides convergence results for a "new" problem, which is different from the original one: the primal-dual structure is not the same, and the structure of the optimal set is also different. There is no discussion of how to use it in practice, how it relates to the original problem, or how to compute the duality gap, apart from the discussion of norm-regularized problems. Corollary 7 is not meaningful. What does it mean to apply Theorem 6 with L = B_t? There is no explicit description of how this translates into convergence rates, either for the original problem or for the "sequence" of new problems. There is no control on the value of B_t, and therefore the corresponding rate could be arbitrarily bad.

~~~ Norm regularized problems
How do the authors compute the duality gap in Section 5.2? What is the computational cost for the examples described in this section? I am not sure that the elastic net is best placed in Section 5, since it is a strongly convex regularizer, which is treated in Section 4.

~~~ Section 6
I do not understand the interest of a specific analysis for coordinate descent. Are the assumptions the same as those of Section 4? Isn't it possible to simply apply the results of Section 4 to one of the existing analyses of SDCA? Does the new analysis provide better convergence results?

~~~ Minor comments
- Harmonize the problem form (g(\alpha) at the beginning).
- The notation in Table 1 is not very explicit.
- g is sometimes \gamma-strongly convex and sometimes \mu-strongly convex.
- The notation $D(t) \sim O(t)$ is non-standard and should be made explicit.
- In Section 4.1.2, g* is sometimes Lipschitz and sometimes Lipschitz on its domain (Theorem 3), which is wrong.
- Proof of Lemma 14: some superscripts and subscripts disappear in the middle of the proof, which makes it unreadable.

=====