Paper ID: 759 Title: Black-box Optimization with a Politician Review #1 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): The paper provides a new method for solving smooth, unconstrained convex optimization problems. It basically combines gradient descent methods with analytic center methods. It provides a number of experiments to show superiority of this hybrid algorithm over state-of-the art general purpose methods. Clarity - Justification: The paper is very well written and self-contained. It is lightweight and I really enjoyed reading it. It contains all the necessary information without the unnecessary technical/mathematical nonsense that one usually finds in ICML papers. Significance - Justification: I liked the paper. It has a simple message. Combine standard gradient descent methods with standard cutting plane methods / analytic center methods and get the best out of both worlds. It performs best when the function is almost piecewise linear with smooth bends in between. (like smoothed hinge-loss type functions) Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The idea of the paper is quite simple and nice. However, a few minor caveat exist. It would be really nice to see a proof of convergence for the method, i.e., as suggested by Claim 1. changed after reading rebuttal: ( Also, the claim that CG, BFGS, and BFGS+ are optimal for quadratic functions, i.e., they output the minimal point in the span of all previous gradients is not true (Section 5.2.) This is only true for CG (if there are no round-off errors). This is not true for BFGS or BFGSB+, nor for any of the other known quasi-Newton type methods. Hence, what is meant by optimal method in Figure 1? ) True: For exact line search they are identical. I missed that. Sorry. Typos: - figure 1 -> Figure 1, for others as well - problem 1 -> Problem 1, for others as well - line 410: call -> calls ===== Review #2 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): This paper proposes a generalization of the classical "oracle" framework for optimization algorithms (first-order algorithms, in particular), discussed in (Nesterov, 2004). The new "politician" framework incorporates all past points, function values, and gradients (i.e. first-order information) that has been used, whereas an oracle only considers the current point. The politician framework is particularly well suited for objectives where computing the gradient is expensive. The paper presents an example method, based on a recent first-order geometric method (Bubeck et al., 2015), using the paper's new "politician" framework. The authors implemented (or used existing implementations of) a variety of classic first-order methods. Some methods were completely standard, and some additionally used the "politician" framework. The paper presents numerical examples showing favorable characteristics of the example methods that used politicians instead of just oracles. Clarity - Justification: At a coarse level, I think the paper is well organized. However, there are some clarity issues. There are some details (see Detailed Comments) that are missing that I believe would impact an expert attempting to reproduce the results in this paper. The paper uses a very slight abuse of notation (which the paper mentions), which I feel adds to the clarity of the paper (otherwise too much notation!). Significance - Justification: The "politician" framework appears to be a natural generalization of the "oracle" framework, and this paper appears to be novel work. The new framework allows one to use the first-order method separately from the use of any "historical information" that one can use to modify the updated point coming from the first-order method. This may prove as a useful framework for combining first-order methods with other optimization methods. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The paper presents interesting ideas, and if most of the issues mentioned below can be fixed, I'd be supportive of accepting. 44-45: I think the abuse of notation is acceptable. It's used consistently. 130: Should the R^m be R^n? You describe x in R^n but then apply T(x). 134: I'm not sure that the notation "i \in [k]" was defined, although I think I understand what is meant. 188-201: It took me a minute to "digest" the definition of the balls and their intersection. Perhaps some motivation or intuition would assist in reader's understanding? 205-207: I don't know if this observation is totally correct. We want the radius of B(z,alpha,fval) to be <= ||1/alpha gradf(z)||, which occurs if fval <= f(z). So is this claim true for all z or just for z = y_i, because fval = min_{y_i} f(y_i)? Probably a minor detail, and perhaps I'm wrong! 245: Does "past queries" refer to the y_i, or the previous queries from the first-order method? I presume it's the former, as I think the latter would violate the definition of a politician. 248: Perhaps it is worth emphasizing that B(y_i, alpha, fval) depends also on gradf(y_i)? 250-251: I don't know an algorithm to compute this efficiently; is this a standard result? Seems plausible, but I'd like a reference or brief explanation (or both!). How does this affect the computational complexity? 254: Does the line search go over the whole line or just the line segment between first-order query and volumetric center? 307: I did not verify Prop 2. Can the authors present some evidence (perhaps numerical) demonstrating that the gradient and Hessian are correct? Is there a standard reference for this computation? See 338-341 as well. Perhaps the computation could be shown in an appendix? 321: What's the matrix D? D = diag(d)? I didn't see D defined anywhere. 338-341: Is this a classic result? I can see it for the analytic center, but not for the volumetric center. Is there a reference for the implementation of this or a similar problem? See 307 as well. 348: Section 5 of the paper used "geometric oracle" often instead of "geometric politician" (e.g. 392, 394). If this was intentional, the intention was lost on me. 396: There are variants of Nesterov’s method that do keep track of all past gradients. See, e.g., P. Tseng “On Accelerated Proximal Gradient Method…” In a related note, "how much history" do each of these methods use? 434: I don't think the paper defined the line_search functions (other than it's an exact line search), although the reader will most likely infer their meaning and syntax. 440: Alg 3 and 4 both appear to use the politician Phi_f. Are these algorithms for GK+ and BFGS+? Maybe this Phi_f is the geometric _oracle_. I'm confused by its presence here. 491: Is this claim a direct extension of the result for Nesterov's difficult function, or is it just obvious? 532: (4) should have b_i - a_i^Tx? 566: In Figure 1, PCG and Nes appear with the same line color and style (same with SD and Geo, GK and GK+, TFOCS and optimal). I tried two different PDF viewers and got the same result. Figure 2 appears fine. [Could you also add a few symbols to each line?] 579: Perhaps point the reader to Dolan and Mor\'e "Benchmarking optimization software with performance profiles"? 598: You used t=10^-4 for "non-smooth" examples. As I understand it, these "non-smooth" examples are technically smooth, but perhaps with high condition number (i.e. "kappa = beta/alpha" in Bubeck et al., 2015). The paper mentions that the politician framework is particularly well suited to cases where the gradient is expensive. You ran some timing experiments; can you briefly discuss the proportion of time spent computing gradients versus the time the politician spends? 225: Start new paragraph here. First page, 2nd column: a similar type of idea for improvements is when approximately solving a quasi-Newton method. Another particular example is S. Laue, “A Hybrid Algorithm for Convex Semidefinite Optimization”, ICML 2012. 410: “call” to “calls”. Section 5.3 and related figure: “worst” sounds odd, maybe “adversarial” is better ===== Review #3 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): Authors study strongly convex optimization, in a setting in which the algorithm relies on the combination between any first order method and a "politician", which uses the history ( the previous iterates and gradients) to improve the current iterate by performing a line search between the current iterate and a center of a feasible set. They describe a conditions under which this is computationally advantageous, as computing the gradients is more expensive than the extra cost due to politician, and provide an example of such a politician (the geometric politician) which satisfies the affine invariant condition. Intuition is supported by experiments, comparing convergence of a sample of first order methods, with or without politician, but authors do not give theoretical convergence rate. Clarity - Justification: Overall, the paper is well written and intuitive insights are provided. There are few typos and experiments are well detailed, as well as the following discussion. Minor comments. The choice of the name "politician" might be a bit misleading, as the politician indeed "changes the question and answer a new query", but it gives a better answer that the expected one, which does not seem to be the average situation in politics. It seems it would rather be a historian, even if clearly less funny. In definition 2, line 132, shouldn't x be in \R^m? Line 312, definition of d is unclear. Significance - Justification: Even with convincing experimental results, the paper does not provide convergence result, exception made of a claim in the discussion section, which gives a theoretical result for SD+, but with a different center, and without proof neither in the paper nor in supplementary material. The combination of the first order algorithm and the geometric approach is definitively interesting but most possible theoretical results are stated as open problems. Topic seems definitively promising but lacks theoretical results in current state. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): As stated above, the paper is well written and interesting but lacks theoretical elements. Giving more intuition on the meaning of an affine invariant politician could be an improvement. Similarly, more explanations on proposition 2 and and the differences of convergence rates between analytic and volumetric center would be helpful. =====