Paper ID: 727
Title: A ranking approach to global optimization

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): Please see detailed comments.

Clarity - Justification: Please see detailed comments.

Significance - Justification: Please see detailed comments.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper deals with the problem of globally maximizing an unknown real-valued function using a small number of queries for the value of the function. The problem is turned into an active ranking problem that tries to rank each input vector by the (hidden) function value using a small number of queries. Then, if the ranking structure of the unknown function is known, a selective-sampling-like approach is used to solve the ranking problem; if the ranking structure is unknown, a structural-risk-minimization-like approach combined with exploration is used to adaptively estimate the ranking structure. The adaptive approach is shown to outperform other optimization algorithms on synthetic datasets.

The theoretical development in the paper is very solid, and it is very interesting to look at the optimization problem from a ranking perspective. There are several questions:

* It seems that the step of keeping the consistent rankers (R = { … }) is not always easy to implement. It is not clear whether this step restricts the practical use of the proposed algorithm.

* There are many interesting optimization problems in ML that could use global optimization---for instance, optimizing weights in deep learning. Can the proposed approach be used for such problems?

* The practical problem of deciding the ranking structure {R_N} seems to be unaddressed in this work. More guidelines and more details about how the experiments are done would be helpful.

* The proposed approach seems to randomly pick one input vector to reduce the set of consistent rankers.
In active learning, the set reduction can usually be done more aggressively by picking the input vector more strategically. Have the authors considered this possibility?

* In Condition 1, the \mu function is not defined.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This work proposes a global optimization algorithm whose goal is to find the best possible solution using only a limited number of function evaluations. It achieves this by introducing the notion of the ranking of a function: all cost functions that differ only by an increasing transformation are defined to have the same ranking. Given a set of ranking rules that contains the ranking of the underlying cost function, the algorithm uses new samples to refine this set toward the true ranking rule, and also uses elements of this set to guide the choice of the next sample point. The authors prove the convergence of the algorithm and give a bound on the number of iterations needed to identify the solution. Limited experiments on synthetic problems are presented.

Clarity - Justification:
The definition of a ranking rule is at the heart of the paper, but unfortunately it is not very clear to me. The first definition of a ranking rule (called the induced ranking rule) is presented in line 170 and has a discrete range (points are mapped to either -1, 0, or +1). In line 205, the ranking rule for continuous functions is defined using the "induced ranking rule", as stated in line 210, so it must have a discrete range as well. Then how can you define polynomial ranking rules for such functions, mapping R^n to R? The range of polynomials is not discrete! The definition of a continuous ranking rule is also unclear because it uses the very notion it is trying to define: "Let f be a real valued function.
We say that f has a CONTINUOUS RANKING RULE if rf is in R, where R denotes the set of CONTINUOUS RANKING RULES." From this I still do not understand what the form of a continuous ranking rule is.

Significance - Justification:
Although the problem of nonconvex optimization is very relevant to the learning community, I am not sure that is the case for the particular result presented in this paper. See detailed comments.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
My concern about the paper is that all presented algorithms involve a search over a set of ranking rules (e.g., Line 324 or Line 488). How is this step of the algorithm actually implemented? This seems like a huge search, and it appears to be computationally intractable: the cardinality of the set of ranking rules associated with a set of functions seems to grow exponentially in the dimensionality of those functions. If that is the case, what would the practical gain of the algorithm be? Given the unclear computational cost of the search steps mentioned above, and the lack of reported running times in the limited set of experiments, I doubt the idea in the paper could be adopted as an alternative scheme for nonconvex optimization.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper is a novel take on (nonconvex and possibly even discontinuous) optimization, where the idea is to focus on induced rankings of points in the domain. In this setting, the algorithm iteratively evaluates the objective function at new points, which allows it to prune out rankings that are inconsistent with the function evaluations performed so far and to better focus its future evaluations on points that have higher rank according to a running set of consistent rankings.
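The prune-and-sample loop summarized above might look roughly like the following. This is a minimal illustrative sketch, not the authors' actual algorithm: the finite candidate grid, the voting heuristic for choosing the next point, and the explicit enumeration of ranking rules are all assumptions made for illustration.

```python
import random

def induced_ranking(g):
    """Ranking rule induced by a scalar function g: sign of g(x) - g(x')."""
    def rule(x, xp):
        d = g(x) - g(xp)
        return (d > 0) - (d < 0)
    return rule

def rank_based_optimize(f, candidates, rules, n_evals, seed=0):
    """Illustrative prune-and-sample loop: discard ranking rules that
    contradict observed evaluations, and let the survivors vote on the
    next point to evaluate."""
    rng = random.Random(seed)
    observed = {}  # point -> f(point)
    for _ in range(n_evals):
        pool = [x for x in candidates if x not in observed]
        if not pool:
            break
        if observed:
            # Hypothetical heuristic: count how many surviving rules
            # weakly prefer x over every point evaluated so far.
            def votes(x):
                return sum(all(r(x, xo) >= 0 for xo in observed)
                           for r in rules)
            x = max(pool, key=votes)
        else:
            x = rng.choice(pool)
        observed[x] = f(x)
        # Prune rules inconsistent with any observed pairwise ordering.
        true_rule = induced_ranking(lambda z: observed[z])
        pts = list(observed)
        rules = [r for r in rules
                 if all(r(a, b) == true_rule(a, b)
                        for a in pts for b in pts)]
    best = max(observed, key=observed.get)
    return best, observed[best]
```

For instance, with candidates on an integer grid and rules induced by a family of concave quadratics, the loop discards every rule whose pairwise orderings contradict the observed values and steers later evaluations toward points the surviving rules rank highly.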
This is primarily a theoretical paper (with some small experiments at the end) whose goal is to prove that, under certain regularity assumptions, the proposed algorithms are guaranteed to converge to the optimal point. In a sense, what ends up being proposed is a meta-algorithm, since one must work out specific details for particular classes of "ranking rules"; the authors show interesting examples for polynomial ranking rules and union-of-convex ranking rules.

Clarity - Justification: Overall this is a quite well written paper.

Significance - Justification:
From a practical standpoint, it's difficult to say whether the proposed approach would actually be useful in applications, but it certainly brings a fresh perspective to problems that the ICML community cares about (e.g., hyperparameter optimization), and thus I'd be inclined to accept.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Here are a few more detailed suggestions:

+ Related work: There has been quite some work on Bayesian hyperparameter optimization in the community in recent years, for example:
  - "Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design" by Srinivas et al.
  - "Practical Bayesian Optimization of Machine Learning Algorithms" by Snoek et al.
  - "Non-stochastic Best Arm Identification and Hyperparameter Optimization" by Jamieson et al.

+ Be more clear about computational complexity: I'm missing the complexity (big-O), say, for the polynomial ranking rules. Having this would give me a better understanding of how practical the proposed approach is. Moreover, the authors should preface Section 3 by saying that the algorithms are first discussed without worrying about computation; otherwise the reader is left wondering how anything could possibly be practical.

+ Experiments: I recommend that the authors apply the method to an objective function more relevant to the ML community.
An example might be hyperparameter tuning of a neural net (where one might need to adjust learning rates, dropout parameters, batch normalization parameters, etc.).

+ A perhaps weird thought: when I think of optimization and how level sets are nested in each other, I think of Morse theory. Could there be a connection?

+ Line 509: Corollary -> Proposition.

+ Def 4: the order of x' and x is swapped relative to Def 1, which makes parsing a bit difficult. An example after Def 4 would also be helpful.

+ In the setup: do we assume that f is bounded? Implicitly we also seem to assume that max_x f(x) exists, but this should be stated very early on. Also, why do we care that the domain itself is compact and convex if we're relaxing so many assumptions about f? Is this used at some point?

+ Notation: the authors use the term "ranking structure", which it turns out just means "collection of ranking rules" (I think). This should be defined clearly at some point. Relatedly, at line 512, right before Def 5, "we … identify a ranking structure that contains the true ranking rule" sounds very vague, since a ranking structure satisfying this objective could just be all possible ranking rules, right?

=====