Paper ID: 337
Title: Hyperparameter optimization with approximate gradient

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper uses ideas from implicit differentiation to solve the hyperparameter optimization problem for a class of smooth functions using gradient descent. The trick in this paper is to use an approximation to the gradient, based on partial convergence of the inner optimization problem (up to an error tolerance), rather than relying on exact gradients of the inner problem. Combined with an adaptive step-size method, this approach empirically works very well on a set of simple hyperparameter optimization problems.

Clarity - Justification:
The paper is very well written, with the basic algorithm and its theoretical and empirical justification outlined in a straightforward manner.

Significance - Justification:
The algorithm is simple enough that it can be easily implemented, and it seems to work much better than other hyperparameter optimization algorithms for a specific class of smooth problems with a convex inner optimization. There are a number of important areas where this algorithm does not apply, however; namely, noisy functions, non-convex models, and discrete hyperparameters. Regardless, for the narrow class of problems it does cover, it could be the method of choice. Also, the ideas presented in this paper could present a good starting point for approaching more challenging cases.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I think this is a great idea, and for certain problems this approach makes a lot of sense. As mentioned in the paper, this method could potentially be useful in high-dimensional settings. To really showcase this, one suggestion would be to replicate the logistic regression experiment of Maclaurin et al., ICML 2015, section 3.2, where a separate L2 penalty is learned for each individual parameter.

Some minor comments:

How does one find the Lipschitz constant C of g? I couldn't find it in the paper and it is essential for the algorithm, but perhaps I just missed it. Please make this more obvious.

The exponentially decreasing sequence with e_1 = 0.1 is very strong, and I think that in practice this means the inner problem needs to be solved almost exactly within a few iterations. To be clear, my guess is that solving the inner problem to within e_k with that aggressive schedule would probably result in almost no change in, e.g., classification performance compared to solving the inner problem exactly. If this is incorrect, then perhaps an experiment showing the classification error at a tolerance of e_k versus solving the problem exactly would be helpful.
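[Editorial illustration.] To fix ideas about the kind of loop described in the summary above (inner problem solved only to a tolerance, approximate hypergradient via implicit differentiation, then a gradient step on the hyperparameter), here is a minimal, self-contained sketch. It is illustrative only and is not the paper's Algorithm 1: the ridge-regression inner problem, the helper names solve_inner and approx_hypergrad, the fixed outer step size, and the particular tolerance sequence are all assumptions made for this example.

    import numpy as np

    rng = np.random.default_rng(0)
    Xtr, ytr = rng.standard_normal((50, 5)), rng.standard_normal(50)  # toy train split
    Xva, yva = rng.standard_normal((50, 5)), rng.standard_normal(50)  # toy validation split

    def solve_inner(lam, tol, x0):
        """Inner problem (ridge regression with penalty exp(lam)), solved by
        gradient descent only until the inner gradient norm drops below `tol`."""
        x = x0.copy()
        L = np.linalg.norm(Xtr.T @ Xtr, 2) + np.exp(lam)  # smoothness constant of the inner objective
        for _ in range(10000):
            g = Xtr.T @ (Xtr @ x - ytr) + np.exp(lam) * x
            if np.linalg.norm(g) <= tol:
                break
            x -= g / L
        return x

    def approx_hypergrad(lam, x):
        """Implicit-differentiation hypergradient, evaluated at the approximate
        inner solution x (hence itself only approximate)."""
        H = Xtr.T @ Xtr + np.exp(lam) * np.eye(x.size)   # inner Hessian
        g_outer = Xva.T @ (Xva @ x - yva)                # gradient of validation loss w.r.t. x
        return -np.exp(lam) * x @ np.linalg.solve(H, g_outer)

    lam, x = 0.0, np.zeros(5)
    for k in range(1, 16):
        eps_k = 0.1 * 0.5 ** k                  # a summable, exponentially decreasing tolerance
        x = solve_inner(lam, eps_k, x)          # warm start; only eps_k-approximate
        lam -= 0.5 * approx_hypergrad(lam, x)   # outer gradient step on the (log-)hyperparameter
        print(f"k={k:2d}  eps_k={eps_k:.1e}  lambda={lam:+.4f}")

The warm start plus a loose early tolerance is what makes the first outer iterations cheap; as eps_k shrinks, each inner solve behaves more and more like an exact one, which is exactly the trade-off the comment above is about.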
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes an inexact gradient-based hyperparameter optimization algorithm. It performs gradient descent wrt the hyperparameters, but the gradient is allowed to deviate from the true, exact gradient by up to a controlled error sequence. The paper shows the convergence of such an algorithm and demonstrates its use on a few datasets.

Clarity - Justification:
The paper is nicely written and easy to read.

Significance - Justification:
The algorithm allows the gradient wrt the hyperparameters to be approximate, which reduces the computational time of the inner optimization. An analysis is provided for the convergence of such an inexact gradient method.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The main contribution of the paper is allowing an inexact gradient during hyperparameter optimization. During the outer loop, the algorithm maintains a summable sequence of error tolerances, and the inner optimization is only required to be solved to within that tolerance. Such a scheme is supposed to reduce computation time.

One concern is about the validity of Assumption 2 during the hyperparameter search. As long as the regularization parameter is positive, the Hessian is guaranteed to be positive definite. However, during the search, the regularization parameter may get arbitrarily small, which may lead to an arbitrarily small minimum eigenvalue of the Hessian. It is also possible that the minimum eigenvalue decreases at the same rate as the error tolerance. In this case, the term eps_k ||A_k^{-1}|| may not be well behaved.

Another issue is that the possible computational gain may be small. The algorithm only reduces computation if the error tolerance is relatively large. If the error tolerance decreases exponentially, it quickly enters the regime where the same amount of computation is required as for other "exact" methods. It would be interesting to analyze how different error-tolerance strategies balance computational savings against the requirement that the tolerance decrease to zero. From the experiments, the computational savings appear to be limited considering the generalization performance. Moreover, with only two hyperparameters to adjust, the experiments do not fully demonstrate the advantages of gradient-based hyperparameter optimization.
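[Editorial illustration.] To make the eps_k ||A_k^{-1}|| concern above concrete, here is a small, self-contained numerical sketch; the matrix, its eigenvalues, and the tolerance are invented for this example and do not come from the paper. It shows how an O(eps) error entering the implicit-differentiation linear system can be amplified by the inverse of the smallest Hessian eigenvalue.

    import numpy as np

    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))     # random orthonormal basis

    b = Q @ np.ones(5)                                   # right-hand side of the linear system
    eps = 1e-4                                           # stand-in for the inner tolerance eps_k
    for mu in (1e-1, 1e-3, 1e-5):                        # shrinking minimum Hessian eigenvalue
        A = Q @ np.diag([1.0, 0.8, 0.5, 0.2, mu]) @ Q.T  # stand-in "Hessian" A_k, min eigenvalue mu
        delta = eps * Q[:, 4]                            # O(eps) perturbation along the weak direction
        err = np.linalg.norm(np.linalg.solve(A, b + delta) - np.linalg.solve(A, b))
        print(f"min eig = {mu:.0e}   eps * ||A_k^-1|| = {eps / mu:.0e}   observed error = {err:.1e}")

If the smallest eigenvalue shrinks as fast as eps_k does, the product eps_k ||A_k^{-1}|| stays roughly constant instead of vanishing, which is precisely the scenario the reviewer is worried about.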
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a method that performs hyperparameter optimization based on approximate gradients. It views hyperparameter optimization as a bi-level optimization problem and alternates between top-level (hyperparameter) and lower-level (parameter) optimization. In particular, it runs the lower-level optimization up to epsilon optimality, with epsilon decreasing over time, and computes approximate gradients.

Clarity - Justification:
The paper is very well written. Nevertheless, some parts remain a bit unclear. For example, I would appreciate more details about how steps (i) and (ii) are implemented, how much time is spent in steps (i) and (ii) in practice, whether there were any adaptive stopping criteria for the L-BFGS step for the blackbox methods, etc. The authors do not abide by their own notation, which would make x \in R^p a capital X. Algorithm 1 has a bug: step (iv) should set \lambda_{k+1}, not \lambda_k, since this is what is needed in the next iteration. In the first step, \lambda_1 is not defined. How is \nabla_1^2 h computed in practice? This isn't described for the example applications either; describing it would improve clarity. Still, overall, clarity is above average.

Significance - Justification:
The problem is important, but there are many issues, detailed below.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
At first glance, this paper appears to be a clear accept: it tackles an important problem, introduces a new method for solving it, proves the method's convergence, and shows that it works in some experiments. However, at second glance, while I like the authors' approach, there are many issues precluding acceptance in the paper's current form:

- Not a huge problem, but it renders the theoretical part quite irrelevant: the experiments are for a different algorithm than the theory:
  + adaptive step size (no experiments are shown without it)
  + the computation of the Lipschitz constant makes the algorithm oscillate in practice

- I'm confused about one point. How do the authors know that they are within \epsilon_k of optimal? Is it necessary to know the value of the optimum to run the algorithm? (That would obviously render the algorithm inapplicable in practice.)

- There are experiments for logistic regression and kernel ridge regression, but both have problems. For kernel ridge regression the method worked worse than random search (dataset 3). And logistic regression only works because the authors redefined the loss function to a smooth function. That's not OK: one can't just change the problem; one has to change the method to work for the problem! Of course, if the authors only used a smooth loss *internally* in HOAG, but used the standard 0/1 loss in the comparison against other methods, that would be fine. Is that what the authors did?

- The authors tested 3 tolerance decrease strategies; 2 of them oscillated on 2 of the 3 benchmarks they tested, so they only called the third one (the exponential strategy) HOAG and compared that against the baselines. That's overfitting to the 3 datasets used for developing the method. The exponential strategy also has two magic constants (the 0.1 and the 0.5 in 0.1 * 0.5^k) which had to be set somehow (likely also using the same datasets); a short numeric sketch after this list shows how quickly that schedule tightens. In contrast, for the baselines, the authors took defaults and did not perform any tuning. This calls for new datasets to evaluate the method properly.

- The oscillation of 2 of the 3 tolerance decrease strategies appears to be an indication that the theoretical results do not translate to good performance in practice.

- The Bayesian optimization code the authors used as a baseline is under development and has not yet been vetted. The authors should try a standard implementation like Spearmint instead (Snoek et al., 2012).

- If I understand correctly, in their own algorithm the authors terminated the lower-level computation early to save time, but all the baselines had to run for the entire 100 steps, without any early stopping criterion. That's an unfair comparison: since their own algorithm is allowed to stop early, the baselines should be as well. In terms of Bayesian optimization, the method shares this idea with Freeze-Thaw, so this would be the natural thing to compare against. If that's not possible, at least some simple heuristics should be tested, such as using the same series of \epsilon_k as HOAG in Bayesian optimization, or using half the steps, etc.

- The authors should compute error bars for their plots, randomizing over different initializations, especially since the runs only take seconds.

- The authors should properly cite the software packages they used, not just give them as websites (scikit-learn, etc.).

- As the authors mention, they only consider the noiseless case, and it would be non-trivial to extend their approach to a noisy scenario. This makes the approach inapplicable to modern deep neural network tasks with mini-batch optimization (arguably the most important application area for hyperparameter optimization) and -- while it's a first step -- somewhat reduces the potential impact of the paper.
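[Editorial illustration.] As a quick look at the tolerance schedule discussed in the bullets above, assuming eps_k = 0.1 * 0.5^k (the paper's indexing may differ by one), the tolerance drops below 1e-4 after about ten outer iterations and below 1e-7 after twenty:

    # Hypothetical reproduction of the exponential tolerance schedule referenced
    # above; only the constants 0.1 and 0.5 are taken from the review.
    for k in (1, 2, 3, 5, 10, 15, 20):
        print(f"k={k:2d}  eps_k={0.1 * 0.5 ** k:.1e}")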
Rebuttal questions:

1) How do you know in step (i) of HOAG that you are within \epsilon_k of optimal if you don't know the value of the optimum?

2) For logistic regression, did you use the standard 0/1 loss to compare HOAG to the baselines (and only internally use a different loss in HOAG)?

3) Since I fear there was overfitting to the datasets used for developing HOAG (see above), I would like to see an evaluation on new datasets. Ideally, just download 20+ datasets from UCI (10 classification, 10 regression), run all approaches on all of them, and put a table into the rebuttal summarizing the results (using 0/1 loss for classification). Computationally, that should be easy since all the runs only take on the order of seconds. (If the different data formats in UCI are annoying, there's also openml.org; a minimal sketch of such an evaluation loop is given after these questions.) Good results in that setting would be much more convincing than what's in the paper.

4) After how many steps did the lower-level runs in HOAG stop (at the beginning and at the end)? It would be good to have a ballpark number to relate to the 100 steps done for the blackbox approaches.

5) How do the results compare to Bayesian optimization with Spearmint, using (1) entropy search and (2) expected improvement? (Code for these is available online (https://github.com/HIPS/Spearmint), and the authors should try this to compare against a vetted implementation, not one under development.)

6) Do you have any experiments on a more expensive hyperparameter optimization problem? In practice, it does not make much of a difference whether we need to wait for 2 seconds or 5 seconds, and from the paper I did not get any intuition on how the approach scales.

=====
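[Editorial illustration.] For concreteness, here is a minimal sketch of the kind of multi-dataset evaluation loop requested in rebuttal question 3. Everything in it is illustrative: the three OpenML dataset names are placeholders for the 20+ datasets the reviewer asks for, and a grid-searched logistic regression stands in for each of the methods being compared (HOAG itself is not implemented here).

    import numpy as np
    from sklearn.datasets import fetch_openml
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Placeholder dataset names; swap in whichever 20+ datasets are used.
    datasets = ["diabetes", "ionosphere", "blood-transfusion-service-center"]

    for name in datasets:
        bunch = fetch_openml(name=name, version=1, as_frame=False)
        X_tr, X_te, y_tr, y_te = train_test_split(
            bunch.data, bunch.target, test_size=0.3, random_state=0)

        # Tune the regularization strength on the training split only;
        # each competing method would get its own entry in a loop like this.
        model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
        search = GridSearchCV(model, {"logisticregression__C": np.logspace(-3, 3, 13)}, cv=3)
        search.fit(X_tr, y_tr)

        zero_one_loss = 1.0 - search.score(X_te, y_te)   # report 0/1 loss, as requested
        print(f"{name:40s}  0/1 loss = {zero_one_loss:.3f}")

The table the reviewer asks for is then just the collected 0/1 losses, one row per dataset and one column per method.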