We thank the anonymous reviewers for their insightful comments. We will gladly address their concerns and suggestions in the final version.
All three reviewers suggest showing computational savings on problems with more than 2 hyperparameters. To this end, we replicated the setting of Maclaurin et al., ICML 2015, Sec. 3.2, a problem with 7840 hyperparameters. With the same setting as in the previous experiments, the value of the objective function after 10 min was: 9053 (HOAG) < 9510 (Iterdiff) < 10372 (Random search) < 10877 (SMBO). After 30 min of computation these were: 8029 (HOAG) < 8069 (Iterdiff) < 10372 (Random search) < 10468 (SMBO). HOAG was the first to reach 10^-2 suboptimality, at 45 min. These results further emphasize the benefits of gradient-based HO and HOAG's cheap early iterations. Full results in terms of suboptimality and generalization (loss on the validation set) will be included in the final version.
Reviewers 1 and 3 raise the point that the exponentially decreasing tolerance sequence defaults within a few iterations to the "exact" method, and that intuitively this should not lead to large gains in performance. What we observed in practice (Fig. 2), however, is that by the time the tolerance reaches the level of exact methods (at around 25 iterations), the current iterate is already within a neighborhood of the solution. In this regime the change in hyperparameters is small, and the method converges fast thanks to warm starts. Furthermore, in practice the gain in generalization performance beyond 10 iterations is negligible (Fig. 3, bottom).
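The effect of warm starts can be illustrated on a toy problem. The sketch below is not HOAG's implementation; it is a minimal, hypothetical example where the inner problem is a strongly convex quadratic solved by gradient descent, showing that initializing from the previous solution sharply reduces the inner iteration count when the problem changes little:

```python
import numpy as np

# Hypothetical sketch (not HOAG itself): when the hyperparameter changes
# little, the previous inner solution is a near-optimal initialization,
# so the inner solver needs far fewer iterations than a cold start.

def solve_inner(A, b, x0, tol=1e-10, max_iter=10000):
    """Gradient descent on 0.5 x^T A x - b^T x; returns (x, #iterations)."""
    step = 1.0 / np.linalg.eigvalsh(A).max()  # 1/L step size
    x = x0.copy()
    for it in range(max_iter):
        g = A @ x - b
        if np.linalg.norm(g) < tol:
            return x, it
        x -= step * g
    return x, max_iter

rng = np.random.default_rng(0)
n = 10
M = rng.standard_normal((n, n))
H = M @ M.T + np.eye(n)              # strongly convex inner Hessian
b = rng.standard_normal(n)

x_prev, _ = solve_inner(H, b, np.zeros(n))
# A small hyperparameter change perturbs the inner problem slightly.
H2 = H + 1e-3 * np.eye(n)
_, warm_iters = solve_inner(H2, b, x_prev)        # warm start
_, cold_iters = solve_inner(H2, b, np.zeros(n))   # cold start
print(warm_iters, cold_iters)  # warm start needs fewer iterations
```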
Reviewers 2 (R2) and 3 (R3) raised other points, to which we reply individually:
R2: How do you know in step (i) of HOAG that you are within \epsilon_k of optimal if you don't know the value of the optimizer?
This is problem dependent. For mu-strongly convex inner optimization problems (such as the ones used in the experiments), we use the classical inequality mu ||X(lambda_k) - x_k|| \leq ||g'(lambda_k, x_k)||. We will include this important implementation detail in the final version.
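This certificate can be checked numerically. The following is a minimal sketch on a toy mu-strongly convex quadratic (all names here are illustrative, not from our implementation): the distance to the unknown optimum is bounded by the gradient norm divided by mu, so suboptimality can be certified without knowing X(lambda_k):

```python
import numpy as np

# Toy check of the strong-convexity certificate
#   mu * ||X(lambda_k) - x_k|| <= ||grad_x g(lambda_k, x_k)||
# on g(x) = 0.5 x^T A x - b^T x, which is mu-strongly convex.

rng = np.random.default_rng(0)
n = 20
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)            # positive definite, so mu >= 1
mu = np.linalg.eigvalsh(A).min()   # strong-convexity constant

b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)     # exact inner solution (unknown in practice)
x_k = x_star + 0.1 * rng.standard_normal(n)  # approximate iterate

grad = A @ x_k - b                 # gradient at the iterate (computable)
bound = np.linalg.norm(grad) / mu  # certified distance to the optimum
print(np.linalg.norm(x_k - x_star), "<=", bound)
```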
R2: For logistic regression, did you use the standard 0/1 loss to compare HOAG to the baselines (and only internally use a different loss in HOAG)?
We used the log loss for the "objective minus optimum" plots (Fig. 2 and Fig. 3, upper row) in order to compare methods with respect to their objective function. In Fig. 3, bottom row, we instead display generalization performance, and as such report the 0/1 loss, which we agree is the natural metric in this case.
R2 mentions that the oscillation of the tolerance decrease strategies appears to indicate that the theoretical results do not translate into good performance in practice. This is not correct. Like other approximate-gradient methods, HOAG is globally convergent but _not_ monotonically decreasing. In fact, some oscillation is expected when the decrease sequence does not match the convergence rate of the inner optimization method (see e.g. [Friedlander and Schmidt 2012]). We acknowledge that some phrases in the current text might lead to confusion, and we will improve them in the final version.
R2: the authors terminated the lower-level computation early to save time, but all the baselines had to run for the entire 100 steps.
This is not correct: the baselines were run until the desired tolerance was achieved (using the same criterion as the "exact" HOAG) or until they reached 100 iterations.
R2 also suggests comparing our algorithm against one with a fixed step size. However, contrary to the lower-level problem, computing the Lipschitz constant of the outer objective is not feasible in practice, short of evaluating f over its entire domain. In other words, gradient-based HO algorithms are not practical without an adaptive step-size strategy. This is, in our opinion, an important issue largely ignored in the previous literature.
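A standard way to adapt the step size when the Lipschitz constant is unknown is a backtracking (sufficient-decrease) rule. The sketch below is a generic illustration on a hypothetical one-dimensional outer objective `f_outer`, not our actual update rule:

```python
# Minimal backtracking sketch for an outer gradient step when the
# Lipschitz constant of f is unknown. f_outer/grad_outer are toy
# stand-ins for the outer objective and its hypergradient.

def f_outer(lam):
    return (lam - 1.5) ** 2        # toy outer objective, minimizer at 1.5

def grad_outer(lam):
    return 2.0 * (lam - 1.5)

def backtracking_step(lam, step, shrink=0.5, grow=2.0):
    """One gradient step with an Armijo-style sufficient-decrease test."""
    g = grad_outer(lam)
    # Shrink the step until f decreases by at least 0.5 * step * g^2.
    while f_outer(lam - step * g) > f_outer(lam) - 0.5 * step * g * g:
        step *= shrink
    # Accept the step and tentatively grow it for the next iteration.
    return lam - step * g, step * grow

lam, step = 3.0, 1.0
for _ in range(20):
    lam, step = backtracking_step(lam, step)
print(lam)  # close to the minimizer 1.5
```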
Finally, R2 suggests using Spearmint as the method for SMBO. In fact, Spearmint was dropped during the experimental validation due to its very poor performance: BayesianOptimization was faster by at least a factor of 2, although we did not investigate whether this was due to the difficulty of implementing warm restarts in Spearmint, to its database overhead, or to some other reason.
Other (R2): the notation is consistent: x (lowercase) refers to the vector, while X (uppercase) refers to the vector-valued function. lambda_1 is defined in Section 4.2. Thanks for spotting the typo in \lambda_{k+1}.
R3 questions the validity of Assumption 2 when the Hessian becomes ill-conditioned. However, our method (and all of its analysis) is set in the constrained setting, i.e., the regularization parameter is constrained to lie in an interval, e.g. [10^-6, 10^6]. In the case of L2 regularization this ensures that the smallest eigenvalue of the Hessian is >= 10^-6, so Assumption 2 is satisfied. We note that 0 is not a valid value in our experiments, since it would correspond to lambda = -infinity, violating Assumption 3.
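This lower bound is easy to verify numerically. The sketch below uses a toy, deliberately rank-deficient data Hessian (all names are illustrative): adding the L2 term alpha * I with alpha at the lower end of the constraint interval lifts the smallest eigenvalue to at least alpha:

```python
import numpy as np

# Toy check: with an L2 term alpha * I and alpha constrained to
# [1e-6, 1e6], the inner Hessian H = H_loss + alpha * I has smallest
# eigenvalue >= alpha >= 1e-6, even when H_loss itself is singular.

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 10))   # fewer samples than features
H_loss = X.T @ X / 5               # PSD but rank-deficient data term

alpha = 1e-6                       # worst case: lower end of the interval
H = H_loss + alpha * np.eye(10)
lam_min = np.linalg.eigvalsh(H).min()
print(lam_min)                     # approximately alpha
```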
We believe we have addressed all the points raised by R2 and R3, and we kindly ask them to reconsider their decision.