Paper ID: 103
Title: Minimum Regret Search for Single- and Multi-Task Optimization

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper introduces a novel Bayesian Optimization objective. Like Entropy Search, its reasoning involves the belief over the location of the extremum induced by the GP over the objective itself. In contrast to the aforementioned method, the proposed objective does not aim to reduce the entropy over the minimum's location, but rather the expected regret, under this belief over the minimum, of evaluating at the most likely location of the extremum in the evaluation following the current one. The idea is that this encourages the method to explore "dangerous" locations, which may contain a particularly large optimum, first.

This is a well written paper about a significantly novel idea in a hot topic of machine learning. The experiments are convincing, and the idea is intriguing. I have only minor comments below, which I hope the authors will address in the feedback.

Clarity - Justification:

The paper is written clearly, in clean English, and the experimental results are presented well.

Significance - Justification:

While there has been a lot of buzz around Bayesian Optimization lately, novel acquisition functions are rare (there are only about 4 or 5 in wider use at the moment). This paper arguably introduces another one, albeit one related to an existing one.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

I only have comments about the presentation of related methods, and of BO in general. I hope the authors will address all of these.

* The abstract and introduction sound like Bayesian Optimization is always about minimizing regret. This is not true: it is about using a probability measure over the objective function to reason about evaluation points. In particular, the Entropy Search algorithm is decidedly not designed to minimize regret. The paper by Hennig & Schuler even contains an experiment to show that ES is "sub-optimal" in terms of regret. And one can construct special cases in which ES does the exact opposite of minimizing regret: e.g., if the objective is known to be a quadratic function (with only two unknown parameters, an extremum at zero, evaluated with noise), then ES evaluates in regions with high expected signal-to-noise ratio, i.e. areas believed to be far away from the extremum. ES is about collecting information about the extremum, with no care for the actual collected function values. In fact the authors may want to use this aspect as an argument *for* their method: in situations where regret matters, ES is just not the right algorithm. (A minimal sketch of the quadratic example follows at the end of this review.)

* Lines 194-203: It is unclear at this point whether PES is actually cheaper than ES, as the authors claim. The paper by Hernández-Lobato et al. primarily claims that the PES objective is a closer approximation to the expected change in entropy over the extremum than the formulation by Hennig & Schuler, not necessarily that it is faster. PES requires conditioning on, then marginalizing over, the location of the global optimum. This requires repeated global numerical optimization of GP samples, which can be taxing (the same holds for the treatment of hyperparameters: each evaluation of the hyperparameter likelihood requires re-locating the global extremum). Since this paper doesn't even use PES at all, claims about the strength of that algorithm should not be made without experimental evaluation.
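As promised above, a minimal sketch of the quadratic special case (my own construction, not from the paper; the noise level and priors are assumptions): with f(x) = a*x^2 + c and the extremum known to lie at x = 0, a single noisy observation is most informative about (a, c) when taken far from the extremum, so a purely information-seeking criterion samples far from the known minimizer.

```python
import numpy as np

# Toy case: f(x) = a*x^2 + c, extremum known to be at x = 0, only the
# two parameters (a, c) unknown, observations noisy. An information-
# seeking criterion picks the x that most shrinks the posterior over
# (a, c). Noise level and priors are assumptions for illustration.

sigma_n = 0.1            # observation noise std (assumed)
prior_cov = np.eye(2)    # unit Gaussian prior on (a, c) (assumed)

def posterior_entropy(x):
    """Entropy of the Gaussian posterior over (a, c) after one noisy
    observation at x (Bayesian linear regression in features [x^2, 1])."""
    phi = np.array([[x**2, 1.0]])
    post_prec = np.linalg.inv(prior_cov) + phi.T @ phi / sigma_n**2
    post_cov = np.linalg.inv(post_prec)
    return 0.5 * np.log(np.linalg.det(2 * np.pi * np.e * post_cov))

candidates = np.linspace(-3, 3, 61)
best = candidates[int(np.argmin([posterior_entropy(x) for x in candidates]))]
print(f"most informative point: x = {best:+.2f}")  # |x| = 3, far from the optimum at 0
```

Nothing in this criterion penalizes the regret incurred at the evaluation itself, which is precisely the point of the first comment above.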
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The authors propose a method for Bayesian optimization which is claimed to provide a more direct approach to finding a point with minimal simple regret after N iterations.

Clarity - Justification:

The paper is reasonably clear.

Significance - Justification:

The approach provides performance which is on par with existing methods but not necessarily significantly better.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

The main empirical results claimed by the authors show that MRS can do better than methods such as entropy search when compared using the mean regret (i.e. figure 2). This could also be a result of the more global exploration performed by entropy search, which is exactly the approach advocated by the authors. See for example the theoretical results of Bubeck and Munos, the work of Hoffman et al., "On correlation and budget constraints...", de Freitas et al., "Exponential Regret Bounds for GP Bandits...", or Soare et al., "Best arm identification in linear bandits". Note that although these works differ from the approach presented here, they do advocate for more exploration, which can result in similar empirical findings.

Moreover, it is a bit strange that with this as the guiding principle the authors then return to a single-step prediction problem. Not that this approach is in any way wrong, but it does need some justification. This approach also seems somewhat similar to the Knowledge Gradient method, which should be discussed.

Finally, it should be noted that the difficulty of marginalizing over hyperparameters when comparing ES methods does not exist for the IAGO approach of Villemonteix et al., although that approach is limited to using a discretization of the space.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper proposes a new information-based acquisition function for Bayesian optimization (BO) called minimum regret search (MRS). It is related to entropy search and performs more robustly in the presented experiments. The basic idea of MRS stems from the notion that BO is usually used to identify an input that performs well. This optimum is then judged by its immediate regret (the difference between its output and the true optimal value). It only seems natural to optimize this quantity directly during the BO routine. After introducing the new acquisition function, the authors make the necessary approximation by simply sampling (which is not necessarily the best choice, as it performed worse than EP in Hennig et al., but for the sake of simplicity it's OK here).

Clarity - Justification:

The paper is quite clear to read.

Significance - Justification:

I don't believe that the acquisition function is optimizing quite the right thing, and the experiments leave several questions open; details below.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

The problem formulation of optimizing the expected regret reduction (ERR) at the best point \tilde{x}_n appears to have a fundamental problem: it fixes \tilde{x}_n and then tries to change the predictive distribution to maximally reduce the regret of this fixed \tilde{x}_n under the predictive distribution. (MRS has exactly the same problem, only that it does this for multiple fixed \tilde{x}_n in a Bayesian fashion.)
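For concreteness, the quantity in question is the expected simple regret E_{p(f)}[f(\tilde{x}_n) - min_x f(x)], which can be estimated by a plain Monte Carlo average over joint GP posterior samples. The sketch below is my own illustration with a placeholder posterior; the grid, kernel, and sample count are assumptions, not the paper's choices.

```python
import numpy as np

# Monte Carlo estimate of the expected simple regret E_{p(f)}[f(x) - min f]
# for every candidate x on a grid, under the current GP belief. The mean
# and covariance below are placeholders standing in for an actual GP
# posterior; everything here is an assumed illustration.

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 50)
mu = np.sin(6.0 * grid)                                        # placeholder posterior mean
K = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / 0.1)**2)  # placeholder posterior cov

# Joint samples of f on the grid; the jitter keeps K numerically PSD.
F = rng.multivariate_normal(mu, K + 1e-8 * np.eye(grid.size), size=2000)

expected_regret = (F - F.min(axis=1, keepdims=True)).mean(axis=0)
x_tilde = grid[int(np.argmin(expected_regret))]  # the fixed \tilde{x}_n in ERR
print(f"point of minimal expected regret: x = {x_tilde:.3f}")
```

ERR then asks how a single new observation changes this average while holding \tilde{x}_n fixed.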
The reason that I think this is broken is that MRS does not reward gathering information about \tilde{x}_n being suboptimal (which would make us choose another point). In the explicit illustration the authors give of how their acquisition function behaves, they argue: "Thus, sampling in this parameter range can reduce the expected regret considerably either by confirming that the true value of f(x) on x >= 3.5 is actually as small as expected or by switching \tilde{x}_{n+1} to this area if f(x) turns out to be large." I fully agree with this statement, but the problem with MRS is that it only targets the first of these terms (confirming that the current \tilde{x}_n is good) and not the second (gathering information that makes us change \tilde{x}_{n+1}).

Experiments in a toy setting compare MRS against EI, UCB and ES, showing that the median and mean performance can yield a qualitatively different picture among the presented algorithms. MRS performs well in these settings, but I see three problems with these results:

1) It very much looks like UCB would perform better with N > 100, as its slope is much steeper than that of the other methods, but the plots are conveniently cut off at N=100. That is unacceptable. Rebuttal question 1: What is the performance of UCB and MRS in Experiment 1 with N=200, in terms of mean and median? (Please do not answer that that experiment is too expensive to run, as it really shouldn't be.)

2) PES is not included. While an argument can be made that this is to compare apples with apples, this is a downside, since it has been argued (even in this paper!) that PES is better than ES.

3) This only shows the case of no model mismatch. In practice, we of course always have *some* model mismatch, so we had better evaluate what model mismatch does to the methods. So, while some experiments without model mismatch are useful, there should also be some with it. Rebuttal question 2: How do the methods compare on the standard functions Branin, Hartmann 3 and Hartmann 6?

The idea to minimize the regret suffered if BO were stopped at the next iteration is clearly a good one, but it also addresses a different issue without explicitly talking about it: which value should be returned as the result of BO? While Hennig et al. minimize the GP model, and algorithms like EI usually return the best seen observation, here the point with the minimal expected regret (according to the model) is returned. This issue is rarely addressed, but in my experience this alone can have a dramatic influence on the quality of the proposed optimum. This distinction between the policy that chooses the points to evaluate and a "recommender" for the final choice is, for example, addressed in Bubeck et al. (already cited, but for totally different reasons) in the bandit setting. Changing the acquisition function and the way the "solution" is selected simultaneously might lead to wrong conclusions. I would be very interested in the performance of ES with the point of minimal expected regret as the recommender, and vice versa in how MRS does when judged by the optimum of the posterior mean. Rebuttal question 3: Do the authors have any experimental data on this?

One issue that irritated me is the fact that in the first set of experiments GP-UCB is used, while for the robotics task the parameter of UCB is arbitrarily set to 5. This seems inconsistent to me. Furthermore, the citation of Bubeck et al. here seems to be used to justify this value, but that paper has actually very little to do with choosing it for continuous input spaces.
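For reference, a sketch contrasting the two choices (my own illustration: I assume the fixed value 5 plays the role of the multiplier kappa = sqrt(beta), and the dimension and confidence parameter in the schedule are assumed, not taken from the paper):

```python
import numpy as np

# GP-UCB for minimization: alpha_n(x) = mu(x) - sqrt(beta_n) * sigma(x).
# The theoretically motivated schedule of Srinivas et al. (2010) grows
# slowly with n, whereas fixing the multiplier at 5 corresponds to a
# constant beta = 25. D and delta below are assumed for illustration.

def beta_srinivas(n, D=2, delta=0.1):
    """beta_n = 2 log(n^(D/2 + 2) * pi^2 / (3 * delta)), up to constants."""
    return 2.0 * np.log(n**(D / 2 + 2) * np.pi**2 / (3.0 * delta))

def ucb(mu, sigma, beta):
    return mu - np.sqrt(beta) * sigma   # minimization convention

for n in (1, 10, 100):
    print(n, beta_srinivas(n))  # compare against the fixed beta = 5**2 = 25
```

Under these assumed constants the two choices imply noticeably different exploration weights, especially early in the run, which is why the inconsistency matters.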
The second benchmark, the robot arm throwing a ball, is very interesting. I was trying to reproduce some of the results, but I could not find enough information to set it up properly. It would be very helpful if some source code (probably MATLAB, I guess) for the simulation could be made available somewhere.

The Multi-Task section mentions that it is straightforward to generalize MRS to this case, but no explanation is given. It might be just me, but estimating the expected regret marginalized over all contexts (I assume that is the straightforward generalization?) seems to require more assumptions. For example, are all contexts equally likely, or should the training data be used as a representative sample to estimate this quantity? Furthermore, the handling of the GP hyperparameters is not explained at all, but their treatment (e.g., MCMC sampling) can have a considerable impact on the quality of the model.

Some minor remarks:

- The introduction makes it sound like BO only works for expensive, noisy, non-convex black-box functions, whereas the first paragraph of section 2 is more diplomatic in saying that it can be applied to these. I would rewrite the introduction to be in tune with this.

- line 49: Automatic machine learning (AutoML) is not quite the same as hyperparameter optimization; AutoML usually has lots of hyperparameters & structured spaces. Bergstra et al. (2011) is a good citation for hyperparameter optimization, but for AutoML, Bergstra et al. (2013) [ICML 2013, "Making a science of model search"] is a much better reference. Likewise, another standard reference for AutoML based on Bayesian optimization is [Thornton et al., KDD 2013, "Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms"].

- line 105: "black-back" -> "black-box"

- line 111: "historic" -> "previous"

- line 227: You state that minimizing the uncertainty about the location of the minimum and the common objective of minimizing the immediate regret are different, albeit related. Do you have a source for that? I intuitively agree with the statement, but without a reference it seems unjustified.

- line 243: The first X should be a \cal X

- lines 348ff: Doesn't recycling the samples drawn to estimate E_{p(f)} to estimate p^* on the representer points incur a bias? Can you account for that somehow, or have you compared it to drawing new samples? Or are you referring to the standard trick of using the same samples to evaluate the max over the representer points for the prior and the posterior?

In summary, I very much like the topic of the paper, but I'm not convinced the acquisition function optimizes "the right thing", and I'm not fully convinced by the experiments. However, I am convinced there is something to gain by explicitly reasoning about the regret that will be obtained in the future, and I strongly encourage the authors to pursue this direction further. I'm looking forward to reading the rebuttal.

=====