Paper ID: 882
Title: Pareto Frontier Learning with Expensive Correlated Objectives

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper is interested in multi-objective optimization (MOO) of expensive black-box objectives. The proposed approach elaborates on Bayesian optimization:
* the MOO is reduced to single-objective optimization via the hypervolume indicator (HVI);
* GP-based Bayesian optimization is applied to the above, and the correlations between the different objectives are modelled by adapting Bonilla et al. (2008) or Teh et al. (2005).
The contribution is to define a tractable GP-based approximation of the hypervolume indicator, which enables the use of Expected Improvement.

Clarity - Justification: see detailed comments

Significance - Justification: see detailed comments

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
"It's not entirely clear how one can evaluate a Pareto frontier". Actually, it is acknowledged that comparing two proposed Pareto sets is a MOO problem in itself.

It should be said earlier on that the focus of the paper is on L = 3 objectives and at most n = 100 samples. Also, that the end-user is usually interested in *sampling* the Pareto front: the diversity of the proposed solutions matters.

The switch from MOO to single-objective optimization via the HVI (illustrated in the sketch at the end of this review) is not the only option; another possibility is to directly model the Pareto dominance using a rank-based approach. See: I. Loshchilov, M. Schoenauer, M. Sebag: A mono surrogate for multiobjective optimization. GECCO 2010: 471-478. It would be interesting to include this work in the comparative evaluation.

Regarding the approach used to build an approximation of the HVI: I would be interested in knowing whether the authors considered the Log-Laplace transform, and whether it would make sense to use it to approximate the probability of lying in the desired interval.

The experiments must consider problems with non-convex Pareto fronts, which are much harder to find (it is well known that convex Pareto fronts can be characterized using linearization, i.e., a weighted sum of the objectives), and should show the diversity of the points found in independent runs.

Details:
- l. 518: from
- l. 692: A_22 = \rho
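To make the HVI reduction concrete for readers less familiar with it: in the bi-objective (minimisation) case the dominated hypervolume can be computed with a simple sweep. The sketch below is my own (the function name is invented), not the authors' code:

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Hypervolume dominated by a set of bi-objective (minimisation)
    points, measured w.r.t. a reference point `ref` that every point
    dominates. Maximising this indicator is the scalarisation the
    paper builds on."""
    # Keep the non-dominated points, sorted by the first objective.
    front, best_f2 = [], np.inf
    for f1, f2 in sorted(points, key=lambda p: p[0]):
        if f2 < best_f2:
            front.append((f1, f2))
            best_f2 = f2
    # Sweep: each front point adds a rectangle up to the previous f2 level.
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in front:
        hv += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return hv

# Example: two non-dominated points against reference (4, 4).
print(hypervolume_2d([(1.0, 3.0), (2.0, 1.0)], ref=(4.0, 4.0)))  # 7.0
```

Exact computation is cheap for L = 2 and still tractable for L = 3; the expensive part is not the indicator itself but the expectation of its improvement under the GP posterior, which is where the paper's approximation comes in.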
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents an approach for global optimization with correlated objective functions.

Clarity - Justification:
* I do not agree with the claim that relatively little research has been carried out on multi-objective Bayesian optimization. See, for instance, "A Bayesian approach to constrained single- and multi-objective optimization", Feliot et al., arxiv.org/abs/1510.00503, and references therein.
* The authors should avoid the overuse of "we" (see, for instance, lines 174-178, line 270, 316...).
* The expected improvement should not be used in the presence of noisy evaluation results (see, for instance, Picheny et al., "A benchmark of kriging-based infill criteria for noisy optimization", 2012).

Significance - Justification:
The approach proposed by the authors is based on an expectation propagation approximation of the expected improvement in the case of correlated GP models. To me, the approach is sound. The authors should consider publishing the code that was used for the numerical experiments; the significance of their contribution would be greatly improved.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
For me, the authors should not claim that little research has been carried out on multi-objective Bayesian optimization. The reference section should be improved in this direction.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper introduces a new model/acquisition function into the global optimization framework of Bayesian optimization that targets multi-objective optimization. Most work in Bayesian optimization has been done on single-objective problems, and the authors motivate the multi-objective case with a nice example. They describe the concept of hypervolume at length and review the work they build on extensively. The predecessor is an acquisition function which calculates the expected improvement in Pareto volume (EIPV), but it is limited to a probabilistic model that treats the objective functions as uncorrelated. Their modification is to replace the independent GPs for each output with a multi-output GP, and to introduce an approximation that makes the computation of EI feasible under correlation. The limitation of the original method to the independent case is due to the fact that taking integrals over slices of multivariate Gaussian distributions is hard (the sketch below makes this concrete). The contribution of this paper is to overcome this limitation by proposing a simple approximation to the integral of a multivariate Gaussian distribution. The authors use multi-task GPs and semiparametric latent factor GPs as surrogate models to capture the objectives and their correlations.
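To illustrate the quantity being approximated: under a correlated bivariate Gaussian predictive distribution, the expected hypervolume improvement has no convenient closed form, but it is easy to estimate by brute force. A minimal Monte Carlo sketch (my own code, names invented; minimisation convention, two objectives):

```python
import numpy as np

rng = np.random.default_rng(0)

def hv2d(points, ref):
    """Dominated hypervolume of a bi-objective (minimisation) point
    set w.r.t. a reference point `ref` that every point dominates."""
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in sorted(points, key=lambda p: p[0]):
        hv += (ref[0] - f1) * max(prev_f2 - f2, 0.0)  # dominated points add 0
        prev_f2 = min(prev_f2, f2)
    return hv

def mc_eipv(mu, Sigma, front, ref, n_samples=10_000):
    """Monte Carlo estimate of the expected increase in dominated
    hypervolume when the two objective values at a candidate point
    follow a *correlated* Gaussian N(mu, Sigma). This brute-force
    estimator is what any analytic approximation has to match."""
    base = hv2d(front, ref)
    ys = rng.multivariate_normal(mu, Sigma, size=n_samples)
    gains = [hv2d(front + [tuple(y)], ref) - base for y in ys]
    return float(np.mean(np.maximum(gains, 0.0)))

# Example: a candidate whose two objectives are negatively correlated.
front = [(1.0, 3.0), (2.0, 1.0)]
print(mc_eipv(mu=[1.5, 1.5],
              Sigma=[[0.25, -0.15], [-0.15, 0.25]],
              front=front, ref=(4.0, 4.0)))
```

With a diagonal Sigma the expectation factorises over the axis-aligned cells of the non-dominated region, which is what keeps the original (independent) EIPV tractable in closed form; the off-diagonal entries are precisely what forces the approximation the paper proposes.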
Clarity - Justification:
The paper is mostly well written, but in my opinion it does not handle its space well. It takes 4 pages until it gets to the new contributions, and then spends only a bit over a column on the new contribution (Section 3.2); even if we count its choices for the correlated GP (previously existing work) as new material, that is still only one page. And this page is too dense to really get all the details across. The experiments are varied (which is great) and verbose (which is not great; e.g., the authors could simply refer to the literature for some multi-objective black-box functions rather than giving the formulas).

As a result of spending 4 pages on old material and 1 page on new material, the old material is clear to me while the new material is not. I can follow the derivation of the acquisition function and the need for an approximation, but I do not understand why a scaled Gaussian should be a good approximation, nor how exactly it is similar to EP. This should be explained both in the paper and in the rebuttal. Also, it would be good to discuss the multi-task GPs and the SLF GPs in more detail and give their pros and cons beyond computational complexity. This requires space, but since this is the actual new part, that space would be well invested and can be taken from the introduction and the experiments.

Significance - Justification:
The topic of the paper is quite important. Multi-objective optimization is a real-world problem present in the optimization of ML algorithms, for example trying to achieve maximal accuracy while obtaining predictions in real time. So far, with Bayesian optimization this could only be tackled either with crude heuristics or by ignoring the interactions between the different objectives. This paper proposes a new acquisition function for multi-objective optimization which is both principled and takes the correlation between the objectives into account.

At the same time, the proposed method is not drastically new but a fairly direct extension of existing work. The experimental section shows that the proposed method compares favourably to EIPV and another competitor (ParEGO; its scalarisation is recalled in the sketch after these comments), but it fails to cite, discuss, and experimentally compare to other related work:
- Hernandez-Lobato et al., BayesOpt 2015: "Predictive Entropy Search for Multi-objective Bayesian Optimization".
- Zuluaga et al., ICML 2013: "Active Learning for Multi-Objective Optimization". This also compared to ParEGO and showed much improved performance.

As the authors acknowledge, there is also a rich literature on multi-objective optimization in the evolutionary computation community. The authors should therefore cite, discuss, and experimentally compare to some prominent work from that community. The natural choice would be NSGA-II [Deb et al., IEEE Transactions on Evolutionary Computation, 2002, "A Fast and Elitist Multiobjective Genetic Algorithm"], which has over 16,000 citations on Google Scholar and is the method to beat in multi-objective optimization.

Finally, I think the authors should acknowledge that all their experiments are bi-objective, not really "multi". Would the method also work for the "multi" case, or would the quality of the approximations deteriorate a lot?

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I rated both clarity and significance as below average, but both have a lot of potential. I think this will make for a very nice paper once the remaining issues are fixed, and I strongly encourage the authors to continue this line of work. But before publication I believe the paper should describe the new contribution in more detail and compare to the best possible baselines.

Misc:
- Line 90: strong results have also been obtained with tree-based models, e.g., [Yamins et al., ICML 2013: "Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures"] and [Thornton et al., KDD 2013: "Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms"].
- Line 211: the authors mention a potential problem of their approach but do not come back to it later. Can the authors please comment on this in the rebuttal?
- Lines 414-420: this fact might not be well known to everybody; please add a reference or explanation here to make the paper more accessible.
- Lines 490-494: the authors explain how to obtain derivatives of the acquisition function, but never use them later.
- Figure 2 should be on page 6, not on page 4, because it is first referenced on page 6.
- The authors should explain why they chose the oka2, vlmop3, and dtlz1a test functions for multi-objective optimization.
- I do not fully understand the neural network benchmark: why does it have only two objectives, namely accuracy and "a combined measure of memory consumption and training time of the neural network", rather than three objectives with the combined measure split in two? Did the authors not implement their method for more than 2 objectives?
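For reference, since ParEGO is the strongest baseline actually compared against: its random-weight augmented Tchebycheff scalarisation is what lets a single-objective GP/EI loop reach non-convex parts of the front, unlike a plain weighted sum. A sketch from memory (the rho = 0.05 default is Knowles' choice, not from this paper; the function name is mine):

```python
import numpy as np

def parego_scalarisation(F, weights, rho=0.05):
    """Augmented Tchebycheff scalarisation used by ParEGO
    (Knowles, 2006). `F` is an (n, L) array of objective values,
    assumed normalised to [0, 1]; `weights` is a length-L vector,
    redrawn at random at each BO iteration."""
    F = np.asarray(F, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # project onto the simplex
    # The max term keeps optima on non-convex regions of the front;
    # the small linear term breaks ties between weakly dominated points.
    return np.max(w * F, axis=1) + rho * np.sum(w * F, axis=1)

# Example: scalarise three candidate points under uniform weights.
F = [[0.2, 0.9], [0.5, 0.5], [0.9, 0.2]]
print(parego_scalarisation(F, weights=[0.5, 0.5]))
```

A single GP is then fitted to these scalarised values and standard EI is applied, which is why ParEGO ignores inter-objective correlations, exactly the gap this paper targets.

=====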