Paper ID: 775
Title: Extended and Unscented Kitchen Sinks

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper extends the framework of Steinberg & Bonilla (2014) for Gaussian processes with arbitrary link functions by accommodating both multiple outputs and random feature approximations to enable scalability. A variational inference scheme is proposed. The model strikes me as under-motivated, the novelty as somewhat limited, and the experiments as underwhelming.

Clarity - Justification:
The paper was reasonably clearly written. One minor criticism is that Section 4 could make it clearer, around (8) and (9), how the model derives from the random features approximation of the previous section.

Significance - Justification:
The authors correctly identify in Section 8 the difficulty of evaluating their paper: its novelty is limited to combining a few existing ideas (random kitchen sinks, multi-output GPs, the EGP/UGP). To a member of the community familiar with the EGP/UGP, the combination would not seem particularly significant.

Further, I'm not entirely convinced by the approach's EGP/UGP foundations. In particular, the proposed variational inference scheme seems odd. Under the linearisation proposed, I think the model should be tractable: why is the variational inference scheme needed? (I sketch this point at the end of this section.) There is some argument for this in the following passage.

> The main advantage of their algorithms is that the linear approximation is local and adaptive, as it is constructed around the posterior mean for a single data point n, and it gets updated at every iteration of the algorithm as a function of the variational parameters.

Nonetheless, the argument needs further development: under the linearisation, abandoning a closed-form expression in favour of a further approximation (the variational scheme) merits better motivation. I think the desirable updates the authors describe above would also result from treating the linearisation parameters as hyperparameters of the original model.

The proposed model requires the observations to be Gaussian-noise-corrupted, as per (9). This is inappropriate for classification tasks, where observations are discrete. How was this managed in practice?

The experiments don't provide compelling evidence for the proposed methods. Only one dataset, MNIST, is used to compare against the state of the art for classification (for which Gal et al. (2014), HMG and DB seem like reasonable stand-ins), and here the results do not make a particularly compelling case for the superiority of the proposed approach over existing techniques. Further, providing HMG with only 200 inducing points (even acknowledging that their locations must be learned) seems to give an unfair advantage to the proposed approaches, which use 1000 and 2000 random features. The proposed approach does allow more generic link functions, as evinced by the seismic inversion example, but if this is to be the major contribution of the paper, it should be more richly illustrated. This latter direction, I feel, is probably the most profitable means of improving the paper.
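To spell out the tractability point (using generic symbols $A_n$, $b_n$, $\Phi$, $w$ and $\Sigma$, which are not necessarily the paper's notation): if the likelihood for data point $n$ is linearised as

$p(y_n \mid w) \approx \mathcal{N}\big(y_n \mid A_n \Phi(x_n)^\top w + b_n,\ \Sigma\big), \qquad p(w) = \mathcal{N}(w \mid 0, I),$

then for fixed $A_n$ and $b_n$ the model is jointly Gaussian in the feature weights $w$, so the posterior over $w$ has a closed-form Gaussian mean and covariance. The only remaining question is how $A_n$ and $b_n$ are updated, which is why treating them as hyperparameters of the original model seems a natural alternative to introducing a further variational approximation.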
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
* "The variational evidence lower bound" <-- "The variational log-evidence lower bound"
* "seminar" <-- "seminal"

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a combination of the ideas from "extended"/"unscented" Gaussian processes and "random kitchen sinks" to construct a tractable algorithm for GP inference in settings with many data points and nonlinear likelihood functions. The building blocks and the algorithm itself are described in detail. The paper compares the method's performance to other methods on several experimental tasks: synthetic regression problems, handwritten digit classification, and a real inference problem from geostatistics.

Clarity - Justification:
The paper is clear and concise; it has exactly the right level of detail in the description of the building blocks and the constructed algorithm. Everything is laid out so that I could follow the steps without problems.

Significance - Justification:
The paper shows a combination of existing methods. It is a good contribution to the field, but not a fundamentally new approach.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This is a great paper and I enjoyed reading it! The paper is strong in presenting the building blocks clearly and at the right level of detail. The algorithm itself is also laid out so that it should be possible to reproduce it without big problems. Good work! I only have a few minor remarks:

1) The need for non-Gaussian likelihoods could be motivated or explained earlier. It became clear when discussing the nonlinear output function $g$, but it wasn't clear from the start.

2) I don't know the abbreviations COBYLA and BOBYQA, even though I am familiar with the basic optimization literature. The problem is that there are many different optimization methods out there. I would suggest placing the methods used within the optimization literature ("gradient-free optimization, e.g. ...") instead of just dropping the names of the software/algorithms.

3) One important detail is a bit hidden in the conclusion: that the big advantage of the UKS over the EKS is that it does not require an analytic derivative. I thought about this for a while and wasn't sure what the use of the UKS is, since it is "just" a different linearization and sometimes even worse than the EKS. It could be emphasized in Sec. 5.2 that no derivative is needed to compute it (see the derivative-free sketch at the end of this review).

4) The paper states multiple times that the new method is much faster and reduces the computational complexity of the EGP/UGP idea. Unfortunately, no timing results are reported in the paper, and the reader might be interested in the actual numbers compared to the other algorithms, since some parts of the algorithm (EM-like iterations until convergence, numerical gradient-free optimization, numerical quadrature) are quite costly as well.

Typographic issues:
1) The typesetting of abbreviations in small caps is not done consistently throughout the paper. E.g., line 377 has a tall-caps "EGP", and "COBYLA" and "BOBYQA" are also set in tall caps.
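To illustrate remark 3 with something concrete: the following is a minimal sketch of derivative-free moment matching via the unscented transform for a scalar latent (the function name and the choice kappa = 2 are mine, not the paper's). The nonlinear $g$ enters only through evaluations at the sigma points, whereas an EKS-style linearisation needs the derivative of $g$.

    import numpy as np

    def unscented_moments(g, m, v, kappa=2.0):
        # Propagate a scalar Gaussian N(m, v) through a nonlinear g using the
        # unscented transform: only evaluations of g are needed, no dg/df.
        L = 1  # scalar latent
        spread = np.sqrt((L + kappa) * v)
        sigma_pts = np.array([m, m + spread, m - spread])
        weights = np.array([kappa / (L + kappa),
                            0.5 / (L + kappa),
                            0.5 / (L + kappa)])
        gy = g(sigma_pts)                       # evaluate g at the sigma points
        mean = np.sum(weights * gy)             # approximates E[g(f)]
        var = np.sum(weights * (gy - mean)**2)  # approximates Var[g(f)]
        return mean, var

    # e.g. an exponential output function, with no derivative code anywhere
    print(unscented_moments(np.exp, 0.0, 0.5))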
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper develops a new approach for multitask GP regression with non-Gaussian likelihoods, using variational inference and random Fourier features through two different approximations to the likelihood.

Clarity - Justification:
I found your presentation to be clear, but would have liked more justification for the various choices in the model and inference strategy.

Significance - Justification:
Developing scalable methods for multitask GPs with non-Gaussian likelihoods is an important problem.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Thanks for an interesting paper. I agree that it is surprising that more has not been done with RKS-based GP models, but there are a few papers which consider this that you haven't cited, e.g. Dai et al., "Scalable Kernel Methods via Doubly Stochastic Gradients".

My main concern surrounds the use of variational inference. Since you focus on combining RKS methods with non-Gaussian likelihoods, it's not at all clear to me that variational inference makes sense as a first approach. With explicit features, non-Gaussian likelihoods are tractable, aren't they? (A minimal sketch of what I mean is given below.) Indeed, you mention as future work using a stochastic gradient optimiser to learn all model hyperparameters. I think it would also be worthwhile to consider MCMC methods, since the structure of your model seems like it might address the issues usually faced when combining MCMC with GP methods (correlations between parameters and hyperparameters make sampling difficult).

Beyond a discussion of why variational inference, it would be good to quantify what you lose by using it, and your seismic inversion problem is a great place to start. Since you've run MCMC, I'd like to see not just the comparison between posterior means, but also a comparison between posterior uncertainties, and I'd also like to see posterior distributions for the parameters in the model, with comparisons to MCMC where that's possible. Basically, I'm wondering whether the variational approximation is under- or overstating the uncertainty, by how much, and whether that is scientifically relevant.

I'd also like to see a discussion of connections between your approach and the Laplace approximation, which has been used extensively for GPs.

=====
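To make the "explicit features" point concrete, here is a minimal sketch of what I have in mind: with a random-feature expansion the latent function becomes a finite parametric model, so a non-Gaussian likelihood can be handled directly with (stochastic) gradient methods or with MCMC on the weights. All names, sizes and constants below are illustrative and not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # Random Fourier features approximating an RBF kernel:
    #   phi(x) = sqrt(2/D) * cos(x @ W + b),  W ~ N(0, 1/lengthscale^2),  b ~ U(0, 2*pi)
    def rff(X, W, b):
        return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

    D, d, N = 200, 2, 500
    lengthscale = 1.0
    W = rng.normal(0.0, 1.0 / lengthscale, size=(d, D))
    b = rng.uniform(0.0, 2 * np.pi, size=D)

    # Toy binary data under a Bernoulli (non-Gaussian) likelihood
    X = rng.normal(size=(N, d))
    y = (X[:, 0] * X[:, 1] > 0).astype(float)
    Phi = rff(X, W, b)                                 # explicit feature map, shape (N, D)

    # With explicit features this is a finite parametric model: MAP estimation of the
    # weights under a Gaussian prior is plain gradient ascent on the log posterior
    # (a stochastic-gradient variant would simply subsample rows of Phi and y).
    w = np.zeros(D)
    lr, prior_prec = 0.5, 1.0
    for _ in range(1000):
        p = 1.0 / (1.0 + np.exp(-Phi @ w))             # Bernoulli mean
        grad = (Phi.T @ (y - p) - prior_prec * w) / N  # gradient of (1/N) * log posterior
        w += lr * grad
    print("training accuracy:", np.mean((Phi @ w > 0) == (y > 0.5)))

The same log posterior could be handed to an HMC sampler, which is the sense in which the explicit-feature structure might ease the usual GP/MCMC difficulties.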