Paper ID: 188
Title: A Distributed Variational Inference Framework for Unifying Parallel Sparse Gaussian Process Regression Models

Review #1
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a distributed variational framework for Gaussian process regression with non-isotropic noise (allowing for diagonal and correlated noise structures). The general variational framework is similar to that of Titsias (2009a,b) and its distributed counterpart by Gal et al. (2014). The main claims of the paper are (a) a scalable framework for inference in these types of models, applicable to big data; and (b) a unifying parallel/distributed version of common sparse GP regression methods. Experiments are carried out at medium (~40K training points) and large (~2M training points) scales.

Clarity - Justification:
The paper is well written and provides a very concise and clear summary of the state of the art in scalable sparse GP regression (SGPR) approaches. The paper is fairly easy to follow for a reader familiar with sparse GP methods, leaving most of the tedious derivations to the appendix. The paper is also very clear about its main contribution, namely providing a parallel/distributed implementation of sparse GP regression methods that allows for correlated noise structures in a “unifying” framework. One small suggestion to improve the clarity of the paper is to change the notation for the inducing variables, as confusion may arise between the calligraphic S and the noise covariance S. Most previous papers on sparse GPs use either u or f_u.

Significance - Justification:
Given the recent advances and interest in scalable GPs, the paper presents a significant contribution, allowing more complex SGPR approaches to be applied to large datasets. Contrary to the authors’ main claim, I see this contribution mainly as a practical one (yet significant). The key issue is that the main “unifying” idea is not novel: it is described for the FITC model by Titsias (2009b, Appendix C). Going from a noise structure that adds a diagonal term from the conditional prior to one that adds a block-diagonal term is hardly a novelty. Although everything looks straightforward in retrospect, it is clear that such a noise structure must arise from placing (conditional) independence constraints on the noise process. Although the authors do mention this (line 123), I believe they should give more credit to the original paper that presents this idea in Section 4. The development of the parallelisation that exploits the structure in the noise process seems sound.
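To make the diagonal-versus-block-diagonal point concrete, the noise structures at issue are roughly the following (written in the standard sparse-GP notation rather than the paper's, with Q_ff = K_fu K_uu^{-1} K_uf):

  DTC:      sigma^2 I
  FITC:     sigma^2 I + diag(K_ff - Q_ff)
  PITC/PIC: sigma^2 I + blkdiag(K_ff - Q_ff)

That is, the step taken here essentially changes which part of the conditional-prior covariance K_ff - Q_ff is retained (diagonal versus block-diagonal), rather than introducing a new modelling idea.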
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
(a) I think the authors could do a better job of explaining LMA and how it relates to the other sparse methods.
(b) The authors mention in line 205 that the noise covariance can have different parameters from the prior covariance, but their method relies on defining the noise structure based on the conditional prior. Could the authors please explain this? Can their method apply to more general noise processes? If so, have they performed experiments to substantiate this claim?
(c) Are the locations of the inducing points optimized in the proposed method and in the baselines considered?
(d) Section 4.3. There is nothing mysterious about the applicability of variational DTC (and its generalisations) to more complex noise structures. This is explained in Titsias (2009b).
(e) In line 696, the authors mention partitioning the training and test data into M blocks. This raises the question of whether the proposed method gets to see more data than the other benchmark methods. For the more complex models, is their method in a transductive setting? If so, this does not seem to be a fair comparison.
(f) Along similar lines, the results reported in Table 1 for the benchmark methods are taken from their previous publications. Are the training/test partitions exactly the same?
(g) One of the main advantages of using GPs is that they output a full predictive distribution. I believe every paper on GPs should evaluate the predictive distribution using measures such as the negative log predictive density (NLPD). This paper lacks this evaluation and only reports RMSE. For me, this is a major deficiency of the paper.
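To be concrete about what I am asking for in (g): with Gaussian predictive marginals, the per-point NLPD is 0.5*log(2*pi*s^2) + (y - m)^2 / (2*s^2), averaged over the test set. A minimal sketch in plain NumPy (variable names are mine, not the paper's):

import numpy as np

def gaussian_nlpd(y_true, pred_mean, pred_var):
    # Mean negative log predictive density under independent Gaussian
    # predictive marginals N(pred_mean, pred_var) at the test targets.
    y_true = np.asarray(y_true, dtype=float)
    pred_mean = np.asarray(pred_mean, dtype=float)
    pred_var = np.asarray(pred_var, dtype=float)
    nlpd = 0.5 * (np.log(2.0 * np.pi * pred_var) + (y_true - pred_mean) ** 2 / pred_var)
    return nlpd.mean()

Reporting this alongside RMSE would show whether the proposed noise structures actually improve the predictive variances, and not just the predictive means.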
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose distributed variational inference for several sparse Gaussian process models, exploiting a correlated Gauss-Markov noise process. They associate different sparse GP models with different noise correlation structures, providing a unifying view. They evaluate performance on several datasets.

Clarity - Justification:
The paper is clearly written, but it misses a large body of GP literature exploiting Gauss-Markov structure. See, for instance, many of Arno Solin's papers on GP state space models, such as:
http://jmlr.org/proceedings/papers/v38/solin15.html
http://arxiv.org/abs/1506.02267
or Chapters 2 and 4 of Yunus Saatchi's PhD thesis:
http://mlg.eng.cam.ac.uk/pub/pdf/Saa11.pdf

Significance - Justification:
There appears to be little methodological innovation, and the interpretation of these old and well-known models in terms of correlated noise is already somewhat known (see, for example, the original SPGP paper by Snelson and Ghahramani, 2006) and does not seem to have many practical implications.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is nice to read, but it does not appear to have much methodological innovation or practical significance. I think it is a nice paper, but perhaps not suited to the constraints of a conference format, since the authors lightly touch on many points that are already somewhat known in the community rather than pursuing a focused practical narrative. The paper also seems to miss most of the existing literature on Gauss-Markov processes.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
Through the definition of specific structured noise processes, the authors cast many GP approximations as performing VI. This extends the derivation of Titsias (2009) for DTC and FITC. The view allows the authors to derive distributed inference procedures for a large class of approximate GP models, extending the work of Gal et al. (2014) on DTC.

Clarity - Justification:
The paper is clearly written and well-organised. The results and the proofs are clear enough that expert readers will understand them. The paper is accessible to those in similar areas. The score for "Above Average" seems to be defined as "Accessible to most machine learning researchers", with "Below Average" defined as "Likely to be accessible only to those in similar areas", leading to my choice of "Below Average" (but this is common to the field of GP approximations, in my opinion).

Significance - Justification:
Other researchers are likely to use the ideas in the paper.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Major comments:
* Titsias's VI derivation does not recover FITC exactly: it has an additional KL term between the approximate distribution and the prior. I think your derivation would have similar terms. How do these correspond to the original definitions of the different GP approximations? (The standard DTC-case bound is written out, for reference, in the note after this review.)
* Also, Titsias's FITC derivation is quite problematic from the VI perspective. It ties the parameters of different variational random variables together. This allows, in principle, for an arbitrary manipulation of the lower bound, which goes against the underlying principles of VI. This suggests that VI should probably not be used to explain FITC. How does this relate to your other approximations? I think most of them tie variational parameters together.
* You claim that the derivations allow us to select an approximate model based on the desired noise structure. The Gaussian Markov random process defines a sequential dependence between sets of points. Does this make sense in high-dimensional spaces? Even on a plane I find it difficult to see when the assumptions made by this noise process are justified (mostly because we need to define an order over the tuples to decide which depend on which).
* It might make it easier for the reader if you were to relate key equations to the standard ones in Titsias's work (for example, relating the matrix P to the precision parameter tau*I).
* It would be nice to read Section 4.3, paragraph 2, in the context of Titsias's work (i.e., make a clear distinction between the contributions of this paper that build on Titsias's ideas and Titsias's own work).
* What is the number of inducing points used for each model in Section 6?
* The results you show are an average of multiple runs. Could you provide standard deviations as well?

Minor comments for the authors:
* "lack of rigorous approximation" - weird phrasing
* "GP" on lines 84 and 654 should be "GPs"
* Your noise definition suggests that it can take negative values?
* In eq. (1), what is S_DV on the RHS?
* On line 285: the KL does not define a distance metric, but rather a divergence.
* On line 419, blkdiag is not defined.
* In the citation of Gal et al., "gaussian" should be "Gaussian".

The paper is solid, interesting, and adds value to the ICML community.

=====
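A note on the first major comment in Review #3, for reference (standard notation rather than the paper's, with Q_ff = K_fu K_uu^{-1} K_uf): the collapsed variational lower bound of Titsias (2009) in the DTC case is

  log N(y | 0, Q_ff + sigma^2 I) - (1 / (2 sigma^2)) * trace(K_ff - Q_ff),

i.e. the DTC marginal likelihood plus an extra trace (KL-type) term. The question raised above is what the analogous extra terms look like for the other approximations recovered in this framework, and whether the resulting objectives are consistent with the original definitions of those models.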