Paper ID: 1129
Title: Preconditioning Kernel Matrices

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper performs an exhaustive empirical evaluation of a wide range of kernel matrix preconditioners in terms of their impact on the convergence of the preconditioned conjugate gradient algorithm used to solve linear systems arising in Gaussian process regression. The insights are applicable to many other kernel machines.

Clarity - Justification:
The paper is well written and easy to follow; the experiments are concise and conclusive.

Significance - Justification:
While the idea of using preconditioners is not novel, there was no empirical evaluation of the respective gains of the possible options. The paper somewhat closes this gap. It is clear that MVMs with kernel matrices remain expensive, i.e. O(n^2); hence, one can argue that dense kernel matrices are in general doomed to fail in providing scalable algorithms, but there is still some value. The bold claim "We present the first approach" in Section 2.2 is certainly not true. At least Flaxman et al. [i] use LCG together with the Laplace approximation, but others have certainly done this before.
[i] http://jmlr.org/proceedings/papers/v37/flaxman15.pdf

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- A) Figure 1: The authors add symbols to "facilitate b/w print", but they could have chosen a colormap that helps as well; see http://bids.github.io/colormap/.
- B) Equation (2) is missing the noise variance.
- C) Introduction: The authors should be more explicit about the O(n^2) scaling of kernel machines with an exact dense covariance in general, and avoid the temptation of suggesting that the use of PCG can provide anything more than reducing the number of O(n^2) operations, i.e. bringing down the constant (see the sketch after this list).
- D) Line 802: "meth ods".
- E) Line 184: "Traditionally ..." This is an odd sentence. Cholesky is used for its speed, simplicity and stability.
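To make point C concrete, the following is a minimal sketch of PCG for a kernel system, not the authors' implementation: the dense kernel matrix enters only through matrix-vector products, so a good preconditioner reduces the number of O(n^2) products per solve rather than the cost of each product. The SE kernel, the Jacobi (diagonal) preconditioner, and the helper names (pcg, kernel_mv, precond_solve) are illustrative assumptions.

    import numpy as np

    def pcg(kernel_mv, b, precond_solve, tol=1e-6, max_iter=200):
        # Preconditioned conjugate gradients for (K + sigma^2 I) x = b.
        # kernel_mv(v) is the dense O(n^2) matrix-vector product; precond_solve(r)
        # applies the preconditioner and should be much cheaper than kernel_mv.
        x = np.zeros_like(b)
        r = b - kernel_mv(x)
        z = precond_solve(r)
        p = z.copy()
        rz = r @ z
        for _ in range(max_iter):
            Kp = kernel_mv(p)                 # the one O(n^2) operation per iteration
            alpha = rz / (p @ Kp)
            x += alpha * p
            r -= alpha * Kp
            if np.linalg.norm(r) < tol * np.linalg.norm(b):
                break
            z = precond_solve(r)
            rz_new = r @ z
            p = z + (rz_new / rz) * p
            rz = rz_new
        return x

    # Illustration with an SE kernel and a simple Jacobi (diagonal) preconditioner.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 3))
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K_hat = np.exp(-0.5 * sq_dists) + 0.1 * np.eye(500)      # K + sigma^2 I
    y = rng.standard_normal(500)
    alpha_vec = pcg(lambda v: K_hat @ v, y, lambda r: r / np.diag(K_hat))

With precond_solve set to the identity, the same routine reduces to plain CG; the question the paper studies empirically is how much the iteration count drops for different choices of preconditioner.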
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The idea of preconditioning kernel matrices in kernel machines is proposed to help exploit CG schemes within approximation schemes. Illustrative examples are provided to demonstrate the utility of the approach.

Clarity - Justification:
Nicely written and clear - not much effort is required to understand the substance of the paper.

Significance - Justification:
This is an incremental advance for scaling kernel machines up to large data sets. Preconditioning has been used for quite some time in spatial statistics - something that the authors do not seem to be aware of. Nevertheless it is a useful contribution for kernel machines and their deployment.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
A nicely written and illustrated paper that suggests a way to improve the scaling of kernel machines - overall the ideas are good and would make a nice contribution to the conference.

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper is about performing model selection for GPs in a more effective way, based on established tools for solving linear systems such as the preconditioned conjugate gradient. The main contributions of this paper are 1) a framework of preconditioning kernel matrices that subsumes various sparse GP methods by writing them as different preconditioners; 2) stochastic gradient learning to optimize the marginal likelihood with a stochastic approximation of the trace term; and 3) extensive experiments comparing different preconditioners.

Clarity - Justification:
The paper is mostly readable, but the presentation can definitely be improved. See detailed comments below.

Significance - Justification:
It is a promising paper and I like the sections where the authors nicely summarize various sparse GP approaches as different preconditioners in a scholarly fashion. Other than this, unfortunately, I don't see any significant contribution.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Given that (Scrinivasan et al., 2014) is highly relevant, I'm surprised that there's no discussion of this work.
Line 113: Which example? Non-Gaussian likelihoods in 2.2? Also, I don't see the discussion on "making predictions in GPs."
Line 154: Write out \theta = \{\sigma^2, l_r^2\} here.
Line 167: It would be great to provide several examples of $h$ here, for example, probit regression or Poisson regression.
Line 171: Since the primary focus of this paper is model selection for GPs, you should make this clear at the beginning.
Line 212: Can you add a discussion of the sample complexity of this approximation?
Section 2.2: I don't see the novelty here. The Laplace approximation seems to be standard, and the marginal likelihood can be found in standard textbooks such as (Rasmussen and Williams, 2006, see (5.20)).
Line 332: After the equation there's a comma missing. Similarly for line 390.
Line 697: Given *the* predictive mean and ...
Line 725: "In view of the results obtained ...", I don't know what this sentence means.
Line 731: Why do you choose N_r = 4? How does this affect the approximation error? (See the sketch after these comments.)
Line 758: Are you also using the R version of GPstuff? If not, how would the running times of different methods be comparable?
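For the N_r question (line 731), the following is a minimal sketch of how such a stochastic approximation of the trace term is usually implemented, namely the Hutchinson estimator applied to the trace in the gradient of the GP log marginal likelihood; whether the paper uses exactly this form is an assumption, and the function names (stochastic_trace, solve, dK_mv) are hypothetical. The estimator is unbiased and its variance shrinks like 1/N_r, which is the trade-off behind the choice N_r = 4.

    import numpy as np

    def stochastic_trace(solve, dK_mv, n, n_probe=4, seed=0):
        # Hutchinson estimate of tr(K_hat^{-1} dK/dtheta), the trace term in the
        # gradient of the GP log marginal likelihood.
        #   solve(v) : returns K_hat^{-1} v, e.g. computed with preconditioned CG
        #   dK_mv(v) : matrix-vector product with the kernel derivative dK/dtheta
        #   n_probe  : number of random probe vectors (the reviewers' N_r)
        rng = np.random.default_rng(seed)
        total = 0.0
        for _ in range(n_probe):
            r = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe vector
            total += r @ solve(dK_mv(r))          # r^T K_hat^{-1} (dK/dtheta) r
        return total / n_probe

    # Usage with the PCG sketch from Review #1 (names are hypothetical):
    # trace_est = stochastic_trace(lambda v: pcg(kernel_mv, v, precond_solve), dK_mv, n)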
=====
Review #4
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The main contribution of this paper is to argue for the use of pre-conditioning when solving kernel-based systems. The running example used is Gaussian process regression, but the thesis of the paper applies to any kernel machine that can be solved via conjugate gradient (which covers the majority of popular kernel methods). They give examples of general-purpose pre-conditioners for GPs that outperform efficient Cholesky decomposition implementations, but which have the potential to scale much higher, due to not requiring an explicit representation of the kernel matrix. Importantly, they also point to a reframing of approximate kernel methods as pre-conditioners, which allows for the practical speed increases of the approximation within a framework that leads to the true solution (see the sketch appended after the reviews).

Clarity - Justification:
The paper is very clear and well explained.

Significance - Justification:
While this paper doesn't introduce novel techniques, it has substantial practical value. Pre-conditioning of linear systems is of critical importance - it is in fact the most important element in the performance of numerical algorithms - yet it is barely mentioned in most of the literature surrounding GPs and other kernel machines. As these methods are scaled up, it seems almost certain that they will be using gradient-based optimization techniques rather than relying entirely on full or even partial decompositions. If this is the case, it is a critical insight that the wealth of approximation techniques for kernel methods can be directly incorporated into iterative solvers as pre-conditioners, and this provides a strong foundation for further developments in this area.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The key premise of the paper is an important one, and the paper does a good job of communicating it. The experiments provide some good justification for the practical applicability of this technique - especially the use of wall-clock time. I can understand the use of the SE kernel as the running example for the paper, though I think some experimental impact may be lost by not considering other kernels that are more computationally tractable. Overall I think the direction is strong, and considering the parallelized approaches allowed by CG would provide even stronger experimental results.
=====
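As an illustration of Review #4's central point, the reframing of approximate kernel methods as pre-conditioners, here is a minimal sketch of a low-rank (Nystrom-style) approximation used as the preconditioner inside the PCG routine sketched under Review #1. The inducing inputs Z, the noise variance, and the function names are illustrative assumptions rather than the paper's construction; since the approximation only steers the iterations, the solver still converges to the exact solution of the dense system.

    import numpy as np

    def se_kernel(A, B, lengthscale=1.0):
        # Squared-exponential kernel matrix between the rows of A and B.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)

    def nystrom_precond(X, Z, noise_var):
        # Returns r -> P^{-1} r for the low-rank preconditioner
        #   P = K_nm K_mm^{-1} K_mn + noise_var * I,
        # applied in O(n m^2) via the Woodbury identity (m = number of inducing inputs).
        Knm = se_kernel(X, Z)
        Kmm = se_kernel(Z, Z) + 1e-8 * np.eye(len(Z))    # jitter for stability
        inner = Kmm + Knm.T @ Knm / noise_var            # m x m system
        L = np.linalg.cholesky(inner)
        def solve(r):
            tmp = np.linalg.solve(L.T, np.linalg.solve(L, Knm.T @ r))
            return r / noise_var - Knm @ tmp / noise_var**2
        return solve

    # Usage with the PCG sketch from Review #1 (all names are hypothetical):
    rng = np.random.default_rng(0)
    X = rng.standard_normal((2000, 3))
    Z = X[:50]                                  # m = 50 inducing inputs
    precond_solve = nystrom_precond(X, Z, noise_var=0.1)
    K_hat = se_kernel(X, X) + 0.1 * np.eye(len(X))
    y = rng.standard_normal(len(X))
    # alpha_vec = pcg(lambda v: K_hat @ v, y, precond_solve)   # pcg from Review #1's sketch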