We thank all reviewers for their feedback and constructive comments. To begin, we wish to address an important concern common across the reviews: that our work is somewhat incremental and that our unique contributions are unclear. We respectfully disagree.

First, while it is true (and not surprising) that the idea of preconditioning kernel matrices (including in GPs) has been discussed previously, no existing investigation is reasonably thorough. Here we provide a comprehensive treatment of preconditioning, going beyond existing work, and demonstrate definitively that preconditioning accelerates learning in kernel machines without resorting to approximations. If our community ever hopes to use preconditioning for kernels, a thorough piece of work like this is required.

Second, we want to stress that our work offers the following elements that are entirely new:
- We consider spectral and SKI approximations, which are important modern techniques. This inclusion is especially significant for SKI, where we identify a few shortcomings which hinder its application to preconditioning;
- We show how to use preconditioning in non-conjugate GP models, again including an empirical evaluation against common alternatives;
- We use ADAGRAD within SGD, which is also novel to GPs and enables truly scalable GP regression;
- We make the first use of stochastic trace estimators and preconditioning within the Laplace approximation to compute stochastic gradients for non-conjugate GP models, which appears to be the way forward when dealing with large datasets (Flaxman et al. (2015) propose CG for computing the Laplace approximation only, with no stochastic gradients);
- Perhaps most importantly, we combine all of these novelties to obtain performance improvements that go far beyond related work, where only minor gains in speed were observed. Again, this finding is essential to move the field towards using preconditioning.

Taken together, we strongly believe this work to be a novel and valuable contribution to the literature. For completeness, we also detail here responses to the individual comments:

* Rev1
With reference to Srinivasan et al., although we duly included their approach in our comparison of preconditioners, that work focuses exclusively on the impact of regularized preconditioning on GP regression. Our results are also much more competitive, but we agree that this warrants more discussion.
- Line 725: We confirm that this statement is confusing and needs to be clarified. We proceed with the Nystrom preconditioner because it consistently performs better than the other preconditioners and is extremely easy to implement. We then chose O(n^{1/2}) points in order to preserve the overall complexity of the PCG method at O(n^2), so that it matches plain CG (a minimal sketch of this construction is included at the end of this response).
- Line 731: For ease of presentation, we decided to fix N_r = 4 and the ADAGRAD step-size to one for all experiments. The fast convergence of ADAGRAD without tuning these parameters suggests robustness to the choice of step-size and indicates that the variance of the stochastic gradient is low despite using only four vectors for trace estimation; we consider these particularly appealing features of our proposal (see the second sketch at the end of this response).
- Line 758: We acknowledge that a fairer comparison would have been obtained if all experiments were run in R. Although an R version of GPstuff is available, it is in fact just a wrapper for the Octave implementation.
Additionally:
+ we ensured that GPstuff uses the same multi-core linear algebra routines as our code;
+ it is evident from the plots that the approximation methods provided by GPstuff all plateau at a level of error considerably worse than that achieved by exact methods.
We reiterate that this analysis demonstrates that our proposal is competitive with (and may even be superior to) the approximation methods available in popular software packages. We also demonstrate that preconditioning makes stochastic gradient learning of GPs faster than learning with exact gradients; this is not the case without preconditioning.

* Rev2
As highlighted in the foreword, we believe the paper also demonstrates novel elements making it worthy of consideration. We also agree that the use of preconditioning in spatial statistics should be acknowledged in the paper.

* Rev3
We agree that our statements regarding the novelty of the Laplace approximation used in this paper could be misleading. We address these in the foreword to this rebuttal and will clarify them in the paper. Although the complexity of PCG is indeed no different from that of CG, we emphasize that a 2-fold or 5-fold (in some cases even an order-of-magnitude) improvement can be very substantial when plain CG takes very long to converge or when the dataset is large.

* Rev4
We thank the Reviewer for the positive feedback, insights and suggestions for future work.
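
For concreteness, we append two illustrative sketches referenced above. The first relates to the Line 725 point: it is a minimal NumPy rendering of a Nystrom-preconditioned CG solve of (K + sigma^2 I) x = y, not our actual implementation; the kernel choice, the helper names, and the toy data are ours for illustration only. With m = O(n^{1/2}) points, building the preconditioner costs O(n m^2) = O(n^2) and applying it via the Woodbury identity costs O(nm) per iteration, so the overall cost of PCG remains O(n^2), dominated by the exact kernel matrix-vector products, as in plain CG.

import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel; any positive-definite kernel works here.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def nystrom_preconditioner(X, Z, sigma2, lengthscale=1.0, jitter=1e-8):
    # Returns a function applying P^{-1}, where
    #   P = K_nm K_mm^{-1} K_mn + sigma2 * I
    # is the Nystrom approximation of K + sigma2 * I, inverted with Woodbury.
    Knm = rbf_kernel(X, Z, lengthscale)                         # n x m
    Kmm = rbf_kernel(Z, Z, lengthscale) + jitter * np.eye(len(Z))
    inner = Kmm + Knm.T @ Knm / sigma2                          # m x m
    L = np.linalg.cholesky(inner)
    def apply_inv(v):
        w = Knm.T @ v / sigma2
        w = np.linalg.solve(L.T, np.linalg.solve(L, w))
        return v / sigma2 - Knm @ w / sigma2
    return apply_inv

def pcg(matvec, b, apply_precond, tol=1e-6, maxiter=500):
    # Standard preconditioned conjugate gradient for symmetric PD systems.
    x = np.zeros_like(b)
    r = b - matvec(x)
    z = apply_precond(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxiter):
        Ap = matvec(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = apply_precond(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Toy usage: n inputs, m = ceil(sqrt(n)) points selected at random.
rng = np.random.default_rng(0)
n, sigma2 = 1000, 0.1
X = rng.standard_normal((n, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)
Z = X[rng.choice(n, int(np.ceil(np.sqrt(n))), replace=False)]
K = rbf_kernel(X, X) + sigma2 * np.eye(n)
Kinv_y = pcg(lambda v: K @ v, y, nystrom_preconditioner(X, Z, sigma2))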
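
The second sketch relates to the Line 731 point. It reuses pcg and the preconditioner above and is again purely illustrative: it shows a stochastic estimate of the gradient of the GP log-marginal likelihood for one hyperparameter, using N_r = 4 Rademacher probe vectors for the trace term, followed by an ADAGRAD update with step size one. The helper dK_matvec, standing for a matrix-vector product with the derivative of the kernel matrix with respect to that hyperparameter, is assumed to be provided.

def stochastic_grad(matvec, dK_matvec, y, apply_precond, n_r=4, rng=None):
    # d/dtheta log p(y) = 0.5 * y' Ky^{-1} dK Ky^{-1} y - 0.5 * tr(Ky^{-1} dK);
    # the trace is estimated with n_r Rademacher probes (Hutchinson estimator),
    # and every Ky^{-1} v solve uses the Nystrom-preconditioned CG above.
    if rng is None:
        rng = np.random.default_rng()
    Kinv_y = pcg(matvec, y, apply_precond)
    grad = 0.5 * Kinv_y @ dK_matvec(Kinv_y)
    for _ in range(n_r):
        r = rng.choice([-1.0, 1.0], size=y.shape[0])
        grad -= 0.5 * (r @ pcg(matvec, dK_matvec(r), apply_precond)) / n_r
    return grad

def adagrad_step(theta, grad, hist, step=1.0, eps=1e-8):
    # ADAGRAD: per-coordinate step sizes scaled by accumulated squared gradients;
    # gradient ascent on the log-marginal likelihood, with the step fixed to one.
    hist = hist + grad**2
    return theta + step * grad / (np.sqrt(hist) + eps), hist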