We would like to thank all reviewers for their valuable feedback, which will be taken into account when revising our paper.

Assigned_Reviewer_2

We would like to clarify that our work does not simply go from a diagonal to a block-diagonal noise structure; that is a special case of our work when B=0. When B>0, the noise structure becomes a dense matrix (eq. 1) with a B-block-banded inverse, which complicates our distributed implementation considerably; the existing GP literature on hyperparameter learning does not apply to such a dense noise structure. To answer some of your questions:

(b) Our general framework applies to an arbitrary choice of noise kernel and is in fact developed independently of the noise kernel in eq. 9; choosing eq. 9 reduces our framework to the special case of LMA with known hyperparameters (LMA does not perform hyperparameter learning).

(c) Our inducing inputs are heuristically selected from the training data. They can also be treated as optimization variables in eq. 11, but this increases our algorithm's runtime per iteration by a factor of |S|, which is inefficient even for small |S|. Nonetheless, our experiments show that optimizing only the kernel hyperparameters already yields better performance than that reported by the state of the art.

(d) Section 4.3: We would like to clarify that the work of Titsias (2009b) only relates SGPRs with a SPARSE covariance of the conditional prior (e.g., DTC, FITC) to variational approximations (VA) of FGPR with SPARSE noise structures. Our unification result with LMA generalizes this idea nontrivially in two ways: (1) existing SGPRs with a DENSE covariance of the conditional prior (e.g., LMA) can also be cast as VA of FGPR with DENSE noise structures, and (2) they remain amenable to parallelization for achieving scalability.

(e) Our proposed dPIC and dLMA do not see more data than dDTC in the sense that all of them have access to the same training dataset, use the same training/test partition, and do not adopt a transductive setting.

(f) Table 1: We follow the comparison protocol of Deisenroth & Ng (2015) by using the same partition sizes and averaging the empirical results of each method over multiple runs with respect to the best configuration of its optimization procedure. We believe quoting such best-reported RMSE results for comparison is fair.

Assigned_Reviewer_4

We would like to clarify, respectfully, that you may have misinterpreted the focus of our work, which does not require performing inference on a Gauss-Markov (GM) process. Different from the GM works you cite, we impose the Markov property on the noise structure instead of the latent function. Consequently, we only exploit the resulting B-block-banded structure of the noise covariance (Lemma 1) to perform scalable inference/hyperparameter learning on a GP (not GM) (Theorems 1 & 2). This explains why we do not cover the existing GM literature, which is only remotely relevant. We relate this B-block-banded structure of the noise covariance to that of a GM process only to provide an intuitive understanding, since they are similar in spirit. This may have been misleading, and we apologize for it.

Methodological innovation: An important result of our framework is the unification with LMA (Section 4.2): it strengthens the original idea of Titsias (2009b), which only casts SPGP/FITC (Snelson and Ghahramani, 2006) with a diagonal covariance of the conditional prior as a variational approximation (VA) of FGPR with INDEPENDENTLY distributed noises.
Specifically, our unification result with LMA generalizes this idea nontrivially in two ways: (1) existing SGPRs with a DENSE covariance of the conditional prior (e.g., LMA) can also be cast as VA of FGPR with DENSE noise structures, and (2) they remain amenable to parallelization for achieving scalability.

Practical significance: Our framework allows more complex SGPR models to be derived that are variationally optimal for large datasets with correlated noises, as recognized by the first reviewer. In fact, it is more general than LMA, as it enables GP practitioners to encode their domain knowledge of the observation noises by customizing the noise kernel (Section 2) while still preserving the scalability of hyperparameter learning via distributed variational inference (Sections 3 & 5).

Assigned_Reviewer_5

To answer some of your questions:

(a) Titsias' VI derivation has an additional trace term, which also appears in our eq. 4. Optimizing eq. 4 with respect to q gives eq. 6. Section 4 then details how the noise covariance S_DD can be set to recover the exact posteriors of the inducing outputs of existing SGPRs (including FITC). The trace term acts as a regularizer for optimizing the hyperparameters and is not included in the original definitions of these GP approximations.

(b) We do not tie the parameters of different variational variables together, in the sense that our variational distribution q(f_S) does not assume any parametric form in advance. Using the calculus of variations, we show (Appendices B & C) that the optimal q(f_S) is Gaussian.
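For concreteness, the following is an illustrative sketch of (a) and (b) in standard SGPR notation for the i.i.d.-noise special case only; it is not a reproduction of our eq. 4 or eq. 6, whose noise covariance is the dense S_DD. For (a), Titsias' collapsed bound reads

\log p(y_D) \;\ge\; \log \mathcal{N}\!\bigl(y_D \mid 0,\; K_{DS} K_{SS}^{-1} K_{SD} + \sigma^2 I\bigr) \;-\; \frac{1}{2\sigma^2}\,\mathrm{tr}\!\bigl(K_{DD} - K_{DS} K_{SS}^{-1} K_{SD}\bigr),

where the last term is the trace regularizer mentioned above. For (b), the free-form optimum over q(f_S) takes the form

q^\ast(f_S) \;\propto\; p(f_S)\,\exp\!\Bigl(\mathbb{E}_{p(f_D \mid f_S)}\bigl[\log p(y_D \mid f_D)\bigr]\Bigr),

which is Gaussian because the Gaussian log-likelihood is quadratic in f_D and remains quadratic in f_S after taking the conditional expectation.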