We thank the reviewers for their valuable feedback. Below we clarify the most important issues.

* Reviewer_1 * We will incorporate all the reviewer's suggestions concerning (1) the need for non-Gaussian likelihoods; (2) naming the optimization algorithms; and (3) the advantages of the UKS. With regard to time complexity, the main advantage of our methods is that they can be applied to problems where the original EGP and UGP are simply infeasible, due to their cubic scaling in the number of datapoints.
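To make this scaling argument concrete, below is a minimal sketch (Python/NumPy, not our EGP/UGP implementation) of approximate GP regression using m random Fourier features for an RBF kernel with a Gaussian likelihood; the function names, hyperparameters, and choice of kernel are illustrative assumptions only. The point is that all matrix factorizations involve m x m matrices, giving O(N m^2 + m^3) cost rather than the O(N^3) of exact GP inference.

```python
# Minimal sketch of the random-feature scaling argument (illustrative only).
import numpy as np

def make_rff(d, m, lengthscale, rng):
    """Sample a fixed random Fourier feature map approximating an RBF kernel."""
    W = rng.normal(scale=1.0 / lengthscale, size=(d, m))  # spectral frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)             # random phases
    return lambda X: np.sqrt(2.0 / m) * np.cos(X @ W + b)

def fit_predict(X, y, X_star, m=500, lengthscale=1.0, noise_var=0.1, seed=0):
    """Predictive mean via Bayesian linear regression on m random features."""
    rng = np.random.default_rng(seed)
    phi = make_rff(X.shape[1], m, lengthscale, rng)
    Phi = phi(X)                                        # N x m feature matrix
    A = Phi.T @ Phi / noise_var + np.eye(m)             # m x m, hence O(N m^2 + m^3)
    w_mean = np.linalg.solve(A, Phi.T @ y / noise_var)  # posterior mean of the weights
    return phi(X_star) @ w_mean                         # predictive mean at test inputs
```

For fixed m the cost is linear in the number of datapoints N, which is what allows the random-feature route to be applied where the original EGP/UGP cannot.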
* Reviewer_2 * (i) Related work: The work by Dai et al. (NIPS, 2014) is certainly relevant in its use of random features for kernel machines, although its main focus is on non-probabilistic approaches underpinned by convex optimization.

(ii) Variational inference: As stated in the introduction, one of our motivations is inversion problems, for which we may have a very large number of observations along with complex nonlinear forward models. As mentioned in sec. 7.4, we implemented MCMC algorithms for these problems but found them too slow to be applied to very large datasets. In contrast, our variational methods can be orders of magnitude faster than MCMC. Furthermore, we believe our work is essential for understanding the effectiveness of random-feature approximations to the original UGP/EGP (which are underpinned by variational methods), without introducing additional sources of variability and complexity by changing the inference algorithm. Nevertheless, further study and improvement of sampling algorithms (such as MCMC and ESS) is indeed a very promising research direction.

(iii) What we lose in our approximation: Our experiments in sec. 7.1 and 7.2 indicate that our performance is comparable to that of the original (EGP/UGP) framework and to exact GP inference (when possible), in terms of both mean predictions and predictive uncertainties. When compared to MCMC in our inversion experiments, our methods tend to provide slightly higher predictive variances throughout (e.g. around 4% of the mean for depth). From the perspective of the scientific application, such differences are not expected to have a high impact on decision making.

(iv) Connection to the Laplace approximation: Because our linearization algorithms are local and adaptive, the EGP and the EKS can be seen as refined (iterated) versions of the Laplace approximation, where the linearization gets updated at every iteration as a function of the variational parameters (sec. 2.1). The UGP and the UKS give us more elaborate and effective ways to propagate the first and second moments through nonlinearities. We will discuss this in the final version.

* Reviewer_3 * (i) Limited significance: We disagree. Our methods make the EGP/UGP scalable, which is a major practical contribution. Furthermore, to the best of our knowledge, random-feature approximations had not been thoroughly investigated and evaluated from a probabilistic perspective. Our approach provides a new research direction for sparse GP methods, which until now have relied heavily on inducing-point approximations.

(ii) Odd inference scheme: We would like to clarify a possible misunderstanding: closed-form posterior inference is *not possible*. The likelihood in our model is still given by Eq. (9), where g(.) is a *nonlinear* function. Linearization comes into play as part of our variational framework (not the model), where the linearization parameters get updated at every iteration. Although one could try to approximate the marginal likelihood through linearization directly, it is unclear how to choose the points about which to linearize. In contrast, this all falls out naturally in our variational framework. One can also run MCMC; see our response to Reviewer_2, point (ii).

(iii) Gaussian noise: A Gaussian likelihood model on top of a nonlinear function g(.) is simply a trick that allows us to deal with generic (continuous or discrete) observations. For discrete data, one can use a suitable g(.), such as a softmax function, together with infinite-precision noise. In practice, we initialize the noise variance to very low values and bound it during optimization, yielding proper distributions over the targets (see the short sketch at the end of this response). The resulting distributions were of high quality, as shown by the NLP in our experiments (see e.g. Figure 2 and Table 2).

(iv) Not compelling evidence in experiments: We disagree. As mentioned in lines 573 onwards, Steinberg and Bonilla (2014) showed that the EGP and the UGP outperformed hard-coded inference methods based on variational inference and EP. Since we use exactly the same data and experimental set-up, our results indicate that our methods are competitive with such state-of-the-art algorithms. These comparisons were omitted from the submitted paper due to space constraints but can be added back if necessary.
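For concreteness, the sketch below (referenced in point (iii) above) illustrates the Gaussian-likelihood-on-a-softmax trick for discrete targets, with the noise variance bounded below during optimization. It is not our actual code; the one-hot encoding, names, and the specific floor value are assumptions made for illustration only.

```python
# Illustrative sketch of the trick in point (iii): one-hot targets under a
# Gaussian likelihood placed on top of a softmax g(.), with the noise
# variance bounded below by a small floor during optimization.
import numpy as np

NOISE_VAR_FLOOR = 1e-4  # assumed lower bound enforced during optimization

def softmax(f):
    f = f - f.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(f)
    return e / e.sum(axis=-1, keepdims=True)

def gaussian_nll_on_softmax(f, y_onehot, log_noise_var):
    """Negative log-likelihood of one-hot targets under N(y | softmax(f), sigma^2 I)."""
    noise_var = max(np.exp(log_noise_var), NOISE_VAR_FLOOR)  # bound the variance below
    g = softmax(f)                                           # deterministic link g(.)
    sq_err = np.sum((y_onehot - g) ** 2)
    return 0.5 * sq_err / noise_var + 0.5 * y_onehot.size * np.log(2.0 * np.pi * noise_var)
```

As the variance floor shrinks, the likelihood approaches a deterministic softmax link ("infinite-precision noise"), while the bounded variance keeps the distribution over targets proper.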